top_block.unlock() deadlock when flow graph contains Python sync_block
|Assignee:||Johnathan Corgan||% Done:|
When a flow graph contains a sync_block written in Python, calling top_block.lock() and then top_block.unlock() always freezes.
Simple debugging shows the hang is caused by a thread calling sem_wait().
The attached file contains a simple repro.
#2 Updated by Johnathan Corgan almost 2 years ago
When unlock() is called, the flowgraph is stopped (in order to implement any recongfiguration done while locked). This issues an thread interrupt to all the flowgraph threads, and the wait() function joins all these threads. The problem is the worker thread that is handing the Python-based block is never returning from join(), and the call to wait() (and thus unlock() ) never finishes.
This may be related to the handling of the Python GIL (Global Interpreter Lock). When calling up into Python from C++, the Python GIL must be acquired before executing the Python work function, and released on exit. So it might be the case that the thread is in an uninterruptible state while doing this. Still investigating.
#3 Updated by Tom Rondeau almost 2 years ago
- Status changed from Assigned to Feedback
The bug occurs during the call to "stop" in gr::block_gateway_impl.cc, specifically when calling _handler->calleval(0). If you comment this line out, the above program will finish.
The calleval(0) line is calling into py_feval.h, gr::py_feval_ll::calleval and blocking on the line:
Apparently, PyGILState_Ensure() is never returning. I have not been able to figure out why. This same code works fine during a direct call to 'stop' but not through 'unlock'. I cannot see where the GIL is being acquired at any time before this call (and not being released), and there is no indication in the Python docs that this call should ever block like this.
This suggests that there is something different in the path through a call to tb.stop to this stage and a call to tb.unlock.
#5 Updated by Jiří Pinkava 5 months ago
- File 0001-runtime-WORKAROUND-for-deadlock-when-un-locking-in-p.patch added
- % Done changed from 0 to 10
The reason why it blocks is simple. .unlock() is called from Python, the block also runs in Python. Python cannot run 2 threads concurently. Thread running the block cannot finish becase is blocked by "main" thread calling join_all().
Simple WORKAROUND (not fix!) is bellow.
Proper fix must ensure all threads using Python finished before calling .unlock(), unfortunately this requires skill which I does not have yet. I hope this comment will help to fix this one soon.
#7 Updated by Kevin Reid 5 months ago
I just noticed this bug (since it was recently updated) but I've hit it myself before.
I admit I'm not familiar with the actual details of Python threading/GIL, but my understanding was that it's possible to release the Python GIL around specific FFI calls, and this is done some places in GR with "
If this is true, and the problem is correctly understood, then all that needs to be done is add
GR_PYTHON_BLOCKING_CODE (and the matching Python shims) for
unlock(), in addition to the operations it already applies to (start stop run wait), in
#8 Updated by Jiří Pinkava 5 months ago
Use of GR_PYTHON_BLOCKING_CODE in top_block.i really does the trick. Any GR code should be thread safe by design (have its-own locking mechanism and does not do anything Python related) and the Python part is simple wrapper not doing anything Pythonic.
There is many pitfalls in thread programing and in C-Python programing too. I did some checking but someone more experienced should see this.
Attached patch is intended for testing, I have also some simple QA code (based on test from previous post) and if this will be considered as a correct fix I will make pull request throught gitbub.
#9 Updated by Jiří Pinkava 5 months ago
With the previous patch I hit another related issue. The wait() used in restart() is the same wait() used for waiting for block completion after start() This causes early exit from python call of top_block.wait(). This behaviour was probably not observed because of consecutive calls of stop() wait() start(). This might be related to another issue i saw elsewhere described as random end of work when lock()/unlock() was used.
I will try to fix this latter, but previous patch alone does not provide complete solution. Probably some extra locking (checking for lock()/unlock() in wait should fix this, anny other sugestions?