top_block.unlock() deadlock when flow graph contains Python sync_block
Assignee: Johnathan Corgan
When a flow graph contains a sync_block written in Python, calling top_block.lock() and then top_block.unlock() always freezes.
Simple debugging shows the hang is caused by a thread calling sem_wait().
The attached file contains a simple repro.
#2 Updated by Johnathan Corgan 5 months ago
When unlock() is called, the flowgraph is stopped (in order to apply any reconfiguration done while locked). This issues a thread interrupt to all the flowgraph threads, and the wait() function then joins all of these threads. The problem is that the worker thread handling the Python-based block never exits, so the join() on it never returns, and the call to wait() (and thus unlock()) never finishes.
This may be related to the handling of the Python GIL (Global Interpreter Lock). When calling up into Python from C++, the Python GIL must be acquired before executing the Python work function, and released on exit. So it might be the case that the thread is in an uninterruptible state while doing this. Still investigating.
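To make the hypothesis above concrete, here is a small pure-Python model of how a join can hang when the joining thread keeps holding a lock that the worker needs before it can exit. This is only a sketch of one possible mechanism consistent with the symptoms, not a confirmed diagnosis. The names fake_gil, worker_stop_callback and unlock_model are hypothetical, and a threading.Lock stands in for the real GIL, which the interpreter manages itself.

```python
import threading

# Stand-in for the Python GIL (assumption: the real GIL is managed by the
# interpreter; a plain Lock is only used here to model the ordering).
fake_gil = threading.Lock()
finished = threading.Event()


def worker_stop_callback():
    # Models the C++ worker thread calling up into Python: acquire the
    # "GIL", run the Python handler, release it on exit.
    with fake_gil:
        finished.set()


def unlock_model(release_gil_while_joining, timeout=0.5):
    """Return True if the worker could be joined within `timeout` seconds."""
    finished.clear()
    fake_gil.acquire()  # the caller arrives from Python holding the "GIL"
    worker = threading.Thread(target=worker_stop_callback, daemon=True)
    worker.start()
    if release_gil_while_joining:
        fake_gil.release()      # give the worker a chance to run its handler
        worker.join(timeout)
        fake_gil.acquire()
    else:
        worker.join(timeout)    # worker is stuck waiting for the "GIL"
    joined = not worker.is_alive()
    fake_gil.release()          # let a stuck worker finish so the demo exits
    worker.join()
    return joined


if __name__ == "__main__":
    print("join while holding the lock succeeded:", unlock_model(False))
    print("join after releasing the lock succeeded:", unlock_model(True))
```

In this model the join only completes when the caller gives up the lock before waiting. Whether the real unlock() path holds the GIL while it joins the worker threads is exactly what still needs to be checked.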
#3 Updated by Tom Rondeau 5 months ago
- Status changed from Assigned to Feedback
The bug occurs during the call to "stop" in gr::block_gateway_impl.cc, specifically when calling _handler->calleval(0). If you comment this line out, the above program will finish.
The calleval(0) line calls into gr::py_feval_ll::calleval in py_feval.h and blocks at the call to PyGILState_Ensure().
Apparently, PyGILState_Ensure() never returns. I have not been able to figure out why. This same code works fine during a direct call to 'stop' but not through 'unlock'. I cannot see anywhere that the GIL is acquired before this call without being released, and there is no indication in the Python docs that this call should ever block like this.
This suggests that something differs between the path from a call to tb.stop to this point and the path from a call to tb.unlock.