Bug #594

top_block.unlock() deadlock when flow graph contains Python sync_block

Added by Joshua Lackey almost 2 years ago. Updated 5 months ago.

Status:AssignedStart date:09/20/2013
Priority:NormalDue date:
Assignee:Johnathan Corgan% Done:

10%

Category:gnuradio-runtime
Target version:release-3.8.0
Resolution:

Description

When a flow graph contains a sync_block written in Python, calling top_block.lock() and then top_block.unlock() always freezes.

Simple debugging shows the hang is caused by a thread calling sem_wait().

The attached file contains a simple repro.

unconnect_test.py Magnifier (1.16 KB) Joshua Lackey, 09/20/2013 09:12 pm

0001-runtime-WORKAROUND-for-deadlock-when-un-locking-in-p.patch Magnifier - workaround for this bug (1023 Bytes) Jiří Pinkava, 03/12/2015 08:21 am

0002-runtime-fix-deadlock-when-calling-.unlock-from-Pytho.patch Magnifier (2.57 KB) Jiří Pinkava, 03/15/2015 02:54 pm

History

#1 Updated by Johnathan Corgan almost 2 years ago

  • Category set to gnuradio-runtime
  • Status changed from New to Assigned
  • Assignee set to Johnathan Corgan
  • Priority changed from Normal to High

#2 Updated by Johnathan Corgan almost 2 years ago

When unlock() is called, the flowgraph is stopped (in order to implement any recongfiguration done while locked). This issues an thread interrupt to all the flowgraph threads, and the wait() function joins all these threads. The problem is the worker thread that is handing the Python-based block is never returning from join(), and the call to wait() (and thus unlock() ) never finishes.

This may be related to the handling of the Python GIL (Global Interpreter Lock). When calling up into Python from C++, the Python GIL must be acquired before executing the Python work function, and released on exit. So it might be the case that the thread is in an uninterruptible state while doing this. Still investigating.

#3 Updated by Tom Rondeau almost 2 years ago

  • Status changed from Assigned to Feedback

The bug occurs during the call to "stop" in gr::block_gateway_impl.cc, specifically when calling _handler->calleval(0). If you comment this line out, the above program will finish.

The calleval(0) line is calling into py_feval.h, gr::py_feval_ll::calleval and blocking on the line:

ensure_py_gil_state _lock;

Apparently, PyGILState_Ensure() is never returning. I have not been able to figure out why. This same code works fine during a direct call to 'stop' but not through 'unlock'. I cannot see where the GIL is being acquired at any time before this call (and not being released), and there is no indication in the Python docs that this call should ever block like this.

This suggests that there is something different in the path through a call to tb.stop to this stage and a call to tb.unlock.

#4 Updated by Johnathan Corgan 10 months ago

  • Status changed from Feedback to Assigned
  • Priority changed from High to Normal
  • Target version set to release-3.8.0

#5 Updated by Jiří Pinkava 5 months ago

The reason why it blocks is simple. .unlock() is called from Python, the block also runs in Python. Python cannot run 2 threads concurently. Thread running the block cannot finish becase is blocked by "main" thread calling join_all().

Simple WORKAROUND (not fix!) is bellow.

Proper fix must ensure all threads using Python finished before calling .unlock(), unfortunately this requires skill which I does not have yet. I hope this comment will help to fix this one soon.

#6 Updated by Jiří Pinkava 5 months ago

I forgot to mention I added extra sleep before .unlock() in test script to ensure all threads finish. Withought sleep might not work well or exhibit random failures (because it is workaround not fix).

#7 Updated by Kevin Reid 5 months ago

I just noticed this bug (since it was recently updated) but I've hit it myself before.

I admit I'm not familiar with the actual details of Python threading/GIL, but my understanding was that it's possible to release the Python GIL around specific FFI calls, and this is done some places in GR with "GR_PYTHON_BLOCKING_CODE".

If this is true, and the problem is correctly understood, then all that needs to be done is add GR_PYTHON_BLOCKING_CODE (and the matching Python shims) for lock() and unlock(), in addition to the operations it already applies to (start stop run wait), in gnuradio-runtime/swig/top_block.i and gnuradio-runtime/python/gnuradio/gr/top_block.py.

#8 Updated by Jiří Pinkava 5 months ago

Use of GR_PYTHON_BLOCKING_CODE in top_block.i really does the trick. Any GR code should be thread safe by design (have its-own locking mechanism and does not do anything Python related) and the Python part is simple wrapper not doing anything Pythonic.

There is many pitfalls in thread programing and in C-Python programing too. I did some checking but someone more experienced should see this.

Attached patch is intended for testing, I have also some simple QA code (based on test from previous post) and if this will be considered as a correct fix I will make pull request throught gitbub.

#9 Updated by Jiří Pinkava 5 months ago

With the previous patch I hit another related issue. The wait() used in restart() is the same wait() used for waiting for block completion after start() This causes early exit from python call of top_block.wait(). This behaviour was probably not observed because of consecutive calls of stop() wait() start(). This might be related to another issue i saw elsewhere described as random end of work when lock()/unlock() was used.

I will try to fix this latter, but previous patch alone does not provide complete solution. Probably some extra locking (checking for lock()/unlock() in wait should fix this, anny other sugestions?

Also available in: Atom PDF