Re: kernel bug: futex_wait hang

From: paul
Date: Wed Mar 23 2005 - 08:15:27 EST


> Paul is on vacation for a week so I suspect this will have to wait for
> his return. But he's been right about similar issues in the past so I'm
> inclined to believe him.
>
> In the meantime if anyone cares to investigate, the problem is trivial
> to reproduce. All you need is JACK, XMMS, xmms-jack and any 2.6 kernel.

fortunately (or perhaps not), by the powers of webmail, here i am to just
mention a couple more details for anyone who cares to think about this.

the hang occurs during an attempted thread cancel+join. we know from
strace that one thread calls tgkill() on the other. the other thread is
blocked in a poll call on a FIFO. after tgkill, the first thread enters a
futex wait, apparently waiting for the thread ID of the cancelled thread
to appear at some location (just a guess based on the info from strace).
the wait never returns, and so the first thread ends up hung in
pthread_join(). there are no user-defined mutexes or condvars involved.

what is rather odd is that when we have checked for the persistence (or
otherwise) of the cancelled thread, it appears as if several other threads
have died, not just the cancelled one. the primary tester was unclear if
this was to be expected or not (chris morgan, the author of xmms-jack),
and i cannot say definitively whether or not we know for certain that the
cancelled thread no longer exists.

as lee mentioned, i am on vacation right now, using some xp machine at a
relative's house, so i can't test anything till i return.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/