On Mon, 8 Jun 2009, James Bottomley wrote:
> The root cause is a reordering of the devices caused by the async code.
That's NULL information.
OF COURSE the root cause is the async code. We know that. We're looking for the specifics.
In particular, before that commit, at most you will wait for too _much_. In other words, it's a "good" wait.
Your commit caused it to wait for less, and that then showed a bug. Not all that surprising - it's now not waiting enough.
You tried to avoid a deadlock situation of waiting for too much, but you avoided the deadlock by now waiting for too little.
I also think that your code is simply buggy. As far as I can tell, in the case of having both running and pending events, you'll always pick the pending cookie. But it's the _running_ cookie that has the lower event number, isn't it?
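To make the cookie question concrete, here is a minimal sketch of what "pick the lowest cookie" could look like when both running and pending events exist. This is not the kernel's code and not the patch under discussion: `cookie_t`, `lowest_in_progress()`, the array arguments, and the `next_cookie` sentinel are all hypothetical names for illustration, and it assumes cookies are handed out in increasing order with each array sorted ascending.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the async cookie type. */
typedef unsigned long long cookie_t;

/*
 * Assumed model: cookies are allocated in increasing order, and both
 * arrays are sorted ascending, so element 0 is the oldest entry of
 * its kind. Returns the lowest cookie still in progress, or
 * next_cookie if nothing is running or pending.
 */
cookie_t lowest_in_progress(const cookie_t *running, size_t n_running,
                            const cookie_t *pending, size_t n_pending,
                            cookie_t next_cookie)
{
    cookie_t lowest = next_cookie;  /* "nothing in progress" sentinel */

    assert(n_running == 0 || running != NULL);
    assert(n_pending == 0 || pending != NULL);

    /* Consider BOTH lists; never let pending shadow running. */
    if (n_running > 0 && running[0] < lowest)
        lowest = running[0];
    if (n_pending > 0 && pending[0] < lowest)
        lowest = pending[0];

    return lowest;
}
```

With running cookies {3, 5} and pending cookies {7, 9}, this sketch returns 3. A version that looked only at the pending list would return 7, and a waiter synchronizing against that value would stop waiting too early.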
I dunno. It all looks very fishy to me.