Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.43-00

From: Steven Rostedt
Date: Sat Apr 02 2005 - 19:11:44 EST


On Sat, 2005-04-02 at 17:09 -0500, Steven Rostedt wrote:
> On Sat, 2005-04-02 at 15:44 -0500, Steven Rostedt wrote:
> > On Sat, 2005-04-02 at 22:35 +0200, Ingo Molnar wrote:

> I tried the 2.6.12-rc1-RT-V0.7.43-06 kernel again, and I still have the
> serial, to do sysrq. The console is using X which locks (even all the
> ctrl-alt-X functions) and the ssh session that I run the test stops
> after the processes try to grab the locks. It doesn't reply to ping. But
> the sysrq from the serial shows constantly:
>
> s test3_rt: 2269 [cf304690, 89] (not blocked)
> s test3_rt: 2270 [cf304050, 88] (not blocked)
> s test3_rt: 2271 [cef86cb0, 87] (not blocked)
> s test3_rt: 2272 [cef86670, 86] (not blocked)
> R test3_rt: 2273 [cf28a050, 85] (not blocked)
>
>
> So it seems that it's in a deadlock somewhere. Since the 43-06 gets much
> further, this seems to be another problem. I'm not going to look at this
> problem anymore, since it doesn't show up in the lastest.
>
> I'll run a few more tests to see if I can narrow things down on 43-06.

Once again I spoke too soon. I got the same lockup with 43-06, it just
took a little more to do so. All I did differently was to put the test
into a while loop on the command line:

while : ; do test3_rs ; done

On the second iteration, it locked up in the same place as 36-02. But
this is a different issue than what I showed with a NFS mounted system.
That was a BUG where as this has something in a spinning loop somewhere.
The attached is the gzipped minicom capture. After the system locked
up, I was still able to get the serial input and do a few task dumps. I
did several and it clearly shows that the system is locked. What you
constantly see is:

R test3_rt: 2263 [cbd6b9b0, 116] (not blocked)
S test3_rt: 2264 [cc06d8d0, 89] (not blocked)
S test3_rt: 2265 [cc06c090, 88] (not blocked)
R test3_rt: 2266 [cc06ccb0, 87] (not blocked)
D test3_rt: 2267 [cc1286c0, 86] blocked on: [ce127650] {(struct
semaphore *)(&inode->i_sem)}
.. held by: test3_rt: 2268 [ce822760, 85]
... acquired at: vfs_readdir+0x53/0xa0
D test3_rt: 2268 [ce822760, 85] (not blocked)


Process 2263 started the run and has just released the semaphore, it was
preempted before getting to the wait and although still running, it is
not really part of the picture.

Process 2267 is blocked on 2268 which is sleeping at
fs/jbd/transaction.c:636 which is in do_get_write_access waiting (I
believe) for a transaction to finish. Probably for kjournald to run,
since it is also in the running state.

But the problem seems to be with 2266 where as it's trace shows:

test3_rt [cc06ccb0]R C0371AD6 [on rq] 6824 2266 2263 2267 2265 (NOTLB)
cadebc48 00000046 c04eea30 c0371ad6 00000000 00000001 00000016 c0371aa1
c0102d00 00000000 00000000 11de7305 071e89c0 00000071 cc06ccb0 cc06cde8
ffffffff cc06ccb0 cfae68d4 cadebc5c c0371ad6 10000000 cabc71ac cfae68d4
Call Trace:
[<c0371ad6>] preempt_schedule_irq+0x46/0x70 (20)
[<c0102d00>] need_resched+0x20/0x24 (-7260)
---------------------------
| preempt count: 00000002 ]
| 2-level deep critical section nesting:
----------------------------------------
.. [<c03710df>] .... __schedule+0x4f/0x840
.....[<c0371ad6>] .. ( <= preempt_schedule_irq+0x46/0x70)
.. [<c0371179>] .... __schedule+0xe9/0x840
.....[<c0371ad6>] .. ( <= preempt_schedule_irq+0x46/0x70)

Which is not very helpful.

So it is probably stuck in some spinning "yield" loop, which was the
reason I was writing this test to begin with! It's most likely also
waiting for kjournald to do some work, and is starving it in a schedule
or yield loop never actually going to sleep letting kjournald do the
real work.


-- Steve

Attachment: lockup.cap.gz
Description: GNU Zip compressed data