Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110

From: Justin Piszcz
Date: Mon Aug 15 2011 - 15:02:55 EST

On Mon, 15 Aug 2011, Hugh Dickins wrote:

On Mon, 15 Aug 2011, Justin Piszcz wrote:

What causes this(?) -- am I out of memory(?) or is this a kernel bug?

It would be a kernel bug to lock up even if you are out of memory.

This machine has 48GB of RAM and it's just a Linux router running some
gqview instances.

It does look like you're under memory pressure, but I don't see any OOM.

Is this something you've noticed just once, or does it happen repeatedly?

This has happened once before (I e-mailed LKML about it last weekend or
thereabouts, but nobody responded).

It is here: (down?)

Does it always hit somewhere in find_get_pages(), or does the loop span
wider than that?

Slightly different (from August 12):

[330509.718763] Call Trace:
[330509.718771] [<ffffffff81089e15>] ? pagevec_lookup+0x15/0x20
[330509.718776] [<ffffffff8108b905>] ? invalidate_mapping_pages+0x55/0x130
[330509.718784] [<ffffffff810d6835>] ? shrink_icache_memory+0x2c5/0x310
[330509.718788] [<ffffffff8108c254>] ? shrink_slab+0x104/0x170
[330509.718793] [<ffffffff8108eda2>] ? balance_pgdat+0x492/0x600
[330509.718798] [<ffffffff8108efbc>] ? kswapd+0xac/0x250
[330509.718803] [<ffffffff81050fd0>] ? abort_exclusive_wait+0xb0/0xb0
[330509.718807] [<ffffffff8108ef10>] ? balance_pgdat+0x600/0x600
[330509.718811] [<ffffffff8105082e>] ? kthread+0x7e/0x90
[330509.718818] [<ffffffff815b4e14>] ? kernel_thread_helper+0x4/0x10
[330509.718822] [<ffffffff810507b0>] ? kthread_worker_fn+0x120/0x120
[330509.718825] [<ffffffff815b4e10>] ? gs_change+0xb/0xb

The first time it happened was when running a lot of I/O
(dumps and streams/backups over SSH).

I'm answering out of interest in find_get_pages(): which does contain
a number of gotos which could result in endless looping; except that
they're all supposed to be for very transitory conditions which a
second glance at the RCU-protected tree should correct.

I am using 'server' for the workload type, not 'low latency' -- which exposes
more bugs/problems.

But if a radix_tree node got corrupted, then yes, it could loop forever.
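
The retry pattern described above can be illustrated with a small user-space
sketch. This is not the kernel's find_get_pages(); it is a simplified model,
and every name in it (struct page with only a refcount, RETRY_ENTRY,
try_get_ref(), lookup_pages(), MAX_RETRIES) is hypothetical. It only shows why
a "look again" goto is harmless when the condition is transient and spins
forever when the condition is permanent, as it would be with a corrupted node.
The retry cap exists solely so the sketch terminates; the loop being modelled
has no such bound.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page: only a reference count. */
struct page { int refcount; };

#define NSLOTS      4
#define MAX_RETRIES 1000  /* sketch-only bound so the demo terminates */

/* Hypothetical marker meaning "slot is being modified, look again". */
static struct page retry_marker;
#define RETRY_ENTRY (&retry_marker)

/* Take a reference unless the count is already zero (page going away). */
static int try_get_ref(struct page *p)
{
    if (p->refcount == 0)
        return 0;
    p->refcount++;
    return 1;
}

/*
 * Collect up to nr pages from slots[] into out[].
 * Returns the number collected, or -1 if some slot never left its
 * "transient" state within MAX_RETRIES attempts.
 */
static int lookup_pages(struct page **slots, int nslots,
                        struct page **out, int nr)
{
    int found = 0;

    assert(slots != NULL && out != NULL);

    for (int i = 0; i < nslots && found < nr; i++) {
        int tries = 0;
        struct page *p;
repeat:
        if (++tries > MAX_RETRIES)
            return -1;            /* an unbounded loop would spin here */
        p = slots[i];
        if (p == NULL)
            continue;             /* empty slot, nothing to collect */
        if (p == RETRY_ENTRY)
            goto repeat;          /* expected to clear on a second look */
        if (!try_get_ref(p))
            goto repeat;          /* page being freed, look again */
        if (p != slots[i]) {      /* slot changed under us: drop and retry */
            p->refcount--;
            goto repeat;
        }
        out[found++] = p;
    }
    return found;
}
```

With a slot array that permanently holds RETRY_ENTRY, or a page whose count is
stuck at zero, lookup_pages() returns -1 here. Without the cap the same input
would never leave the loop, which is the kind of behaviour a soft lockup report
points at.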

If it's repeatable, please try again with slab poisoning (and frame
pointers) enabled?

I will enable frame pointers and wait for the next error/problem and report
back if/when it recurs, thanks!
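
For reference, a sketch of the kernel configuration that request usually maps
to. Which slab allocator this machine's kernel was built with is not stated in
the thread, so both common cases are listed as assumptions; only one of them
applies to a given build.

```
# Frame pointers, for more reliable call traces
CONFIG_FRAME_POINTER=y

# Assumption: kernel built with the SLUB allocator
CONFIG_SLUB_DEBUG=y
# then enable poisoning at boot with the kernel parameter:
#   slub_debug=P

# Assumption: kernel built with the SLAB allocator instead
CONFIG_DEBUG_SLAB=y
```

Poisoning fills freed objects with a known byte pattern, so a use-after-free
that scribbles on a radix_tree node is more likely to be caught or to show up
recognisably in a dump.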


