I recently found and fixed a misterious hang that could hang the
kernel with tasks in D state with the disk idle. We could reproduce
very long hangs (several hours) with tasks in D state with reiserfs
after some hour of some intensive load running (not cerberus, see below
why), but it wasn't a reiserfs specific problem, reiserfs just happens
to take the lock_super while doing the fsync_buffer_list and this leads
kupdate to get stuck in the lock_super waiting a wait_on_buffer to
return, so with kupdate stuck the background run_task_queue() doesn't
run every 5 seconds anymore and in turn if there's a missing unplug
somewhere it will lead to an hanging machine in wait_on_buffer for
indefinite/infinite time (kind of deadlock, unless somebody can trigger
a readpage or something that unplugs the disk queue, usually logging in
with ssh fixed the problem). Increasing singificantly the kupdate
interval would potentially lead to the same indefinite hang on a ext2
while running fsync.
For some time I didn't even consider the possibility of wait_on_buffer
being the problem, there are over 700 patches applied in the kernel
where we could reproduce so for some time I was looking at everything
but the buggy place. After ruling out various other bits
(scheduler fixes/compiler/fsync corruption fixes etc..) I actually
realized the problem is a longstanding design locking problem in
wait_on_buffer (then I found the same problem in wait_on_page and
yesterday Chris found a similar problem in get_request_wait too, the
get_request_wait is not exactly the same issue, but it's quite similar
and it could lead to exactly the same hangs).
Probably nobody noticed this yet because normally with ext2/ext3 these
hangs happens in all the machines but they are resolved after a disk
idle time of 2.5 seconds in mean and they happens once in a while,
normally people would see mean delays of 2.5 sec caming from the
datacenter and they would think it's a normal I/O congestion or the
elevator or something during the fsync on ext2. Furthmore as Chris
pointed out with very intensive load bdflush would be usually running in
the background, this race can trigger only with mid writepage loads when
bdflush/pdflush has no reason to run.
We also have the lowlatency fixes (they're fixes) inside submit_bh so we
probably opened a larger window for the race to trigger than mainline.
Chris also double checked the bug we were facing was really this race by
introducing a delay in submit_bh to make it reproducible in a reasonable
amount of time.
the race looks like this:
CPU0 CPU1
----------------- ------------------------
reiserfs_writepage
lock_buffer()
fsync_buffers_list() under lock_super()
wait_on_buffer()
run_task_queue(&tq_disk) -> noop
schedule() <- hang with lock_super acquired
submit_bh()
/* don't unplug here */
This example is reiserfs specific but any wait_on_buffer can definitely
hang indefinitely against any concurrent ll_rw_block or submit_bh (even
on UP since submit_bh is a blocking operation and in particular with the
lowlat fixes). There's no big kernel lock anymore serializing
wait_on_buffer/ll_rw_block. This design locking problem was introduced
with the removal of the BKL from wait_on_buffer/wait_on_page/ll_rw_blk
during one of the 2.3 scalability efforts. So any 2.4 kernel out there
is affected by this race.
in short the problem here is that the wait_on_"something" has no clue if
the locked "something" is just inserted in the I/O queue and visible to the
device, so it has no clue if the run_task_queue may become a noop or if
it may affect the "something". And the writer side that executes the
submit_bh won't unplug the queue rightfully (to allow merging and boost
performance until somebody actually asks for the I/O completed ASAP).
I fixed the race by simply doing a wakeup of any waiter after any
submit_bh/submit_bio that left stuff pending in the I/O queue. So if the
race triggers now the wait_on_something will get a wakeup and in turn it
will trigger the unplug again closing the window for the race. This is
fixing the problem in practice and it seems the best fix at least for
2.4, and I don't see any potential performance regression, so I don't
feel the need of anything more complicated than this, the race triggers
once every several hours only under some special workload. You may try
to avoid loading the waitqueue head cacheline during submit_bh, but at
least for 2.4 I don't think it worth the complexity and it's an I/O path
anyways so it's certainly not critical.
The problem noticed by Chris with get_request_wait is similar, the
unplugging was run before adding the task to the waitqueue, so the
unplug could free the requests and somebody else could allocate the
freed requests without unplugging the queue afterwards. I fixed it simply
by unplugging the queue just before schedule(). That was really a more
genuine bug than the other subtle ones. With get_request_wait the fix is
so simple because we deal with entities that are guaranteed to be
affected by the queue unplug always (they're the I/O requests), this
isn't the case with the locked bh or locked/writeback pages, that was
infact the wrong assumption that allowed the other races to trigger in the
first place.
while doing these fixes I noticed various other bits:
1) in general the blk_run_queues()/run_task_queue()/sync_page should
always run just before schedule(), it's pointless to unplug anything if
we don't run schedule (ultramicrooptimization)
2) in 2.4 the run_task_queue() in wait_on_buffer could have its
TQ_ACTIVE executed inside the add_wait_queue critical section since
spin_unlock has inclusive semantics (literally speculative reads can
pass the spin_unlock even on x86/x86-64)
3) the __blk_put_request/blkdev_release_request was increasing the count
and reading the waitqueue contents without even a barrier() for the asm
layer, it needs an smp_mb() in between to serialize against
get_request_wait that runs locklessy
I did two patches one for 2.4.20rc1 and one for 2.5.47 (sorry no bk
tree here, I will try to make bitdropper.py available shortly so I can
access the new info encoded in proprietary form too) that will address
all these races. However 2.5 should be analysed further, I didn't
search too hard for all the possible places that could have this race in
2.5, I searched hard in 2.4 and I only addressed all the same problems
in 2.5. The only bit I think could be problematic in 2.4 is the nfs
specualtive I/O, the reason nfs is implementing a sync_page in the first
place. That may have the same race, I heard infact of some report with
nfs hung in wait_on_page, and I wonder if this could explain it too.
I assume the fs maintainers will take care of checking their fs for
missing wakeups of page waiters in 2.4 and 2.5 now that the problem is
well known.
http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.20rc1/fix-pausing-1
http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.5/2.5.47/fix-pausing-1
they're attached to this email too since they're tiny.
Andrea
This archive was generated by hypermail 2b29 : Fri Nov 15 2002 - 22:00:24 EST