the attached patch fixes the ext2-fs filesystem corruption that initially
showed up during stress-tests running on Red Hat's internal testsystems,
and which bug kept us busy for a long time. We suspect that a number of
filesystem corruption bugs reported to l-k were caused by this bug too.
after lots of bug-hunting and bug-finding, the main bug turned out to be
this one: a block that might still (or already) be allocated by ext2fs was
bforgot()ten by ext2_get_block(). The (incorrect) assumption of the code
was that this rare case can only happen in case of a parallel truncate()
to this inode. In reality there might be enough delay to this particular
process in which window this same block might have gotten reallocated and
dirtied again. Thus the bforget() discards possibly valid metadata, most
commonly indirect blocks, causing filesystem corruption.
via customized Cerberus testruns the bug used to trigger within 20 minutes
on SMP systems, and within a couple of hours on UP systems. (initially it
was very hard to trigger the bug, it tooks days of nonstop high-load tests
for the corruption to trigger.) Alex Romosan's report to l-k looks
suspiciously similar to the failures we've been getting.
The corruption has been introduced in 2.4.0-test6. With this patch applied
we've been unable to reproduce any kind of filesystem corruption under the
most extreme 'enterprise' loads. The attached patch applies to 2.4.4-pre1
cleanly and compiles, boots just fine. There is a sister-bug in minix-fs
as well, the patch fixes that bug too. The patch also includes two
inode-dirtying fixes, which bugs never caused any filesystem corruptions
but were incorrect nevertheless.
Ingo
This archive was generated by hypermail 2b29 : Sun Apr 15 2001 - 21:00:13 EST