io-stalls again (was "Re: Bug in drivers/block/ll_rw_blk.c")

From: Yoav Weiss
Date: Fri Aug 22 2003 - 13:33:34 EST


On Fri, 22 Aug 2003, Livio Baldini Soares wrote:

[...snip...]

[ for people who jump in, my original description of the problem can be
found here: http://lkml.org/lkml/2003/8/19/259 ]

> From this description it seems that you are hitting a bug which was
> discussed to death here on the list. Here's a thread with 143 messages for
> you:
>
> http://marc.theaimsgroup.com/?t=105400721000001&r=5&w=2
>
> And here are the threads in which a solution was discussed:
>
> http://marc.theaimsgroup.com/?t=105519528200001&r=1&w=2
> http://marc.theaimsgroup.com/?t=105769525800005&r=3&w=2
>
> Notice, however, that the patch Chris, Andrea, Jens and others made for
> this problem is _already_ included in 2.4 (so, yes, 2.4.22-rc2 has the
> fix).
>

Yes, I guess it's related to the same problem. I think the patch actually
broke something. I see that it was introduced in 2.4.22-pre3, and that's
exactly where the problem became much worse. I switched back to
2.4.22-pre2 and it mostly works. It still stalls under extreme conditions,
but not as easily as with later kernels. With pre7 and rc2, which I tested
recently, it happens very quickly under heavy load.

> So, you are probably hitting the same bug, which was not fixed 100%. If
> you think that your test is very easily reproducible and can shed more
> light on this problem, perhaps you should write to Chris, Andrea and Jens
> (with Cc: to the list), and show them the test. I don't know if they would
> be willing to spend more time on this issue, especially with 2.6 around the
> corner...
>

Not only did the patch fail to fix it 100%, but it actually made it a lot
worse in my case.

It's easily reproducible once you have a big cloop image in place, but I
guess that doesn't qualify as easy reproduction for busy kernel
developers. I hope someone will still take the time to look into it.

If someone has speculations/suggestions but no time to test them, send
them to me and I'll run the tests and post the results.

The easiest way to trigger it with recent kernels is to download a large
cloop image, such as the large file called KNOPPIX in the Knoppix ISO
image, attach it, and create a lot of load on it.

If someone wishes to try this, here's how I reproduce it:
* Download the latest ISO from http://knoppix.net/get.php
* mount -o loop the image and extract the file KNOPPIX/KNOPPIX
* Download cloop from
http://developer.linuxtag.net/knoppix/sources/cloop_1.0-2.tar.gz
* extract cloop, make, insmod cloop.o, mknod /dev/cloop b 200 0
* losetup /dev/cloop /path/to/KNOPPIX && mount /dev/cloop /mnt
* tar cf - /mnt >/dev/null
* while tar is running, access some random files in /mnt.
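
The last two steps can be scripted so the load is the same on every run.
What follows is only a rough sketch, not the exact procedure used for the
results in this mail: the function names, the batch size and the seeding
scheme are all made up for illustration, and it assumes file names
without embedded newlines.

```shell
#!/bin/sh
# Load generator sketch: one sequential reader (tar) plus random reads.
# All function names and defaults here are illustrative assumptions.

# Print up to $2 randomly chosen regular files under directory $1.
# $3 is an optional seed; without it awk seeds from the clock.
pick_random_files() {
    find "$1" -type f 2>/dev/null |
        awk -v seed="${3:-0}" '
            BEGIN { if (seed) srand(seed); else srand() }
            { printf "%d\t%s\n", int(rand() * 1000000), $0 }' |
        sort -n | cut -f2- | head -n "$2"
}

# Read every file named on stdin, discard the data, and print how many
# files were read successfully.
read_files() {
    count=0
    while IFS= read -r f; do
        cat "$f" >/dev/null 2>&1 && count=$((count + 1))
    done
    echo "$count"
}

# Start tar in the background (same command as in the list above), then
# keep reading random batches of files for as long as tar is alive.
generate_load() {
    mnt=$1
    batch=${2:-20}
    seed=$$
    tar cf - "$mnt" >/dev/null 2>&1 &
    tar_pid=$!
    while kill -0 "$tar_pid" 2>/dev/null; do
        pick_random_files "$mnt" "$batch" "$seed" | read_files >/dev/null
        seed=$((seed + 1))
    done
    wait "$tar_pid"
}
```

To use it, source the file and call "generate_load /mnt" once the cloop
image is mounted. If the stall hits, the loop simply stops making progress,
which is the symptom described in this mail.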

With 2.4.22-rc2 the above will stall in less than a minute and will remain
stalled until another process accesses other files in the filesystem
storing KNOPPIX.
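
To tell a real stall from a slow moment, one option is to watch the state
of the tar process in /proc. This is a heuristic sketch only: it assumes
that a stalled reader sits in uninterruptible sleep (state D) for many
samples in a row, which I have not verified for every case, and the
function names and default sample counts are invented for the example.

```shell
#!/bin/sh
# Stall detector sketch. Assumption: a stalled process stays in state D
# (uninterruptible sleep), while a healthy one only passes through it.

# Print the one-letter state of PID $1 (R, S, D, ...) from /proc.
# Prints nothing if the process does not exist.
proc_state() {
    awk '/^State:/ { print $2; exit }' "/proc/$1/status" 2>/dev/null
}

# Return 0 if PID $1 is seen in state D for $2 consecutive samples taken
# $3 seconds apart. Return 1 as soon as any other state is observed or
# the process has gone away.
looks_stalled() {
    pid=$1
    samples=${2:-10}
    interval=${3:-1}
    i=0
    while [ "$i" -lt "$samples" ]; do
        [ "$(proc_state "$pid")" = "D" ] || return 1
        sleep "$interval"
        i=$((i + 1))
    done
    return 0
}
```

For example, "looks_stalled $tar_pid 30 1 && echo stalled" reports a stall
only after tar has been seen in state D for 30 samples in a row. A process
doing heavy but healthy I/O is also in state D much of the time, so a low
sample count will give false positives.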

It may be possible to reproduce the same stall with loop.o, but it takes
much longer. This could be related to the fact that cloop.o is
single-threaded while loop.o has a separate reader thread. Could this
affect the problem?

Anyway, if someone has a suggested test/patch, post it and I'll post the
results. Hopefully we can nail this next-to-last bug :)

> best regards,
>
> --
> Livio B. Soares
>

Thanks,
Yoav Weiss
