On Tue, Apr 24 2007, Jens Axboe wrote:

> On Tue, Apr 24 2007, Roland Kuhn wrote:
> > Hi Jens!
> >
> > On 24 Apr 2007, at 11:18, Jens Axboe wrote:
> >
> > > On Tue, Apr 24 2007, Roland Kuhn wrote:
> > > > Hi Jens!
> > > >
> > > > We're using a custom-built fileserver (dual-core Athlon64, running
> > > > the x86_64 arch) with 22 disks in a RAID6. While resyncing /dev/md2
> > > > (9.1GB, ext3) after a hardware incident (a cable got pulled on one
> > > > disk), the machine would reliably oops while serving some large
> > > > files over NFSv3. The oops message scrolled partly off the screen,
> > > > but the IP was in cfq_dispatch_insert, so I tried your debug patch
> > > > from yesterday on top of 2.6.21-rc7. I used netconsole to capture
> > > > the output (which works nicely, thanks Matt!), and as usual the
> > > > condition triggered after about half a minute, this time with the
> > > > following printout instead of a crash (the machine still works
> > > > fine):
> > > >
> > > > cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the
> > > > issue to lkml@xxxxxxxxxxxxxxx
> > > > cfq: busy=1,drv=1,timer=0
> > > > cfq rr_list:
> > > > cfq busy_list:
> > > >   4272: sort=0,next=0000000000000000,q=0/1,a=2/0,d=0/1,f=221
> > > > cfq idle_list:
> > > > cfq cur_rr:
> > > > cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the
> > > > issue to lkml@xxxxxxxxxxxxxxx
> > > > cfq: busy=1,drv=1,timer=0
> > > > cfq rr_list:
> > > > cfq busy_list:
> > > >   4276: sort=0,next=0000000000000000,q=0/1,a=2/0,d=0/1,f=221
> > > > cfq idle_list:
> > > > cfq cur_rr:
> > > >
> > > > There was no backtrace, so the only thing I can tell is that in the
> > > > previous crashes some nfs threads were always involved; only once
> > > > did it happen inside an interrupt handler (with the "aieee" kind of
> > > > message).
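> > > >
> > > > (In case anyone wants to replicate the capture setup: it is just
> > > > the stock netconsole module feeding a UDP listener. A minimal
> > > > sketch - the addresses, ports and MAC below are made up, not our
> > > > real ones:
> > > >
> > > >   # on the oopsing server; the same string also works as a boot
> > > >   # parameter when netconsole is built in:
> > > >   modprobe netconsole \
> > > >     netconsole=6665@192.168.1.2/eth0,6666@192.168.1.3/00:11:22:33:44:55
> > > >
> > > >   # on the receiving host (192.168.1.3 here):
> > > >   netcat -u -l -p 6666 | tee netconsole.log
> > > >
> > > > Nothing else is needed on the receiving end.)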
> > > >
> > > > If you want me to try something else, don't hesitate to ask!
> > >
> > > Nifty, great that you can reproduce so quickly. I'll try a 3-drive
> > > raid6 here and see if read activity along with a resync will trigger
> > > anything. If that doesn't work for me, I'll provide you with a more
> > > extensive debug patch (if you don't mind).
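> > >
> > > (Something like the following throwaway setup should do - loop
> > > devices standing in for real disks, sizes and paths made up; note
> > > that md wants at least four devices for raid6:
> > >
> > >   for i in 0 1 2 3; do
> > >       dd if=/dev/zero of=/tmp/d$i bs=1M count=256
> > >       losetup /dev/loop$i /tmp/d$i
> > >   done
> > >   mdadm --create /dev/md9 --level=6 --raid-devices=4 \
> > >       /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
> > >   # the initial resync starts right away; later passes can be forced:
> > >   echo check > /sys/block/md9/md/sync_action
> > >
> > > and then reading from the array while it resyncs.)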
> >
> > Sure. You might want to include NFS file access in your tests, since
> > we've not triggered this when accessing the disks locally.
>
> BTW: How are you exporting the directory (what export options), and how
> is it mounted by the client(s)? What chunk size is your raid6 using?
> And what is the nature of the files on the raid (huge, small, ?), and
> what are the client(s) doing? Just approximately - I know these things
> can be hard or even impossible to specify.

The files are 100-400MB in size and the client is merging them into a
new file in the same directory using the ROOT library, which does in
essence alternating sequences of
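
For the record, the details asked for above can be collected with
standard tools - the commands below are generic, not a transcript from
our server:

  server# cat /etc/exports         # configured export options
  server# exportfs -v              # export options actually in effect
  server# cat /proc/mdstat         # shows the raid chunk size per array
  server# mdadm --detail /dev/md2  # same, plus layout details
  client# nfsstat -m               # NFS mounts with negotiated options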