Re: [NFSD OOPS] 2.6.23-rc1-git10
From: Neil Brown
Date: Thu Aug 02 2007 - 23:54:12 EST
On Thursday August 2, andrew@xxxxxxxxxxxxxxxxxx wrote:
> Hi,
>
> Got the following oopsen just now from 2.6.23-rc1-git10 on a 2
> processor Opteron with 2GB RAM. System is running 64bit Fedora Core 6.
>
> nfsd is exporting directories from an XFS filesystem on top of a
> software RAID 5 array comprising 3 x 250GB SATA disks, using the
> sata_sil driver.
>
> Amongst other things, users' (about 12) home directories are
> automounted from it.
>
> Some problems seemed to start when I was experimenting with the
> stripe_cache_size of the md array. People could no longer run some
> apps and were having general NFS issues. IIRC I had dropped the
> stripe_cache_size down from 8192 to 768, then I attempted to restart
> nfsd via /etc/rc.d/init.d/nfsd restart
>
> I then got the Oopses and the machine locked up, needing a power cycle.
>
>
> Aug 2 10:15:55 thor mountd[30048]: Caught signal 15, un-registering and exiting.
> Aug 2 10:15:55 thor nfsd[16390]: nfssvc: Setting version failed: errno 16 (Device or resource busy)
> Aug 2 10:15:56 thor kernel: lockd_down: lockd failed to exit, clearing pid
> Aug 2 10:15:56 thor kernel: nfsd: last server has exited
> Aug 2 10:15:56 thor kernel: nfsd: unexporting all filesystems
> Aug 2 10:15:56 thor kernel: ------------[ cut here ]------------
> Aug 2 10:15:56 thor kernel: kernel BUG at /home/andrew/src/linux-2.6/net/sunrpc/svc.c:488!
It seems you have found a race between shutting down and starting up
nfsd.
Your md array was probably running very slowly while you were playing
with stripe_cache_size, and that stretched things out long enough to
lose the race.
NFSD is shut down by something similar to
killall nfsd
which sends signals to all the threads, but doesn't wait for them to
exit.
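To make that concrete, here is a minimal userspace sketch of the
pattern (purely illustrative; the names are mine, not the real nfsd
code): every worker is signalled, but the caller returns without
waiting, so old workers can still be on their way out when a restart
begins.

  #include <pthread.h>
  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NTHREADS 4

  static pthread_t workers[NTHREADS];

  static void *worker(void *arg)
  {
      sigset_t set;
      int sig;

      (void)arg;
      sigemptyset(&set);
      sigaddset(&set, SIGUSR1);
      sigwait(&set, &sig);      /* block until told to exit */
      sleep(1);                 /* a slow exit path: cleanup, I/O ... */
      printf("worker exiting\n");
      return NULL;
  }

  static void shutdown_all(void)
  {
      int i;

      /* Like "killall nfsd": signal every thread ... */
      for (i = 0; i < NTHREADS; i++)
          pthread_kill(workers[i], SIGUSR1);
      /* ... but return without pthread_join(), so the old threads
       * may still be running when a restart begins. */
  }

  int main(void)
  {
      sigset_t set;
      int i;

      /* Block SIGUSR1 in every thread; workers collect it with
       * sigwait(). */
      sigemptyset(&set);
      sigaddset(&set, SIGUSR1);
      pthread_sigmask(SIG_BLOCK, &set, NULL);

      for (i = 0; i < NTHREADS; i++)
          pthread_create(&workers[i], NULL, worker, NULL);

      shutdown_all();
      printf("shutdown_all() returned; workers still running\n");
      sleep(2);                 /* give the stragglers time to finish */
      return 0;
  }

Built with gcc -pthread, the second printf reliably appears before
any "worker exiting" line: the shutdown has "completed" while the old
threads are still alive.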
The first line in the log is mountd exiting. The nfsd threads should
all have been signalled at the same time.
The second line comes from the nfsd program trying to start the nfsd
threads again, and getting an error because some of the old threads
were still running.
The third line is a similar issue with lockd.
The 4th and 5th lines show the last thread exiting. Only it isn't
really the last thread: by this stage some more nfsd threads had
started up. There was probably a backlog of requests, as things were
running slowly, so as soon as a new nfsd thread started it accepted a
connection and created a new temporary socket. This would have been
after the thread that believed it was the last one had closed what it
took to be the last temp socket.
That caused the BUG.
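The shape of that race, as an illustrative userspace sketch (the
variables stand in for the server's thread count and temp-socket
list; this is not the actual svc code): the exit-side check-and-
cleanup is not atomic with thread start-up.

  #include <assert.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static int nrthreads = 2;     /* old threads still exiting */
  static int ntempsocks = 1;    /* temp sockets still open */

  static void *old_thread(void *arg)
  {
      (void)arg;
      /* Unsynchronized: nothing stops a new thread from running
       * between the decrement and the cleanup. */
      if (--nrthreads == 0) {
          ntempsocks = 0;       /* "close" the last temp socket */
          usleep(1000);         /* widen the race window */
          /* Analogue of the BUG at svc.c:488: the "last" thread
           * finds a temp socket it never closed. */
          assert(ntempsocks == 0);
      }
      return NULL;
  }

  static void *new_thread(void *arg)
  {
      (void)arg;
      nrthreads++;              /* a restart registers a thread ... */
      ntempsocks++;             /* ... which accepts a connection */
      return NULL;
  }

  int main(void)
  {
      pthread_t a, b, c;

      pthread_create(&a, NULL, old_thread, NULL);
      pthread_create(&b, NULL, old_thread, NULL);
      pthread_create(&c, NULL, new_thread, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      pthread_join(c, NULL);
      printf("survived this run (the failure is timing-dependent)\n");
      return 0;
  }

Whether the assert fires depends entirely on scheduling, which is why
a machine slowed down by a struggling md array makes the window wide
enough to hit.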
As part of shutting down, the thread that thought it was last cleared
out the read-ahead info cache. When the new threads tried to use it,
they all suffered general protection faults.
So clearly we need some proper locking around thread start-up and
shutdown. We had previously relied on lock_kernel, but that isn't
really good enough for this: the BKL is dropped whenever its holder
sleeps, so it can't cover a shutdown path that blocks. I'll try to
figure out the best way to fix it. Meanwhile, I doubt the problem
will recur.
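The fix would presumably look something like the sketch below (again
illustrative and in userspace terms; svc_lock is a made-up name): one
mutex taken around both the exit-side check-and-cleanup and the
start-up registration, so a new thread can never slip in between "the
count reached zero" and "everything is torn down".

  #include <pthread.h>

  static pthread_mutex_t svc_lock = PTHREAD_MUTEX_INITIALIZER;
  static int nrthreads;
  static int ntempsocks;

  static void thread_exit_locked(void)
  {
      pthread_mutex_lock(&svc_lock);
      if (--nrthreads == 0)
          ntempsocks = 0;       /* close temp sockets, free caches */
      pthread_mutex_unlock(&svc_lock);
  }

  static void thread_start_locked(void)
  {
      pthread_mutex_lock(&svc_lock);
      nrthreads++;              /* register ... */
      ntempsocks++;             /* ... and accept, under the same lock */
      pthread_mutex_unlock(&svc_lock);
  }

  int main(void)
  {
      thread_start_locked();
      thread_exit_locked();
      return 0;
  }

Unlike lock_kernel, an ordinary mutex stays held across sleeps, so
the whole check-and-cleanup is atomic with respect to start-up.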
Thanks for the report.
NeilBrown