We're running a distributed computing environment on these boxes. We
have a process manager that launches jobs and monitors their
execution. In this case, a big job (twice the physical memory) starts on
the box, and just about everything else gets paged out. Our manager
application repeatedly reads the entries in /proc to find out how much
memory its child is using.

In the process of doing this, some other process on the machine
exits. The manager code has already opened the appropriate status entry
in /proc, and has made the call to read the data. At this point the
kernel Oopses in proc_unregister and kills the manager. The read does
not complete.
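
To make the sequence concrete, here is a minimal Python sketch of that kind of /proc polling. It is not the manager's actual code: the function names are hypothetical, and it assumes the figure being watched is the VmRSS line of /proc/<pid>/status, which the report does not specify. The open and the read are separate steps, and that gap is the window in which another process can exit.

```python
from typing import Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value (in kB) from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:      1234 kB".
            return int(line.split()[1])
    # Some entries (kernel threads, zombies) have no VmRSS line.
    return None


def child_rss_kb(pid: int) -> Optional[int]:
    """Read the resident set size of `pid`, or None if it is gone or unreadable."""
    try:
        # open() and read() are distinct system calls; the process table
        # can change between them.
        with open(f"/proc/{pid}/status") as f:
            return parse_vm_rss_kb(f.read())
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return None
```

A monitor would call child_rss_kb(pid) in a loop with a short sleep between polls. Catching the errors only covers the ordinary case of the child having exited; it does nothing about a kernel Oops during the read.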

We've been able to reproduce this on a test machine. It only happens
when the machine is paging fairly heavily. We reproduced it by starting
a job that just mallocs large chunks of data and touches all the bytes.
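
A rough Python equivalent of such a memory hog is sketched below. The names hog_memory and touch_chunk and the 64 MB default chunk size are assumptions made for illustration, not details from the original test job. It also writes one byte per page rather than every byte, which is enough to force each page to become resident.

```python
import mmap
from typing import List


def touch_chunk(size: int, page_size: int = mmap.PAGESIZE) -> bytearray:
    """Allocate `size` bytes and write to every page so it is actually backed."""
    chunk = bytearray(size)
    # One write per page forces the kernel to provide a real page for it.
    for offset in range(0, size, page_size):
        chunk[offset] = 1
    return chunk


def hog_memory(total_bytes: int,
               chunk_bytes: int = 64 * 1024 * 1024) -> List[bytearray]:
    """Allocate and touch `total_bytes` of memory in chunks; return the chunks."""
    chunks: List[bytearray] = []
    allocated = 0
    while allocated < total_bytes:
        size = min(chunk_bytes, total_bytes - allocated)
        chunks.append(touch_chunk(size))
        allocated += size
    # The caller must keep the returned list alive, or the memory is freed.
    return chunks
```

To push a machine into heavy paging, total_bytes would be set above physical memory (the job in this report was about twice that size) and the returned list held for as long as the test runs.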

Hopefully, this is enough to point someone in the right direction. In
the meantime, we've taken out the memory watching and everything is
behaving fine. I'm just afraid my users will notice and start launching
even bigger jobs. :-)

- |Daryll