Re: [PATCH] coredump: Retry writes where appropriate

From: Paul Smith
Date: Sun May 31 2009 - 12:56:47 EST


On Sun, 2009-05-31 at 16:03 +0200, Olivier Galibert wrote:
> On Sun, May 31, 2009 at 11:18:51AM +0100, Alan Cox wrote:
> > On Sun, 31 May 2009 01:33:39 -0400
> > Paul Smith <paul@xxxxxxxxxxxxxxxxx> wrote:
> >
> > > coredump: Retry writes where appropriate
> > >
> > > Core dump write operations (especially to a pipe) can be incomplete due
> > > to signal reception or possibly recoverable partial writes.
> >
> > NAK this
> >
> > > Previously any incomplete write in the ELF core dumper caused the core
> > > dump to stop, giving short cores in these cases. Modify the core dumper
> > > to retry the write where appropriate.
> >
> > The existing behaviour is an absolute godsend when you've something like
> > a core dump stuck on an NFS mount or something trying to core dump to
> > very slow media.
> >
> > In fact the signals checks were *purposefully added* some time ago.

This is what Olivier mentioned as well, and I do see the benefit of
being able to get rid of hung coredumps. But to me it's more
important to have reliable and robust coredumping, and I'm getting
reports of short cores on my systems at least once a week due to this
problem (the userspace applications I'm working with use signals for
certain well-defined situations, which tend to happen at around the
same time as you might expect core dumps).

> Perhaps removing the "|| r == -EINTR" part would make both of you
> happy? He gets the reliability on pipes, you keep the interrupt on
> signals.

I'm getting back ERESTARTSYS in my environment, and it's happening
because pipe_write() detects a signal pending. I don't think this is
due to SIGPIPE, and I'm not sure that removing EINTR will give Alan the
behavior he is looking for.
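
To make the behaviour concrete, here is a rough userspace analogy of the
retry logic under discussion. It is only a sketch, not the patch itself:
the kernel-internal ERESTARTSYS is stood in for by EINTR (which is what a
userspace write() reports when interrupted), and write_retry() and the
give_up callback are hypothetical names made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback: returns nonzero when retrying should stop. */
typedef int (*give_up_fn)(void);

/*
 * Try to write all len bytes to fd, retrying short writes and writes
 * interrupted by a signal.  Returns the number of bytes actually
 * written; the caller compares it with len to detect a short dump.
 */
ssize_t write_retry(int fd, const void *buf, size_t len, give_up_fn give_up)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);

    while (done < len) {
        ssize_t r = write(fd, p + done, len - done);

        if (r > 0) {
            done += (size_t)r;  /* short write: carry on with the rest */
            continue;
        }
        if (r < 0 && errno == EINTR && !(give_up && give_up()))
            continue;           /* interrupted, caller still wants to go on */
        break;                  /* hard error, zero-length write, or give-up */
    }
    return (ssize_t)done;
}
```

Passing NULL for give_up retries on every interruption; passing a
callback is where a "stop on this particular signal" policy would plug in.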

Another possibility would be to examine the signal itself and not
retry if it's SIGKILL. I'm too much of a kernel hacking noob to know
offhand how to find the pending signal, but I can certainly figure it
out. If it's possible, Alan, would that be an acceptable alternative?
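
As a rough illustration of "look at which signal is pending before
deciding to retry", here is a userspace sketch using sigpending(). It is
an analogy under stated assumptions, not kernel code: should_retry() is a
hypothetical name, and because SIGKILL cannot be blocked in userspace the
check is only meaningful there for a blockable signal. Inside the kernel
I believe fatal_signal_pending() is the helper intended for the SIGKILL
case, but I have not verified how it would fit into the dumper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

/*
 * Hypothetical policy helper: returns 1 if it is OK to retry, 0 if the
 * signal stop_signo is pending for the caller (so the retry should be
 * abandoned), or -1 if the pending set could not be read.
 */
int should_retry(int stop_signo)
{
    sigset_t pending;

    assert(stop_signo > 0);

    if (sigpending(&pending) != 0)
        return -1;

    /* sigismember() yields 1 when stop_signo is in the pending set. */
    return sigismember(&pending, stop_signo) == 1 ? 0 : 1;
}
```

Any other pending signal leaves the answer at "retry", which is the
behaviour I'm after for the signals my applications use.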

I'm not entirely happy with this because, as was discussed in an earlier
thread, there are plenty of common idioms where you can expect to
receive a SIGKILL while you're in the middle of dumping core (lots of
userspace setups will send a few HUPs, followed by a few INTs, and if
the process is still there they send KILLs). However, for my specific
purposes this would be sufficient.

Other ideas?