Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

From: Peter Zijlstra
Date: Wed Aug 25 2010 - 08:49:04 EST


On Wed, 2010-08-25 at 07:57 -0400, Ted Ts'o wrote:
> On Wed, Aug 25, 2010 at 01:35:32PM +0200, Peter Zijlstra wrote:
> > On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
> > > Part of the problem is that we have a few places in the kernel where
> > > failure is really not an option --- or rather, if we're going to fail
> > > while we're in the middle of doing a commit, our choices really are
> > > (a) retry the loop in the jbd layer (which Andrew really doesn't
> > > like), (b) keep our own private cache of free memory so we don't fail
> > > and/or loop, (c) fail the file system and mark it read-only, or (d)
> > > panic.
> >
> > d) do the allocation before you're committed to going fwd and can still
> > fail and back out.
>
> Sure in some cases that can be done, but the commit has to happen at
> some point, or we run out of journal space, at which point we're back
> to (c) or (d).

Well (b) sounds a lot saner than either of those. Simply revert to a
state that is sub-optimal but has bounded memory use and reserve that
memory up-front. That way you can always get out of a tight memory spot.

It's what the block layer has always done to avoid the memory deadlock
situation: it has a private stash of BIOs that is big enough to always
service some IO, and as long as IO is happening, stuff keeps moving fwd
and we don't deadlock.
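
To make the private-stash idea concrete, here is a minimal userspace
sketch of a bounded reserve, loosely modelled on what the kernel's
mempool does. Everything in it (struct reserve_pool, RESERVE_MIN, the
under_pressure flag that simulates a failing allocator) is hypothetical
and only for illustration; it is not the block layer's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: number of objects needed to guarantee forward progress. */
#define RESERVE_MIN 4

struct reserve_pool {
	void  *slot[RESERVE_MIN];	/* objects set aside up front */
	size_t nr_free;			/* how many slots are currently filled */
	size_t obj_size;
};

/* Fill the reserve while failure can still be reported and backed out. */
static int reserve_pool_init(struct reserve_pool *p, size_t obj_size)
{
	p->nr_free = 0;
	p->obj_size = obj_size;
	while (p->nr_free < RESERVE_MIN) {
		void *obj = malloc(obj_size);

		if (!obj) {
			while (p->nr_free)
				free(p->slot[--p->nr_free]);
			return -1;
		}
		p->slot[p->nr_free++] = obj;
	}
	return 0;
}

/*
 * Try the normal allocator first and dip into the reserve only when it
 * fails. under_pressure is an assumption of this sketch: it simulates
 * the normal allocator failing. NULL means the reserve is empty and the
 * caller has to wait for an in-flight object to be freed.
 */
static void *reserve_pool_alloc(struct reserve_pool *p, int under_pressure)
{
	void *obj = under_pressure ? NULL : malloc(p->obj_size);

	if (obj)
		return obj;
	if (p->nr_free)
		return p->slot[--p->nr_free];
	return NULL;
}

/* Refill the reserve before handing memory back to the system. */
static void reserve_pool_free(struct reserve_pool *p, void *obj)
{
	assert(obj != NULL);
	if (p->nr_free < RESERVE_MIN)
		p->slot[p->nr_free++] = obj;
	else
		free(obj);
}

/* Release whatever is still held in the reserve. */
static void reserve_pool_destroy(struct reserve_pool *p)
{
	while (p->nr_free)
		free(p->slot[--p->nr_free]);
}
```

The point of the free path is that completed IO refills the reserve
first, so as long as something completes, the next allocation under
pressure can succeed.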

Filesystems might have a slightly harder time creating such a bounded
state because there might be more involved like journals and the like,
but still it should be possible to create something like that (my swap
over nfs patches created such a state for the network rx side of
things).

Also, we cannot let our fear of crappy userspace get in the way of doing
sensible things. Your example of write(2) returning -ENOMEM is not
correct, though: the syscall (and the page_mkwrite callback for mmap()s)
happens before we actually dirty the data and need to write things out,
so we can always simply wait for memory to become available to dirty.
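
As a rough illustration of waiting before dirtying instead of failing,
the sketch below throttles at the point where a page is about to be
dirtied. The names (struct dirty_budget, writeback_one, dirty_one_page)
are made up for this example, and the loop that calls writeback_one()
stands in for sleeping until writeback makes progress; real kernel code
is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accounting of how many pages may be dirty at once. */
struct dirty_budget {
	size_t limit;	/* must be > 0 */
	size_t dirty;	/* pages currently dirty */
};

/* Stand-in for writeback: cleans one page if there is one to clean. */
static void writeback_one(struct dirty_budget *b)
{
	if (b->dirty)
		b->dirty--;
}

/*
 * Called before the data is dirtied. Nothing has been committed yet,
 * so the caller can simply wait for room; there is no error return.
 */
static void dirty_one_page(struct dirty_budget *b)
{
	assert(b->limit > 0);
	while (b->dirty >= b->limit)
		writeback_one(b);	/* a kernel would sleep here */
	b->dirty++;
}
```

Because the wait happens before anything is dirtied, the writer is
slowed down but never sees an allocation failure for this step.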

