Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

From: Nikita Danilov
Date: Sat Nov 06 2004 - 14:27:38 EST


Andrea Arcangeli writes:
> On Sat, Nov 06, 2004 at 07:54:12PM +0300, Nikita Danilov wrote:
> > Andrea Arcangeli writes:
> > > On Sat, Nov 06, 2004 at 02:37:05PM +0300, Nikita Danilov wrote:
> > > > We need page-reservation API of some sort. There were several attempts
> > > > to introduce this, but none get into mainline.
> > >
> > > they're already in under the name of mempools
> >
> > I am talking about slightly different thing. Think of some operation
> > that calls find_or_create_page(). find_or_create_page() doesn't know
> > about memory reserved in mempools, it uses alloc_page() directly. If one
> > wants to guarantee that compound operation has enough memory to
> > complete, memory should be reserved at the lowest level---in the page
> > allocator.
>
> the page allocator reserve only memory in order to swapout, that's
> PF_MEMALLOC.
>
> For other purposes not related to swapping (which is not a deterministic
> thing, given there can be multiple layers of I/O and fs operations to
> do), you should use mempool and change find_or_create_page to get your
> reserved page as parameter.

This means breaking all layering and passing a mempool pointer all the
way down to the lowest-layer allocators (like bio and drivers). The only
practical way to do this is to put the mempool pointer into the current
task_struct. At that point it's no different from having a per-thread
list of pages that __alloc_pages() looks into before falling back to
per-cpu page-sets and buddy. _Except_ that in the latter case the
reservation is handled transparently in __alloc_pages(), and code doesn't
have to be adjusted to check for a mempool in a zillion places.
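
To make the comparison concrete, here is a rough user-space sketch of
the per-thread list idea. It is not the code from any actual patch:
struct task, reserve_pages(), alloc_page_sketch(), alloc_from_buddy()
and the simulate_oom switch are all hypothetical names invented for
illustration, and malloc() stands in for the per-cpu page-sets and
buddy allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a page; the kernel uses struct page. */
struct page {
    struct page *next;
    unsigned char data[SKETCH_PAGE_SIZE];
};

/* Hypothetical per-thread state; in the kernel this would be a field
 * in task_struct. */
struct task {
    struct page *reserved;  /* singly linked list of reserved pages */
    int nr_reserved;
};

static _Thread_local struct task current_task;

/* Test switch (an assumption of this sketch): pretend memory is gone. */
static int simulate_oom;

/* Stand-in for the normal path: per-cpu page-sets, then buddy. */
static struct page *alloc_from_buddy(void)
{
    if (simulate_oom)
        return NULL;
    return malloc(sizeof(struct page));
}

/* Reserve nr pages for the calling thread before a compound operation.
 * Returns 0 on success, -1 if the reservation cannot be met; on failure
 * the pages taken by this call are given back. */
int reserve_pages(int nr)
{
    struct task *tsk = &current_task;
    int taken = 0;

    while (taken < nr) {
        struct page *p = alloc_from_buddy();
        if (p == NULL)
            break;
        p->next = tsk->reserved;
        tsk->reserved = p;
        tsk->nr_reserved++;
        taken++;
    }
    if (taken == nr)
        return 0;

    /* Roll back the partial reservation. */
    while (taken-- > 0) {
        struct page *p = tsk->reserved;
        tsk->reserved = p->next;
        tsk->nr_reserved--;
        free(p);
    }
    return -1;
}

/* Allocation entry point: consult the caller's reserved list first and
 * only then fall back to the normal path. Callers further down the
 * stack need not know that a reservation exists. */
struct page *alloc_page_sketch(void)
{
    struct task *tsk = &current_task;

    if (tsk->reserved != NULL) {
        struct page *p = tsk->reserved;
        tsk->reserved = p->next;
        tsk->nr_reserved--;
        return p;
    }
    return alloc_from_buddy();
}

/* Give unused reserved pages back once the operation is complete. */
void release_reserved_pages(void)
{
    struct task *tsk = &current_task;

    while (tsk->reserved != NULL) {
        struct page *p = tsk->reserved;
        tsk->reserved = p->next;
        free(p);
    }
    tsk->nr_reserved = 0;
}
```

The point of the sketch is that only the top-level operation calls
reserve_pages() and release_reserved_pages(); everything below it keeps
calling the ordinary allocation function unchanged.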

>
> > > I'm perfectly aware the fs tends to be the less correct places in terms
> > > of allocations, and luckily it's not an heavy memory user, so I still
> >
> > Either you are kidding, or we are facing very different workloads. In
> > the world of file-system development, file-system is (not surprisingly)
> > single largest memory consumer.
>
> when the machine runs oom the fs allocations means nothing. when the
> machine runs oom is because somebody entered the malloc loop or
> something like that. all allocations come from page faults.
>
> Try yourself to run your box oom with getblk allocations. Only then
> you'll run into the deadlock.
>
> You've to keep in mind an oom condition happens once in a while, and
> when it happens the userspace memory allocation load is huge compared to
> any fs operation.

I think you are confusing "file system" and "ext2". I definitely know
from experience that with some file system types, the system can be
oommed without any significant user-level allocation activity. Now, one
can say that either such file systems are broken, or Linux MM lacks
support for the features (like reservation) they need.

>
> > > have to see a deadlock in getblk or create_buffers or similar. It's
> > > mostly a correctness issue (math proof it can't deadlock, right now it
> > > can if more tasks all get stuck in getblk at the same time during a hard
> > > oom condition etc..).
> >
> > Add here mmap that can dirty all physical memory behind your back, and
> > delayed disk block allocation that forces ->writepage() to allocate
> > potentially huge extent when memory is already tight and hope of having
> > a proof becomes quite remote.
>
> that's the PF_MEMALLOC path. A reservation already exists, or it would
> never work since 2.2. PF_MEMALLOC and the min/2 watermark are meant to
> allow writepage to allocate ram. however the amount reserved is limited,

The low-mem watermark is mostly useless in the face of direct reclaim,
when an unbounded number of threads enter try_to_free_pages() and call
->writepage() simultaneously.
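
A toy model of the arithmetic, with made-up numbers: the reserve is a
fixed pool, each ->writepage() call needs some pages to make progress,
and nothing is returned to the pool until the I/O completes. The names
(struct reserve, writepage_from_reserve(), reclaimers_served()) are
hypothetical and only illustrate the counting argument, not the actual
watermark logic.

```c
#include <assert.h>

/* Hypothetical model: a fixed shared reserve below the watermark. */
struct reserve {
    int pages_left;
};

/* One ->writepage() call that needs `need` pages to make progress.
 * Returns 1 if the reserve covered it, 0 if the caller would stall. */
static int writepage_from_reserve(struct reserve *r, int need)
{
    if (r->pages_left < need)
        return 0;
    r->pages_left -= need;
    return 1;
}

/* Count how many of nr_threads simultaneous reclaimers get the pages
 * they need, assuming no page comes back before all of them have run. */
static int reclaimers_served(int reserve_pages, int need, int nr_threads)
{
    struct reserve r = { reserve_pages };
    int served = 0;

    for (int i = 0; i < nr_threads; i++)
        served += writepage_from_reserve(&r, need);
    return served;
}
```

In this model the reserve size bounds the number of writers that can
proceed, while the number of threads entering direct reclaim has no
bound at all.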

> so it's not perfect. The only way to make it perfect I believe is to
> reserve the stuff inside the fs with mempools as described above.

I don't see what advantages mempools have over page reservation handled
directly by the page allocator, like in

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/reiser4-perthread-pages.patch

Nikita.