Re: upcoming kerneloops.org item: get_page_from_freelist

From: David Rientjes
Date: Thu Jun 25 2009 - 14:52:07 EST

Next message: Frans Pop: "Re: [2.6.31-rc1] device-mapper: target device sda6 is misaligned"
Previous message: venkatesh . pallipadi: "[patch 0/3] Take care of cpufreq lockdep issues"
In reply to: Theodore Tso: "Re: upcoming kerneloops.org item: get_page_from_freelist"
Next in thread: Theodore Tso: "Re: upcoming kerneloops.org item: get_page_from_freelist"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 25 Jun 2009, Theodore Tso wrote:

> On Wed, Jun 24, 2009 at 03:07:14PM -0700, Andrew Morton wrote:
> >
> > fs/jbd/journal.c: new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
> >
> > But that isn't :(
>
> Well, we could recode it to do what journal_alloc_head() does, which
> is call the allocator in a loop:
>
> 	ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
> 	if (ret == NULL) {
> 		jbd_debug(1, "out of memory for journal_head\n");
> 		if (time_after(jiffies, last_warning + 5*HZ)) {
> 			printk(KERN_NOTICE "ENOMEM in %s, retrying.\n",
> 			       __func__);
> 			last_warning = jiffies;
> 		}
> 		while (ret == NULL) {
> 			yield();
> 			ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
> 		}
> 	}
>
> Like journal_write_metadata_buffer(), which you quoted, it's called
> out of the commit code, where about the only choice we have other than
> looping or using GFP_NOFAIL is to abort the filesystem and remount it
> read-only or panic. It's not at all clear to me that looping
> repeatedly is helpful; for example, the allocator doesn't know that it
> should try really hard, and perhaps fall back to an order 0 allocation
> if an order 1 allocation won't work.
>

Since it's using kmem_cache_alloc(), the order fallback is the
responsibility of the slab allocator when a new slab allocation fails and
a single object could fit in an order 0 page, so it's not a concern for
this particular allocation.

There's no way to indicate that the page allocator should "try really
hard" because the VM implementation should already do that for every
allocation before failure. A subsequent attempt after the first failure
could try GFP_ATOMIC, though, which allows allocation beyond the minimum
watermark and is more likely to succeed than GFP_NOFS. Such an
allocation should be short-lived and should not need additional memory in
order to be freed, so that it does not deplete most of the memory reserves
available to atomic allocations, direct reclaim, and oom killed tasks.
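
As a rough illustration of the retry idea above, here is a userspace toy
model, not kernel code. Every name in it (sim_zone, sim_alloc_page,
sim_alloc_with_fallback, the SIM_GFP_* values) is hypothetical, and the
rule that the atomic attempt may dip to half the minimum watermark is an
assumption picked only to show the effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for GFP_NOFS and GFP_ATOMIC; not kernel values. */
enum sim_gfp { SIM_GFP_NOFS, SIM_GFP_ATOMIC };

/* Toy model of a zone: free page count and its minimum watermark. */
struct sim_zone {
	long free_pages;
	long min_watermark;
};

/*
 * Toy allocator. A SIM_GFP_NOFS request must leave free_pages at or above
 * the minimum watermark; a SIM_GFP_ATOMIC request may dip to half of it.
 * The "half" is an assumption chosen for illustration.
 */
static bool sim_alloc_page(struct sim_zone *z, enum sim_gfp gfp)
{
	long limit;

	assert(z != NULL);
	limit = z->min_watermark;
	if (gfp == SIM_GFP_ATOMIC)
		limit -= limit / 2;
	if (z->free_pages - 1 < limit)
		return false;
	z->free_pages--;
	return true;
}

/* Try SIM_GFP_NOFS first; after the first failure, try SIM_GFP_ATOMIC once. */
static bool sim_alloc_with_fallback(struct sim_zone *z, enum sim_gfp *used)
{
	if (sim_alloc_page(z, SIM_GFP_NOFS)) {
		*used = SIM_GFP_NOFS;
		return true;
	}
	if (sim_alloc_page(z, SIM_GFP_ATOMIC)) {
		*used = SIM_GFP_ATOMIC;
		return true;
	}
	return false;
}
```

In this model, a zone sitting exactly at its minimum watermark fails the
first attempt and succeeds on the second, which is the point of the
fallback. A caller would still need to handle the case where both fail.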

> Hmm.... it may be possible to do the memory allocation in advance,
> before we get to the commit, and make it be easier to fail and return
> ENOMEM to userspace --- which I bet most applications won't handle
> gracefully, either (a) not checking error codes and losing data, or
> (b) dying on the spot, so it would effectively be an OOM kill.

If this would still be a GFP_NOFS allocation, the oom killer will not be
triggered (it is only called when __GFP_FS is set, to avoid killing tasks
when reclaim was not possible).
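
To make the flag dependency concrete, a minimal userspace sketch follows.
The SIM_* bit values and sim_may_call_oom_killer() are hypothetical; only
the compositions (GFP_NOFS lacking the FS bit that GFP_KERNEL carries) are
meant to mirror the kernel, and the real page allocator checks more
conditions than this.

```c
#include <stdbool.h>

/* Hypothetical bit values; only the flag compositions are meaningful. */
#define SIM_GFP_WAIT	0x1u
#define SIM_GFP_IO	0x2u
#define SIM_GFP_FS	0x4u

#define SIM_GFP_NOFS	(SIM_GFP_WAIT | SIM_GFP_IO)
#define SIM_GFP_KERNEL	(SIM_GFP_WAIT | SIM_GFP_IO | SIM_GFP_FS)

/*
 * Simplified decision: the oom killer is only considered when the FS bit
 * is set in the mask, since without it reclaim could not do its full job.
 */
static bool sim_may_call_oom_killer(unsigned int gfp_mask)
{
	return (gfp_mask & SIM_GFP_FS) != 0;
}
```

With these definitions a SIM_GFP_NOFS request never reaches the oom
killer, while a SIM_GFP_KERNEL request can.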
