Re: [PATCH 0/5] make slab gfp fair

From: Peter Zijlstra
Date: Wed May 16 2007 - 16:19:52 EST


On Wed, 2007-05-16 at 12:53 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> > If this 4k cpu system ever gets to touch the new lock it is in way
> > deeper problems than a bouncing cache-line.
>
> So its no use on NUMA?

It is, its just that we're swapping very heavily at that point, a
bouncing cache-line will not significantly slow down the box compared to
waiting for block IO, will it?

> > Please look at it more carefully.
> >
> > We differentiate pages allocated at the level where GFP_ATOMIC starts to
> > fail. By not updating the percpu slabs those are retried every time,
> > except for ALLOC_NO_WATERMARKS allocations; those are served from the
> > ->reserve_slab.
> >
> > Once a regular slab allocation succeeds again, the ->reserve_slab is
> > cleaned up and never again looked at it until we're in distress again.
>
> A single slab? This may only give you a a single object in an extreme
> case. Are you sure that this solution is generic enough?

Well, single as in a single active; it gets spilled into the full list
and a new one is instanciated if more is needed.

> The problem here is that you may spinlock and take out the slab for one
> cpu but then (AFAICT) other cpus can still not get their high priority
> allocs satisfied. Some comments follow.

All cpus are redirected to ->reserve_slab when the regular allocations
start to fail.

> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
> > ---
> > include/linux/slub_def.h | 2 +
> > mm/slub.c | 85 ++++++++++++++++++++++++++++++++++++++++++-----
> > 2 files changed, 78 insertions(+), 9 deletions(-)
> >
> > Index: linux-2.6-git/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6-git.orig/include/linux/slub_def.h
> > +++ linux-2.6-git/include/linux/slub_def.h
> > @@ -46,6 +46,8 @@ struct kmem_cache {
> > struct list_head list; /* List of slab caches */
> > struct kobject kobj; /* For sysfs */
> >
> > + struct page *reserve_slab;
> > +
> > #ifdef CONFIG_NUMA
> > int defrag_ratio;
> > struct kmem_cache_node *node[MAX_NUMNODES];
> > Index: linux-2.6-git/mm/slub.c
> > ===================================================================
> > --- linux-2.6-git.orig/mm/slub.c
> > +++ linux-2.6-git/mm/slub.c
> > @@ -20,11 +20,13 @@
> > #include <linux/mempolicy.h>
> > #include <linux/ctype.h>
> > #include <linux/kallsyms.h>
> > +#include "internal.h"
> >
> > /*
> > * Lock order:
> > - * 1. slab_lock(page)
> > - * 2. slab->list_lock
> > + * 1. reserve_lock
> > + * 2. slab_lock(page)
> > + * 3. node->list_lock
> > *
> > * The slab_lock protects operations on the object of a particular
> > * slab and its metadata in the page struct. If the slab lock
> > @@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
> > static void sysfs_slab_remove(struct kmem_cache *s) {}
> > #endif
> >
> > +static DEFINE_SPINLOCK(reserve_lock);
> > +
> > /********************************************************************
> > * Core slab cache functions
> > *******************************************************************/
> > @@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
> > s->ctor(object, s, 0);
> > }
> >
> > -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> > +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *rank)
> > {
> > struct page *page;
> > struct kmem_cache_node *n;
> > @@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
> > if (!page)
> > goto out;
> >
> > + *rank = page->rank;
> > n = get_node(s, page_to_nid(page));
> > if (n)
> > atomic_long_inc(&n->nr_slabs);
> > @@ -1311,7 +1316,7 @@ static void unfreeze_slab(struct kmem_ca
> > /*
> > * Remove the cpu slab
> > */
> > -static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
> > +static void __deactivate_slab(struct kmem_cache *s, struct page *page)
> > {
> > /*
> > * Merge cpu freelist into freelist. Typically we get here
> > @@ -1330,10 +1335,15 @@ static void deactivate_slab(struct kmem_
> > page->freelist = object;
> > page->inuse--;
> > }
> > - s->cpu_slab[cpu] = NULL;
> > unfreeze_slab(s, page);
> > }
>
> So you want to spill back the lockless_freelist without deactivating the
> slab? Why are you using the lockless_freelist at all? If you do not use it
> then you can call unfreeze_slab. No need for this split.

Ah, quite right. I do indeed not use the lockless_freelist.

> > @@ -1395,6 +1405,7 @@ static void *__slab_alloc(struct kmem_ca
> > {
> > void **object;
> > int cpu = smp_processor_id();
> > + int rank = 0;
> >
> > if (!page)
> > goto new_slab;
> > @@ -1424,10 +1435,26 @@ new_slab:
> > if (page) {
> > s->cpu_slab[cpu] = page;
> > goto load_freelist;
> > - }
> > + } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
> > + goto try_reserve;
>
> Ok so we are trying to allocate a slab and do not get one thus ->
> try_reserve.

Right, so the cpu-slab is NULL, and we need a new slab.

> But this is only working if we are using the slab after
> explicitly flushing the cpuslabs. Otherwise the slab may be full and we
> get to alloc_slab.

/me fails to parse.

When we need a new_slab:
- we try the partial lists,
- we try the reserve (if ALLOC_NO_WATERMARKS)
otherwise alloc_slab

> >
> > - page = new_slab(s, gfpflags, node);
> > - if (page) {
>
> > +alloc_slab:
> > + page = new_slab(s, gfpflags, node, &rank);
> > + if (page && rank) {
>
> Huh? You mean !page?

No, no, we did get a page, and it was !ALLOC_NO_WATERMARK hard to get
it.

> > + if (unlikely(s->reserve_slab)) {
> > + struct page *reserve;
> > +
> > + spin_lock(&reserve_lock);
> > + reserve = s->reserve_slab;
> > + s->reserve_slab = NULL;
> > + spin_unlock(&reserve_lock);
> > +
> > + if (reserve) {
> > + slab_lock(reserve);
> > + __deactivate_slab(s, reserve);
> > + putback_slab(s, reserve);
>
> Remove the above two lines (they are wrong regardless) and simply make
> this the cpu slab.

It need not be the same node; the reserve_slab is node agnostic.
So here the free page watermarks are good again, and we can forget all
about the ->reserve_slab. We just push it on the free/partial lists and
forget about it.

But like you said above: unfreeze_slab() should be good, since I don't
use the lockless_freelist.

> > + }
> > + }
> > cpu = smp_processor_id();
> > if (s->cpu_slab[cpu]) {
> > /*
> > @@ -1455,6 +1482,18 @@ new_slab:
> > SetSlabFrozen(page);
> > s->cpu_slab[cpu] = page;
> > goto load_freelist;
> > + } else if (page) {
> > + spin_lock(&reserve_lock);
> > + if (s->reserve_slab) {
> > + discard_slab(s, page);
> > + page = s->reserve_slab;
> > + }
> > + slab_lock(page);
> > + SetPageActive(page);
> > + s->reserve_slab = page;
> > + spin_unlock(&reserve_lock);
> > +
> > + goto got_reserve;

So this is when we get a page and it was ALLOC_NO_WATERMARKS hard to get
it. Instead of updating the cpu_slab we leave that unset, so that
subsequent allocations will try to allocate a slab again thereby testing
the current free pages limit (and not gobble up the reserve).

> > }
> > return NULL;
> > debug:
> > @@ -1470,6 +1509,31 @@ debug:
> > page->freelist = object[page->offset];
> > slab_unlock(page);
> > return object;
> > +
> > +try_reserve:
> > + spin_lock(&reserve_lock);
> > + page = s->reserve_slab;
> > + if (!page) {
> > + spin_unlock(&reserve_lock);
> > + goto alloc_slab;
> > + }
> > +
> > + slab_lock(page);
> > + if (!page->freelist) {
> > + s->reserve_slab = NULL;
> > + spin_unlock(&reserve_lock);
> > + __deactivate_slab(s, page);
> replace with unfreeze slab.
>
> > + putback_slab(s, page);
>
> Putting back the slab twice.

__deactivete_slab() doesn't do putback_slab, and now I see that whole
function isn't there anymore. unfreeze_slab() it is.

> > + goto alloc_slab;
> > + }
> > + spin_unlock(&reserve_lock);
> > +
> > +got_reserve:
> > + object = page->freelist;
> > + page->inuse++;
> > + page->freelist = object[page->offset];
> > + slab_unlock(page);
> > + return object;



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/