Re: Higher slub memory consumption on 64K page-size systems?

From: Bharata B Rao
Date: Wed Nov 11 2020 - 04:04:16 EST


On Thu, Nov 05, 2020 at 05:47:03PM +0100, Vlastimil Babka wrote:
> On 10/28/20 6:50 AM, Bharata B Rao wrote:
> > slub_max_order
> > --------------
> > The most promising tunable that shows consistent reduction in slab memory
> > is slub_max_order. Here is a table that shows the number of slabs that
> > end up with different orders and the total slab consumption at boot
> > for different values of slub_max_order:
> > ---------------------------------------------------
> > slub_max_order   Order   NrSlabs   Slab memory
> > ---------------------------------------------------
> >                    0       276
> > 3 (default)        1        16     207488 kB
> >                    2         4
> >                    3        11
> > ---------------------------------------------------
> >                    0       276
> > 2                  1        16     166656 kB
> >                    2         4
> > ---------------------------------------------------
> >                    0       276     144128 kB
> > 1                  1        31
> > ---------------------------------------------------
> >
> > Though only a few bigger sized caches fall into order-2 or order-3, they
> > seem to make a considerable difference to the overall slab consumption.
> > If we take task_struct cache as an example, this is how it ends up when
> > slub_max_order is varied:
> >
> > task_struct, objsize=9856
> > --------------------------------------------
> > slub_max_order objperslab pagesperslab
> > --------------------------------------------
> > 3 53 8
> > 2 26 4
> > 1 13 2
> > --------------------------------------------
> >
> > The slab page-order, and hence the number of objects in a slab, has a
> > bearing on performance, but I wonder if some caches like task_struct
> > above can be auto-tuned to fall into a more conservative order and do
> > well wrt both memory and performance?
>
> Hmm ideally this should be based on objperslab, so if there are larger page
> sizes, then the calculated order becomes smaller, even 0?

It is indeed based on the number of objects that can be optimally
fit within a slab. As I explain below, currently we start with a
minimum objects value that ends up pushing the page order higher
for some combinations of slab size and page size. The question is:
can we start with a more conservative/lower value for min_objects
in calculate_order()?
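
For illustration, here is a minimal user-space sketch (not the kernel
code, just the same arithmetic) of how the starting min_objects value
interacts with a 64K page size for task_struct-sized objects. The
9856-byte object size and the CPU counts are taken from the numbers
above; the real calculate_order() additionally caps the result at
slub_max_order and accounts for wasted space, so this only shows the
upward pressure that min_objects exerts on the page order:

	#include <stdio.h>

	#define PAGE_SHIFT	16			/* 64K pages */
	#define PAGE_SIZE	(1UL << PAGE_SHIFT)

	/* Same semantics as the kernel's fls(): fls(0) == 0, fls(1) == 1, ... */
	static unsigned int fls_u32(unsigned int x)
	{
		unsigned int r = 0;

		while (x) {
			x >>= 1;
			r++;
		}
		return r;
	}

	/* Smallest order whose slab holds at least min_objects objects of @size. */
	static unsigned int order_for(unsigned long size, unsigned int min_objects)
	{
		unsigned int order = 0;

		while ((PAGE_SIZE << order) / size < min_objects)
			order++;
		return order;
	}

	int main(void)
	{
		unsigned long size = 9856;	/* task_struct objsize from above */
		unsigned int cpus[] = { 16, 64, 256, 512 };

		for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
			unsigned int min_objects = 4 * (fls_u32(cpus[i]) + 1);
			unsigned int order = order_for(size, min_objects);

			printf("cpus=%u min_objects=%u -> order-%u (%lu objects/slab)\n",
			       cpus[i], min_objects, order,
			       (PAGE_SIZE << order) / size);
		}
		return 0;
	}

With 16 CPUs this lands at order-2 (26 objects/slab), while 256 or 512
possible CPUs push task_struct to order-3 (53 objects/slab), matching
the table above.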

>
> > mm/slub.c:calculate_order() has the logic which determines the
> > page-order for the slab. It starts with min_objects and attempts
> > to arrive at the best configuration for the slab. The min_objects
> > value starts like this:
> >
> > min_objects = 4 * (fls(nr_cpu_ids) + 1);
> >
> > Here nr_cpu_ids depends on the maxcpus and hence this can have a
> > significant effect on those systems which define maxcpus. Slab numbers
> > post-boot for a KVM pseries guest that has 16 boottime CPUs and varying
> > number of maxcpus look like this:
> > -------------------------------
> > maxcpus Slab memory(kB)
> > -------------------------------
> > 64 209280
> > 256 253824
> > 512 293824
> > -------------------------------
>
> Yeah IIRC nr_cpu_ids is related to number of possible cpus which is rather
> excessive on some systems, so a relation to actually online cpus would make
> more sense.

Maybe I can send a patch to change the above calculation of
min_objects to be based on online CPUs and see how it is received.
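
Purely as a sketch of what such a change might look like (untested,
and whether cache-creation time is the right point to sample the
online CPU count is part of what the patch discussion would need to
settle):

	/* hypothetical change in mm/slub.c:calculate_order() */
	-	min_objects = 4 * (fls(nr_cpu_ids) + 1);
	+	min_objects = 4 * (fls(num_online_cpus()) + 1);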

>
> > Page-order is a one-time setting and obviously can't be tweaked dynamically
> > on CPU hotplug, but I just wanted to bring out its effect.
> >
> > And that constant multiplicative factor of 4 was in fact added by the commit
> > 9b2cd506e5f2 - "slub: Calculate min_objects based on number of processors."
> >
> > Reducing that to, say, 2 does give some reduction in slab memory
> > while retaining the same hackbench performance, but I am not
> > sure if that could be assumed to be beneficial for all scenarios.
> >
> > MIN_PARTIAL
> > -----------
> > This determines the number of slabs left on the partial list even if they
> > are empty. My initial thought was that the default MIN_PARTIAL value of 5
> > is on the higher side and we are accumulating MIN_PARTIAL number of
> > empty slabs in all caches without freeing them. However, I hardly find
> > any case where an empty slab is retained during freeing on account of
> > the partial slabs being fewer than MIN_PARTIAL.
> >
> > However, what I find in practice is that we are accumulating a lot of partial
> > slabs with just one in-use object in the whole slab. A high number of such
> > partial slabs is indeed contributing to the increased slab memory consumption.
> >
> > For example, after a hackbench run, I find the distribution of objects
> > like this for kmalloc-2k cache:
> >
> > total_objects                               3168
> > objects                                     1611
> > Nr partial slabs                              54
> > Nr partial slabs with just 1 inuse object     38
> >
> > With 64K page-size, so many partial slabs with just 1 inuse object can
> > result in high memory usage. Is there any workaround possible to prevent this
> > kind of situation?
>
> Probably not, this is just a fundamental internal fragmentation problem:
> we can't predict which objects will have similar lifetimes and thus put
> them together. Larger pages just make the effect more pronounced. It
> would be wrong if we allocated new pages instead of reusing the partial
> ones, but that's not the case, IIUC?

Correct, that shouldn't be the case. I will check by adding some
instrumentation and ascertain whether that is indeed the case.
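
With CONFIG_SLUB_STATS enabled, the per-cache alloc_from_partial and
alloc_slab counters under /sys/kernel/slab/ should already give a
rough picture of how often allocations were satisfied by refilling
from partial slabs versus by getting fresh slab pages from the page
allocator, so I may start with those before adding new
instrumentation. A minimal reader sketch (the kmalloc-2k cache name
is just the example from above):

	#include <stdio.h>

	static unsigned long read_stat(const char *cache, const char *stat)
	{
		char path[256];
		unsigned long val = 0;
		FILE *f;

		snprintf(path, sizeof(path), "/sys/kernel/slab/%s/%s", cache, stat);
		f = fopen(path, "r");
		if (!f)
			return 0;
		/* The file holds a total followed by a per-cpu breakdown; take the total. */
		if (fscanf(f, "%lu", &val) != 1)
			val = 0;
		fclose(f);
		return val;
	}

	int main(void)
	{
		const char *cache = "kmalloc-2k";
		unsigned long from_partial = read_stat(cache, "alloc_from_partial");
		unsigned long new_slabs = read_stat(cache, "alloc_slab");

		printf("%s: alloc_from_partial=%lu alloc_slab=%lu\n",
		       cache, from_partial, new_slabs);
		return 0;
	}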

>
> But you are measuring "after a hackbench run", so is that an important data
> point? If the system was in some kind of steady state workload, the pages
> would be better used I'd expect.

Maybe, I am not sure, we will have to check. I measured at two points: immediately
after boot as the initial state and after a hackbench run as an extreme state. I chose
hackbench as I see that earlier changes to some of this slab code and its tunables
have been supported by hackbench numbers.

>
> > cpu_partial
> > -----------
> > Here is how the slab consumption post-boot varies when all the slab
> > caches are forced to a fixed cpu_partial value:
> > ---------------------------
> > cpu_partial Slab Memory
> > ---------------------------
> > 0 175872 kB
> > 2 187136 kB
> > 4 191616 kB
> > default 204864 kB
> > ---------------------------
> >
> > It has been suggested earlier that reducing cpu_partial and/or making
> > cpu_partial 64K page-size aware will benefit. In set_cpu_partial(),
> > for bigger-sized slabs (size >= PAGE_SIZE), cpu_partial is already set
> > to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> > slabs does give some benefit.
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a28ed9b8fc61..e09eff1199bf 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
> >  	 */
> >  	if (!kmem_cache_has_cpu_partial(s))
> >  		slub_set_cpu_partial(s, 0);
> > -	else if (s->size >= PAGE_SIZE)
> > +	else if (s->size >= 8192)
> > +		slub_set_cpu_partial(s, 1);
> > +	else if (s->size >= 4096)
> >  		slub_set_cpu_partial(s, 2);
> >  	else if (s->size >= 1024)
> >  		slub_set_cpu_partial(s, 6);
> >
> > With the above change, the slab consumption post-boot reduces to 186048 kB.
>
> Yeah, making it agnostic to PAGE_SIZE makes sense.

Ok, let me send a separate patch for this.

Thanks for your inputs.

Regards,
Bharata.