Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups

From: Johannes Weiner
Date: Mon Feb 03 2020 - 17:17:38 EST


On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > This is fairly big but mostly red patch, which makes all non-root
> > > slab allocations use a single set of kmem_caches instead of
> > > creating a separate set for each memory cgroup.
> > >
> > > Because the number of non-root kmem_caches is now capped by the number
> > > of root kmem_caches, there is no need to shrink or destroy them
> > > prematurely. They can be perfectly destroyed together with their
> > > root counterparts. This allows to dramatically simplify the
> > > management of non-root kmem_caches and delete a ton of code.
> >
> > This is definitely going in the right direction. But it doesn't quite
> > explain why we still need two sets of kmem_caches?
> >
> > In the old scheme, we had completely separate per-cgroup caches with
> > separate slab pages. If a cgrouped process wanted to allocate a slab
> > object, we'd go to the root cache and used the cgroup id to look up
> > the right cgroup cache. On slab free we'd use page->slab_cache.
> >
> > Now we have slab pages that have a page->objcg array. Why can't all
> > allocations go through a single set of kmem caches? If an allocation
> > is coming from a cgroup and the slab page the allocator wants to use
> > doesn't have an objcg array yet, we can allocate it on the fly, no?
>
> Well, arguably it can be done, but there are few drawbacks:
>
> 1) On the release path you'll need to make some extra work even for
> root allocations: calculate the offset only to find the NULL objcg pointer.
>
> 2) There will be a memory overhead for root allocations
> (which might or might not be compensated by the increase
> of the slab utilization).

Those two are only true if there is a wild mix of root and cgroup
allocations inside the same slab, and that doesn't really happen in
practice. Either the machine is dedicated to one workload and cgroups
are only enabled due to e.g. a vendor kernel, or you have cgrouped
systems (like most distro systems now) that cgroup everything.

> 3) I'm working on percpu memory accounting that resembles the same scheme,
> except that obj_cgroups vector is created for the whole percpu block.
> There will be root- and memcg-blocks, and it will be expensive to merge them.
> I kinda like using the same scheme here and there.

It's hard to conclude anything based on this information alone. If
it's truly expensive to merge them, then it warrants the additional
complexity. But I don't understand the desire to share a design for
two systems with sufficiently different constraints.

> Upsides?
>
> 1) slab utilization might increase a little bit (but I doubt it will have
> a huge effect, because both merging sets should be relatively big and well
> utilized)

Right.

> 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> but there isn't so much code left anyway.

There is a lot of complexity associated with the cache cloning that
isn't the lines of code, but the lifetime and synchronization rules.

And these two things are the primary aspects that make my head hurt
trying to review this patch series.

> So IMO it's an interesting direction to explore, but not something
> that necessarily has to be done in the context of this patchset.

I disagree. Instead of replacing the old coherent model and its
complexities with a new coherent one, you are mixing the two. And I
can barely understand the end result.

Dynamically cloning entire slab caches for the sole purpose of telling
whether the pages have an obj_cgroup array or not is *completely
insane*. If the controller had followed the obj_cgroup design from the
start, nobody would have ever thought about doing it like this.

>From a maintainability POV, we cannot afford merging it in this form.