Re: kernfs memcg accounting

From: Roman Gushchin
Date: Tue May 10 2022 - 23:06:39 EST


On Wed, May 04, 2022 at 12:00:18PM +0300, Vasily Averin wrote:
> On 5/3/22 00:22, Michal Koutný wrote:
> > When struct mem_cgroup charging was introduced, there was a similar
> > discussion [1].
>
> Thank you, I missed this patch; it was very interesting and useful.
> I would note, though, that OpenVZ and LXC have another use case:
> we have separate and independent systemd instances inside OS containers.
> So a container's cgroups are created not in the host's root memcg but
> inside the accountable container's root memcg.
>
> > I can see following aspects here:
> > 1) absolute size of kernfs_objects,
> > 2) practical difference between a) and b),
> > 3) consistency with memcg,
> > 4) v1 vs v2 behavior.
> ...
> > How do these reasonings align with your original intention of net
> > devices accounting? (Are the creators of net devices inside the
> > container?)
>
> It is possible to create a netdevice in one namespace/container
> and then move it to another one, and this possibility is widely used.
> With my patch, memory allocated by these devices will not be accounted
> to the new memcg; however, I do not think it is a problem.
> My patches mostly protect the host from misuse, when someone creates
> a huge number of netdevices inside a container.
>
> >> Do you think it is incorrect and new kernfs node should be accounted
> >> to memcg of parent cgroup, as mem_cgroup_css_alloc()-> mem_cgroup_alloc() does?
> >
> > I don't think either variant is incorrect. I'd very much prefer the
> > consistency with memcg behavior (variant a)) but as I've listed the
> > arguments above, it seems such a consistency can't be easily justified.
>
> From my point of view, the most important thing is to account allocated memory
> to any cgroup inside the container. Selecting the proper memcg is a secondary goal here.
> Frankly speaking, I do not see a big difference between the memcg of the current process,
> the memcg of the newly created child and the memcg of its parent.
>
> As far as I understand, Roman chose the parent memcg because it was a special
> case of creating a new memory group. He temporarily changed the active memcg
> in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
> allocations.

My primary goal was to apply memory pressure to memory cgroups with a lot
of (dying) children cgroups. On a multi-CPU machine a memory cgroup structure
is way larger than a page, so a cgroup which looks small can be really large
if we add up the memory taken by the internals of all its children memcgs.
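
To make the arithmetic concrete, here is a rough userspace sketch (not kernel
code). BASE_BYTES and PERCPU_BYTES are made-up numbers chosen only for
illustration; the real sizes depend on the kernel version and config. The
helper names are hypothetical too.

```c
#include <stddef.h>

/* Assumed sizes, for illustration only. They are NOT taken from any kernel. */
#define BASE_BYTES   2048UL  /* assumed fixed part of one memcg */
#define PERCPU_BYTES 1024UL  /* assumed per-CPU data of one memcg */
#define PAGE_BYTES   4096UL

/* Memory taken by the internals of a single memcg on an ncpus machine. */
static unsigned long memcg_footprint(unsigned long ncpus)
{
	return BASE_BYTES + PERCPU_BYTES * ncpus;
}

/* Same footprint, rounded up to whole pages. */
static unsigned long memcg_footprint_pages(unsigned long ncpus)
{
	return (memcg_footprint(ncpus) + PAGE_BYTES - 1) / PAGE_BYTES;
}

/* Memory pinned by nchildren (possibly dying) child memcgs. */
static unsigned long children_footprint(unsigned long ncpus,
					unsigned long nchildren)
{
	return memcg_footprint(ncpus) * nchildren;
}
```

With these assumed sizes, the per-CPU part dominates as soon as there are more
than a couple of CPUs, and the total grows linearly with the number of
children, whether or not they are still in use.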

Applying this pressure to another cgroup (e.g. the one which contains systemd)
doesn't help to reclaim any pages which are pinning the dying cgroups.

For other controllers (maybe blkcg aside, idk) it shouldn't matter, because
there is no such problem there.

For consistency reasons I'd suggest charging all *large* allocations
(e.g. percpu) to the parent cgroup. Small allocations can be ignored.
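
Roughly, the policy I have in mind looks like the toy model below. This is
plain userspace C, not the kernel API: struct cg, LARGE_ALLOC_THRESHOLD and
charge_new_child_alloc() are hypothetical names, and the one-page threshold is
only an assumption about where "large" might start.

```c
#include <stddef.h>

/* Toy stand-in for a cgroup with a memory counter (hypothetical). */
struct cg {
	struct cg *parent;
	size_t charged;		/* bytes charged to this cgroup */
};

/* Assumed cut-off between "small" and "large" allocations. */
#define LARGE_ALLOC_THRESHOLD 4096

/*
 * Account an allocation made while creating a child of @parent.
 * Large allocations are charged to @parent; small ones are left
 * unaccounted. Returns the cgroup that was charged, or NULL.
 */
static struct cg *charge_new_child_alloc(struct cg *parent, size_t size)
{
	if (size < LARGE_ALLOC_THRESHOLD)
		return NULL;

	parent->charged += size;
	return parent;
}
```

The point is that the charge lands on the cgroup whose reclaim can actually
release the children, not on whichever task happened to create them.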

Thanks!