Re: [PATCH 1/8] memcg: export kmemcg cache id via cgroup fs

From: David Rientjes
Date: Mon Feb 03 2014 - 06:04:36 EST


On Mon, 3 Feb 2014, Vladimir Davydov wrote:

> AFAIU, cgroup identifiers dumped on oom (cgroup paths, currently) and
> memcg slab cache names serve for different purposes.

Sure, you may dump the name for a number of legitimate reasons, but the
problem still exists that it's difficult to determine which memcg is
being referenced without a flat hierarchy and unique memcg names for all
children.

> The point is oom is
> a perfectly normal situation for the kernel, and info dumped to dmesg is
> for admin to find out the cause of the problem (a greedy user or
> cgroup).

Hmm, so if we hand out top-level memcgs to individual jobs or users, like
our userspace does, and they are able to configure their child memcgs as
they wish, and then they or the admin find in the kernel log that a
memory hog was killed from the memcg with the perfectly anonymous memcg
name of "memcg", how do we determine what job or user triggered that kill?
User id is not going to be conclusive in a production environment with
shared user accounts.

> On the other hand, slab cache names are dumped to dmesg only on
> extraordinary situations - like bugs in slab implementation, or double
> free, or detected memory leaks - where we usually do not need the name
> of the memcg that triggered the problem, because the bug is likely to be
> in the kernel subsys using the cache.

There's certainly overlap here: a slab leak triggered by a particular
workload, perhaps through use of a particular syscall, can occur and
cause oom killing. But the problem remains that neither the memcg name
nor the slab cache name may be conclusive in determining what job or user
triggered the issue. That's why we make strict demands that memcg names
are always unique and encode several key values to identify the user and
job, and we don't rely on the parent.
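
To illustrate the kind of naming demand I mean, here is a minimal sketch.
The field set (user, job, instance), the "." separator, and both function
names are assumptions made up for this example, not what our userspace
actually does. The point is only that the leaf name alone, wherever it
sits in the hierarchy, is enough to recover the user and job.

```python
import re

# Hypothetical rule: each text field is alphanumeric so "." can separate fields.
_FIELD = re.compile(r"^[A-Za-z0-9]+$")


def encode_memcg_name(user, job, instance):
    """Build a leaf memcg name that identifies user and job by itself."""
    if not isinstance(instance, int) or instance < 0:
        raise ValueError("instance must be a non-negative integer")
    for field in (user, job):
        if not _FIELD.match(field):
            raise ValueError("invalid field: %r" % (field,))
    return "%s.%s.%d" % (user, job, instance)


def decode_memcg_name(path):
    """Recover the identifying values from the last path component only."""
    leaf = path.rstrip("/").rsplit("/", 1)[-1]
    parts = leaf.split(".")
    if len(parts) != 3:
        raise ValueError("not an encoded memcg name: %r" % (leaf,))
    user, job, instance = parts
    if not (_FIELD.match(user) and _FIELD.match(job) and instance.isdigit()):
        raise ValueError("not an encoded memcg name: %r" % (leaf,))
    return {"user": user, "job": job, "instance": int(instance)}
```

With a scheme like this, a name seen in the kernel log can be decoded
without consulting the parent memcg or any stored side table.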

I can also see the huge maintenance burden of keeping around a mapping of
kmem ids to {user, job} pairs just in case we later identify a problem;
in 99% of cases it would just be wasted storage.

> Plus, the names are exported to
> sysfs in case of slub, again for debugging purposes, AFAIK. So IMO the
> use cases for oom vs slab names are completely different - information
> vs debugging - and I want to export kmem.id only for the ability of
> debugging kmemcg and slab subsystems.
>

Eeek, I'm not sure I agree. I've often found that reproducing rare slab
issues is very difficult without knowledge of the workload that triggered
them. Suppose X is a very large number of machines and we see this issue
on 0.0001% of them. I would be required to enable this "debugging" aid
unconditionally to ever be able to map the stored kmem id back to a user
and job, and that mapping would be extremely costly to maintain. We would
have gained nothing if we had already demanded that userspace identify
their memcg names with unique identifiers regardless of where they are in
the hierarchy.