Re: [PATCH 1/8] memcg: export kmemcg cache id via cgroup fs

From: Vladimir Davydov
Date: Mon Feb 03 2014 - 08:00:27 EST


On 02/03/2014 03:04 PM, David Rientjes wrote:
> On Mon, 3 Feb 2014, Vladimir Davydov wrote:
>
>> AFAIU, cgroup identifiers dumped on oom (cgroup paths, currently) and
>> memcg slab cache names serve different purposes.
> Sure, you may dump the name for a number of legitimate reasons, but the
> problem still exists that it's difficult to determine what memcg is being
> referenced without a flat hierarchy and unique memcg names for all
> children.
>
>> The point is oom is
>> a perfectly normal situation for the kernel, and info dumped to dmesg is
>> for the admin to find out the cause of the problem (a greedy user or
>> cgroup).
> Hmm, so if we hand out top-level memcgs to individual jobs or users, like
> our userspace does, and they are able to configure their child memcgs as
> they wish, and then they or the admin finds in the kernel log that a
> memory hog was killed from the memcg with the perfectly anonymous memcg
> name of "memcg", how do we determine what job or user triggered that kill?
> User id is not going to be conclusive in a production environment with
> shared user accounts.
>
>> On the other hand, slab cache names are dumped to dmesg only in
>> extraordinary situations - like bugs in slab implementation, or double
>> free, or detected memory leaks - where we usually do not need the name
>> of the memcg that triggered the problem, because the bug is likely to be
>> in the kernel subsys using the cache.
> There's certainly overlap here since slab leaks triggered by a particular
> workload, perhaps by usage of a particular syscall, can occur and cause
> oom killing but the problem remains that neither the memcg name nor the
> slab cache name may be conclusive to determine what job or user triggered
> the issue. That's why we make strict demands that memcg names are always
> unique and encode several key values to identify the user and job and we
> don't rely on the parent.
>
> I can also see the huge maintenance burden it would be to keep around a
> mapping of kmem ids to {user, job} pairs just in case we later identify a
> problem; in 99% of cases it would just be wasted storage.
>
>> Plus, the names are exported to
>> sysfs in case of slub, again for debugging purposes, AFAIK. So IMO the
>> use cases for oom vs slab names are completely different - information
>> vs debugging - and I want to export kmem.id only for the ability of
>> debugging kmemcg and slab subsystems.
>>
> Eeek, I'm not sure I agree. I've often found that reproducing rare slab
> issues is very difficult without knowledge of the workload that triggers
> them. If X is a very large number of machines and we see
> this issue on 0.0001% of them, I would be required to enable this
> "debugging" aid unconditionally to ever be able to map the stored kmem id
> back to a user and job. That mapping would be extremely costly to
> maintain, and we would gain nothing if we had already demanded that
> userspace identify their memcg names with unique identifiers regardless of
> where they are in the hierarchy.

I see your point, and it sounds quite reasonable to me. So I guess I'll
drop the patch removing the cgroup name part from slab cache names
(patch 2) and resend.
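
To make the naming argument concrete, here is a minimal userspace sketch of
a memcg name that encodes user and job, so that the last path component seen
in a log line identifies the owner without any stored id mapping. The
"u<user>-j<job>-<label>" format and both function names are assumptions made
up for illustration; nothing in this thread specifies how such names are
actually built.

```python
import re

# Hypothetical scheme: "u<user>-j<job>-<label>". The thread only says names
# are unique and encode user and job; this exact format is an assumption.
NAME_RE = re.compile(
    r"^u(?P<user>[A-Za-z0-9_]+)-j(?P<job>[0-9]+)-(?P<label>[A-Za-z0-9_.]+)$"
)


def make_memcg_name(user, job, label):
    """Build a self-describing memcg directory name (hypothetical helper)."""
    name = f"u{user}-j{job}-{label}"
    if NAME_RE.match(name) is None:
        raise ValueError(f"cannot encode {user!r}, {job!r}, {label!r}")
    return name


def identify(cgroup_path):
    """Return (user, job) parsed from the last path component, or None.

    Only the leaf name is inspected, so the result does not depend on
    where the memcg sits in the hierarchy or on its parents' names.
    """
    leaf = cgroup_path.rstrip("/").rsplit("/", 1)[-1]
    match = NAME_RE.match(leaf)
    if match is None:
        return None  # e.g. an anonymous child simply called "memcg"
    return match.group("user"), int(match.group("job"))
```

With names like these, a leaf called "memcg" yields nothing, while a leaf
built by make_memcg_name() can be traced to its user and job at any depth.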

Thanks.