Re: [PATCH] memcg: Remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure
From: Kirill Tkhai
Date: Fri Apr 13 2018 - 07:29:26 EST
On 13.04.2018 14:20, Michal Hocko wrote:
> On Fri 13-04-18 14:06:40, Kirill Tkhai wrote:
>> On 13.04.2018 14:02, Michal Hocko wrote:
>>> On Fri 13-04-18 12:35:22, Kirill Tkhai wrote:
>>>> On 13.04.2018 11:55, Michal Hocko wrote:
>>>>> On Thu 12-04-18 17:52:04, Kirill Tkhai wrote:
>>>>> [...]
>>>>>> @@ -4471,6 +4477,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>>>>>>
>>>>>> return &memcg->css;
>>>>>> fail:
>>>>>> + mem_cgroup_id_remove(memcg);
>>>>>> mem_cgroup_free(memcg);
>>>>>> return ERR_PTR(-ENOMEM);
>>>>>> }
>>>>>
>>>>> The only path which jumps to fail: here (in the current mmotm tree) is
>>>>> error = memcg_online_kmem(memcg);
>>>>> if (error)
>>>>> goto fail;
>>>>>
>>>>> AFAICS and the only failure path in memcg_online_kmem
>>>>> memcg_id = memcg_alloc_cache_id();
>>>>> if (memcg_id < 0)
>>>>> return memcg_id;
>>>>>
>>>>> I am not entirely clear on memcg_alloc_cache_id but it seems we do clean
>>>>> up properly. Or am I missing something?
>>>>
>>>> memcg_alloc_cache_id() may allocate a lot of memory, in case of the system reached
>>>> memcg_nr_cache_ids cgroups. In this case it iterates over all LRU lists, and double
>>>> size of every of them. In case of memory pressure it can fail. If this occurs,
>>>> mem_cgroup::id is not unhashed from IDR and we leak this id.
>>>
>>> OK, my bad I was looking at the bad code path. So you want to clean up
>>> after mem_cgroup_alloc not memcg_online_kmem. Now it makes much more
>>> sense. Sorry for the confusion on my end.
>>>
>>> Anyway, shouldn't we do the thing in mem_cgroup_free() to be symmetric
>>> to mem_cgroup_alloc?
>>
>> We can't, since it's called from mem_cgroup_css_free(), which doesn't have a deal
>> with idr freeing. All the asymmetry, we see, is because of the trick to unhash ID
>> earlier, then from mem_cgroup_css_free().
>
> Are you sure. It's been some time since I've looked at the quite complex
> cgroup tear down code but from what I remember, css_free is called on
> the css release (aka when the reference count drops to zero). mem_cgroup_id_put_many
> seems to unpin the css reference so we should have idr_remove by the
> time when css_free is called. Or am I still wrong and should go over the
> brain hurting cgroup removal code again?
mem_cgroup_id_put_many() unpins css, but this may be not the last reference to the css.
Thus, we release ID earlier, then all references to css are freed.
You may look at the commit 73f576c04b94, and it describes the reason we do that earlier:
Author: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Wed Jul 20 15:44:57 2016 -0700
mm: memcontrol: fix cgroup creation failure after many small jobs
The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears. At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild. Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs. Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.
...
Kirill