Re: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg

From: Balbir Singh
Date: Fri May 04 2018 - 00:59:33 EST


On Fri, May 4, 2018 at 1:11 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> Balbir Singh <bsingharora@xxxxxxxxx> writes:
>
>> On Tue, 01 May 2018 12:35:16 -0500
>> ebiederm@xxxxxxxxxxxx (Eric W. Biederman) wrote:
>>
>>> Recently it was reported that mm_update_next_owner could get into
>>> cases where it was executing its fallback for_each_process part of
>>> the loop and thus taking up a lot of time.
>>>
>>> To deal with this, replace mm->owner with mm->memcg. This just reduces
>>> the complexity of everything. As much as possible I have maintained
>>> the current semantics. There are two significant exceptions. During
>>> fork the memcg of the process calling fork is charged rather than
>>> init_css_set. During memory cgroup migration the charges are migrated
>>> not if the process is the owner of the mm, but if the process being
>>> migrated has the same memory cgroup as the mm.
>>>
>>> I believe it was a bug if init_css_set is charged for memory activity
>>> during fork, and the old behavior was simply a consequence of the new
>>> task not having tsk->cgroup initialized to its proper cgroup.
>>
>> That does sound like a bug, I guess we've not seen it because we did
>> not track any slab allocations initially.
>
>
>>> During cgroup migration only thread group leaders are allowed to
>>> migrate, which means in practice there should only be one. Linux
>>> tasks created with CLONE_VM are the only exception, but the common
>>> cases are already ruled out. Processes created with vfork have a
>>> suspended parent and can do nothing but call exec so they should never
>>> show up. Threads of the same cgroup are not the thread group leader,
>>> so they also should not show up. That leaves the old LinuxThreads library
>>> which is probably out of use by now, and someone doing something very
>>> creative with cgroups, and rolling their own threads with CLONE_VM.
>>> So in practice I don't think the difference in charge migration will
>>> affect anyone.
>>>
>>> To ensure that mm->memcg is updated appropriately I have implemented
>>> cgroup "attach" and "fork" methods. This ensures that at those
>>> points the mm pointed to by the task has the appropriate memory cgroup.
>>>
>>> For simplicity, instead of introducing a new mm lock, I simply use an
>>> exchange on the pointer where mm->memcg is updated to get
>>> atomic updates.
>>>
>>> Looking at the history, this change is effectively a revert. The
>>> reason given for adding mm->owner is so that multiple cgroups can be
>>> attached to the same mm. In the last 8 years a second user of
>>> mm->owner has not appeared. A feature that has never been used, makes
>>> the code more complicated, and has horrible worst-case performance
>>> should go.
>>
>> The idea was to track the mm to the right cgroup; we did find that
>> the mm could be confused as belonging to two cgroups. tsk->cgroup is
>> not sufficient, and if/when the tgid left, we needed an owner to track
>> where the current allocations were. But this is from 8-year-old history;
>> I don't have my notes anymore :)
>
> I was referring to the change 8ish years ago where mm->memcg was
> replaced with mm->owner. Semantically the two concepts should both
> be perfectly capable of resolving which cgroup the mm belongs to.
>

Yep, agreed.
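
Both can resolve it, but mm->memcg is the shorter path: today the lookup
has to bounce through the owner task under RCU, while with your change it
becomes a plain dereference of the mm. Roughly, as an illustrative
contrast only (not quoting the patch; the helper names are made up, and
the NULL-mm and css_tryget/refcounting handling is elided):

        /* Roughly how mainline resolves the memcg today, via the owner task. */
        static struct mem_cgroup *memcg_from_mm_via_owner(struct mm_struct *mm)
        {
                struct mem_cgroup *memcg;

                rcu_read_lock();
                memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
                if (!memcg)
                        memcg = root_mem_cgroup;
                rcu_read_unlock();
                return memcg;
        }

        /* With mm->memcg the task indirection goes away entirely. */
        static struct mem_cgroup *memcg_from_mm_direct(struct mm_struct *mm)
        {
                struct mem_cgroup *memcg = READ_ONCE(mm->memcg);

                return memcg ? memcg : root_mem_cgroup;
        }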

>>> +static void mem_cgroup_attach(struct cgroup_taskset *tset)
>>> +{
>>> +        struct cgroup_subsys_state *css;
>>> +        struct task_struct *tsk;
>>> +
>>> +        cgroup_taskset_for_each(tsk, css, tset) {
>>> +                struct mem_cgroup *new = mem_cgroup_from_css(css);
>>> +                css_get(css);
>>> +                task_update_memcg(tsk, new);
>>
>> I'd have to go back and check, and I think your comment refers to this,
>> but we don't expect non-tgid tasks to show up here? My concern is I can't
>> find the guarantee that task_update_memcg(tsk, new) is not
>>
>> 1. Duplicated for each thread in the process or attached to the mm
>> 2. Updating mm->memcg to point to different places, so that the one
>> that sticks is the one that updated things last.
>
> For cgroupv2, which only operates on processes, we have such a guarantee.
>
> There is no such guarantee for cgroupv1. But it would take someone
> being crazy to try this.
>
> We can add a guarantee to can_attach that we move all of the threads in
> a process, and we probably should. However having mm->memcg is more
> important structurally than what crazy we let in. So let's make this
> change first as safely as we can, and then we don't lose important
> data structure simplifications if it turns out we have to revert a change
> to make the memory cgroup per-process in cgroupv1.
>
>
> There are some serious issues with the interactions between the memory
> control group and the concept of thread group leader that show up when
> you consider a zombie thread group leader that has called cgroup_exit.
> So I am not anxious to stir in concepts like thread_group_leader into
> new code unless there is a very good reason.
>
> We don't expect crazy but the code allows it, and I have not changed
> that.
>

OK, let's stick with this.
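
For the archives, my mental model of the exchange-based update is roughly
the below. This is an illustrative sketch only, not your actual hunk; the
function name is made up, and locking around tsk->mm plus error handling
are elided:

        static void task_update_memcg_sketch(struct task_struct *tsk,
                                             struct mem_cgroup *new)
        {
                struct mm_struct *mm = tsk->mm;
                struct mem_cgroup *old;

                if (!mm)
                        return;
                /*
                 * xchg() hands back the previous memcg atomically, so even
                 * if two attaches race, the last writer wins and each old
                 * reference is dropped exactly once by whoever swapped it
                 * out.  The caller (e.g. the attach path quoted above)
                 * takes the reference on the new css before installing it.
                 */
                old = xchg(&mm->memcg, new);
                if (old)
                        css_put(&old->css);
        }

That matches your point: racing attaches may leave the last writer's memcg
in place, but nothing leaks and the data structure stays simple.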

Acked-by: Balbir Singh <bsingharora@xxxxxxxxx>

Balbir Singh