Re: [PATCH bpf-next] bpf: fix cgroup bpf release synchronization
From: Alexei Starovoitov
Date: Tue Jun 25 2019 - 11:51:52 EST
On 6/23/19 9:02 PM, Roman Gushchin wrote:
> On Sun, Jun 23, 2019 at 08:29:21PM -0700, Alexei Starovoitov wrote:
>> On 6/23/19 7:30 PM, Roman Gushchin wrote:
>>> Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf
>>> from cgroup itself"), cgroup_bpf release occurs asynchronously
>>> (from a worker context), and before the release of the cgroup itself.
>>>
>>> This introduced a previously non-existing race between the release
>>> and update paths. E.g. if a leaf's cgroup_bpf is released and a new
>>> bpf program is attached to the one of ancestor cgroups at the same
>>> time. The race may result in double-free and other memory corruptions.
>>>
>>> To fix the problem, let's protect the body of cgroup_bpf_release()
>>> with cgroup_mutex, as it was effectively previously, when all this
>>> code was called from the cgroup release path with cgroup mutex held.
>>>
>>> Also make sure, that we don't leave already freed pointers to the
>>> effective prog arrays. Otherwise, they can be released again by
>>> the update path. It wasn't necessary before, because previously
>>> the update path couldn't see such a cgroup, as cgroup_bpf and cgroup
>>> itself were released together.
>>
>> I thought dying cgroup won't have any children cgroups ?
>
> It's not completely true, a dying cgroup can't have living children.
>
>> It should have been empty with no tasks inside it?
>
> Right.
>
>> Only some resources are still held?
>
> Right.
>
>> mutex and zero init are highly suspicious.
>> It feels that cgroup_bpf_release is called too early.
>
> An alternative solution is to bump the refcounter on
> every update path, and explicitly skip de-bpf'ed cgroups.
>
>>
>> Thinking from another angle... if child cgroups can still attach then
>> this bpf_release is broken.
>
> Hm, what do you mean under attach? It's not possible to attach
> a new prog, but if a prog is attached to a parent cgroup,
> a pointer can spill through "effective" array.
>
> But I agree, it's broken. Update path should ignore such
> cgroups (cgroups, which cgroup_bpf was released). I'll take a look.
>
>> The code should be
>> calling __cgroup_bpf_detach() one by one to make sure
>> update_effective_progs() is called, since descendant are still
>> sort-of alive and can attach?
>
> Not sure I get you. Dying cgroup is a leaf cgroup.
>
>>
>> My money is on 'too early'.
>> May be cgroup is not dying ?
>> Just cgroup_sk_free() is called on the last socket and
>> this auto-detach logic got triggered incorrectly?
>
> So, once again, what's my picture:
>
> A
> A/B
> A/B/C
>
> cpu1: cpu2:
> rmdir C attach new prog to A
> C got dying update A, update B, update C...
> C's cgroup_bpf is released C's effective progs is replaced with new one
> old is double freed
>
> It looks like it can be reproduced without any sockets.
I see.
Does it mean that css_for_each_descendant walks dying cgroups ?
I guess the fix then is to avoid walking them in update_effective_progs ?