Re: [BUG] sched: leaf_cfs_rq_list use after free

From: Tejun Heo
Date: Wed Mar 16 2016 - 10:24:21 EST

Hello, Peter.

Sorry about the delay.

On Mon, Mar 14, 2016 at 01:09:03PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 14, 2016 at 12:20:57PM +0100, Peter Zijlstra wrote:
> > So I would suggest TJ to revert that patch and queue it for stable.
> >
> > It it clearly borken, because cgroup_exit() is called from preemptible
> > context, so _obviously_ we can (and clearly will) schedule after that,
> > which is somewhat hard if the cgroup we're supposedly belonging to has
> > been torn to shreds in the meantime.
> And I think it might be fundamentally broken, because it leaves ->css
> set to whatever cgroup we had, while simultaneously allowing that css to
> go away.

So, the problem here is that cpu is using css_offline to tear down the
css. perf shouldn't have the same problem as it destroys its css from
css_free. The distinctions among different callbacks evolved over
time and we need to update the documentation but here are the rules.


This one is obvious. Alloc and init. The css will become
visible during css iteration sometime between alloc and


The css is now guaranteed to be visible for css_for_each*()
iterations. This distinction exists because some controllers
need to propagate state changes downwards requiring a new css
to become visible before it inherits the current state from
the parent. Conversely, there's no reason to use this
callback if there's no such requirement.

Ex: Freezer which propagates the target state downwards and
needs a new child to inherit the current state while


The css is going down and draining usages by refusing new
css_tryget_online()'s but still guaranteed to be visible
during css iterations. Controllers which explicitly needs to
put, say, cache references or need to perform cleanup
operations while the css is iteratable need to use this
method; otherwise, no reason to bother with it.

Ex: blkcg pins csses as part of lookup cache which can prevent
a css from being drained and released, so blkcg uses this call
back to disable caching for the css.


The css is drained and not visible during iteration and will
be freed after a RCU grace period. Used by controllers which
cache pointers to csses being drained.

Ex: memcg needs to iterate csses which are being drained and
its custom iterator implementation caches RCU protected
pointers to csses. memcg uses this callback to shoot down
those cached pointers.


The css is drained, not visible to iterations, not used by
anyone, and will be freed immediately.

So, controllers which don't have persistent states or synchronization
requirements around state propagation have no reason to bother with
all the rest. css_alloc() and css_free() are the right callbacks to
init and tear down csses. If a controller has specific needs to
propagate states, only the part of operations which are affected
should be in the respective specialized init/exit methods.

> This means that anything trying to use this pointer; and there's quite a
> lot of that; is in for a nasty surprise.
> So you really need to change the ->css, either when the task starts
> dying (like it used to), or otherwise right before the cgroup goes
> offline.

So, the rule is that the css should stay serviceable until everyone
depending on it is gone.

> The argument used was that people want to see resources consumed by
> Zombies, which is all fine and dandy, but when you destroy the cgroup
> you cannot see that anyway.
> So something needs to fundamentally ensure that ->css changes before we
> go offline the thing.

I could be mistaken but AFAICS there doesn't seem to be anything
requiring bothering with the more specialized exit methods. Given
that no css iteration is used and everything is lock protected, the
patch at the end of this messages should do and seems to work fine
here. Am I missing something?


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f5abc6..98d019d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8332,14 +8332,8 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
struct task_group *tg = css_tg(css);

- sched_destroy_group(tg);
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
- struct task_group *tg = css_tg(css);
+ sched_destroy_group(tg);

static void cpu_cgroup_fork(struct task_struct *task)
@@ -8701,7 +8695,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_free = cpu_cgroup_css_free,
.css_online = cpu_cgroup_css_online,
- .css_offline = cpu_cgroup_css_offline,
.fork = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,