Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

From: Mathias Krause
Date: Wed Nov 03 2021 - 05:51:56 EST


Hi!

On 02.11.21 17:02, Michal Koutný wrote:
> [snip]
> I have a reproducer (attached) that can hit this window quite easily
> after tuning. I can observe consequences of it even with a recent 5.15
> kernel. (And I also have reports from real world workloads failing due
> to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
> unthrottle").)

Thanks for the reproducer!

To provide yet another data point, Kevin (on Cc) is running into this
bug as well, very reliably, with a production workload, so we started
looking into it too. His crashes indicate a use-after-free of a cfs_rq
in update_blocked_averages(), much like you already diagnosed in your
initial patch description -- there are live cfs_rq's (on_list=1) left
in an about-to-be-kfree()'d task group in free_fair_sched_group().

His kernel config happens to generate a layout of struct sched_entity
that puts the 'my_q' member right in the middle of the object, which
makes it overlap with SLUB's freelist pointer. That, in combination
with SLAB_FREELIST_HARDENED's freelist pointer mangling, turns the
use-after-free into a reliable access violation in the form of a #GP,
which allowed us to make the UAF fail fast.
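
For reference, with CONFIG_SLAB_FREELIST_HARDENED the free pointer
stored inside the object gets mangled roughly like this -- paraphrased
from memory of mm/slub.c's freelist_ptr(), exact details differ between
kernel versions:

	return (void *)((unsigned long)ptr ^ s->random ^
			swab((unsigned long)ptr_addr));

So the stale 'my_q' load from the recycled object returns a scrambled,
almost certainly non-canonical pointer, and the dereference faults
immediately instead of silently reading stale data.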

As the real root cause cannot be seen from the crash backtrace alone,
we tested a debug patch (attached) which revealed that the actual
offender is tg_unthrottle_up(), called from the timer interrupt via
sched_cfs_period_timer() at an inconvenient time, namely while
unregister_fair_sched_group() is unlinking all cfs_rq's from the dying
task group. That code doesn't protect itself from getting interrupted,
so if the timer interrupt fires while we iterate over all CPUs, or
after unregister_fair_sched_group() has finished but before
sched_offline_group() has unlinked the task group,
sched_cfs_period_timer() will run, walk the tree of task groups and try
to unthrottle cfs_rq's, possibly re-adding them to the dying task
group. These will later -- in free_fair_sched_group() -- be kfree()'d
while still being linked, leading to the fireworks Kevin and you are
seeing.
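
Roughly, the sequence we believe we're seeing is the following (the
call chain is our reconstruction from the debug output plus code
reading, so take the exact unthrottle path with a grain of salt):

  sched_offline_group(tg)
    unregister_fair_sched_group(tg)
      list_del_leaf_cfs_rq(tg->cfs_rq[cpu])
          <timer irq>
            sched_cfs_period_timer()
              do_sched_cfs_period_timer()
                distribute_cfs_runtime()
                  unthrottle_cfs_rq()
                    walk_tg_tree_from(..., tg_unthrottle_up, ...)
                      list_add_leaf_cfs_rq()  <-- re-adds the dying tg's cfs_rq
    list_del_rcu(&tg->list)                   <-- too late, it's back on the list
  ...
  free_fair_sched_group(tg)
    kfree(tg->cfs_rq[cpu])                    <-- still linked, later UAF in
                                                  update_blocked_averages()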

We tried the below patch which, unfortunately, doesn't fix the issue. So
there must be something else. :(

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 978460f891a1..afee07e9faf9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;
 
-	/* End participation in shares distribution: */
-	unregister_fair_sched_group(tg);
-
+	/*
+	 * Unlink first, to keep walk_tg_tree_from() from finding us
+	 * (via sched_cfs_period_timer()).
+	 */
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
 	spin_unlock_irqrestore(&task_group_lock, flags);
+
+	/* End participation in shares distribution: */
+	unregister_fair_sched_group(tg);
 }
 
 static void sched_change_group(struct task_struct *tsk, int type)

> [snip]
>
> The possibilities for the current problem:
>
> 1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
> (Not exclusive with the other suggestions, rather a stop-gap for the
> time being.)
>
> 2) Don't add offlined task_groups into the undecayed list
> - Your proposal with overloaded on_list=2 could serve as mark of that,
> but it's a hack IMO.

> - Proper way (tm) would be to use css_tryget_online() and css_put() when
> dealing with the list (my favorite at the moment).

That might work, as -- at least in Kevin's case -- it all gets triggered
by a dying cgroup.
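
Just to make sure I read that idea correctly, it would be roughly
something like the sketch below, I guess? (Completely untested, return
value handling hand-waved, and the root task group and especially
autogroups would probably need special casing as their css isn't a
regular online cgroup css.)

	/* in list_add_leaf_cfs_rq(), before linking the cfs_rq: */
	if (!css_tryget_online(&cfs_rq->tg->css))
		return true;	/* tg is on its way out, don't re-add */
	cfs_rq->on_list = 1;

	/* ...and the matching css_put() in list_del_leaf_cfs_rq(): */
	cfs_rq->on_list = 0;
	css_put(&cfs_rq->tg->css);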

> 3) Narrowing the race-window dramatically
> - that is by moving list removal from unregister_fair_sched_group() to
> free_fair_sched_group(),

That might work, too. However, the unlinking would need protection
against the timer interrupt (and other sources?) which might try to
re-add entries. Or can that not happen any more at that point, because
at least one RCU grace period has passed? Anyhow, the kfree() calls
would likely need to become kfree_rcu() to cope with concurrent
traversal of the cfs_rq's.
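
I.e. roughly something like the following in free_fair_sched_group()
(untested, and it assumes we grow an rcu_head -- called 'rcu' here --
in struct cfs_rq and struct sched_entity so the two-argument
kfree_rcu() form can be used):

	for_each_possible_cpu(i) {
		if (tg->cfs_rq)
			kfree_rcu(tg->cfs_rq[i], rcu);
		if (tg->se)
			kfree_rcu(tg->se[i], rcu);
	}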

> - <del>or use list_empty(tg->list) as indicator whether we're working
> with onlined task_group.</del> (won't work for RCU list)
>
> 4) Rework how throttled load is handled (hand waving)
> - There is remove_entity_load_avg() that moves the load to parent upon
> final removal. Maybe it could be generalized for temporary removals by
> throttling (so that unthrottling could again add only non-empty
> cfs_rqs to the list and undecayed load won't skew fairness).
> - or the way of [1].
>
> 5) <your ideas>

It should be something that can be backported to stable kernels, as
this seems to affect v5.13, too.


Thanks,
Mathias
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 567c571d624f..94d1fe0aec5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -313,6 +313,7 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		 * the list, this means to put the child at the tail
 		 * of the list that starts by parent.
 		 */
+		WARN(cfs_rq->tg->parent->dead, "parent marked dead! (cfs_rq = %px, parent = %px)", cfs_rq, cfs_rq->tg->parent);
 		list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
 			&(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
 		/*
@@ -329,6 +330,7 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		 * cfs rq without parent should be put
 		 * at the tail of the list.
 		 */
+		WARN(cfs_rq->tg->dead, "tg marked dead! (cfs_rq = %px, tg = %px)", cfs_rq, cfs_rq->tg);
 		list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
 			&rq->leaf_cfs_rq_list);
 		/*
@@ -345,6 +347,7 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 	 * tmp_alone_branch points to the begin of the branch
 	 * where we will add parent.
 	 */
+	WARN(cfs_rq->tg->dead, "tg marked dead! (cfs_rq = %px, tg = %px)", cfs_rq, cfs_rq->tg);
 	list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
 	/*
 	 * update tmp_alone_branch to points to the new begin
@@ -3277,6 +3280,7 @@ static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
 		prev = rq->tmp_alone_branch;
 	}
 
+	BUG_ON(prev == LIST_POISON2);
 	prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
 
 	return (prev_cfs_rq->tg->parent == cfs_rq->tg);
@@ -11369,6 +11373,8 @@ void unregister_fair_sched_group(struct task_group *tg)
 	struct rq *rq;
 	int cpu;
 
+	WARN(tg->dead, "tg already dead! (tg = %px)", tg);
+	tg->dead = 1;
 	for_each_possible_cpu(cpu) {
 		if (tg->se[cpu])
 			remove_entity_load_avg(tg->se[cpu]);
@@ -11386,6 +11392,10 @@ void unregister_fair_sched_group(struct task_group *tg)
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
 		raw_spin_rq_unlock_irqrestore(rq, flags);
 	}
+
+	for_each_possible_cpu(cpu) {
+		WARN(tg->cfs_rq[cpu]->on_list, "found on_list cfs_rq! (cpu = %d, cfs_rq = %px, tg = %px)", cpu, tg->cfs_rq[cpu], tg);
+	}
 }
 
 void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eaca971a3ee2..218dead1f4db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -395,6 +395,7 @@ struct task_group {
 	/* runqueue "owned" by this group on each CPU */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+	unsigned char dead;
 
 #ifdef CONFIG_SMP
 	/*