Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's
From: Vincent Guittot
Date: Wed Nov 10 2021 - 10:14:22 EST
On Mon, 8 Nov 2021 at 16:06, Mathias Krause <minipli@xxxxxxxxxxxxxx> wrote:
>
> Am 08.11.21 um 12:40 schrieb Peter Zijlstra:
> > On Mon, Nov 08, 2021 at 11:27:57AM +0100, Mathias Krause wrote:
> >
> >> The timers need to be destroyed prior to unregister_fair_sched_group()
> >> via destroy_cfs_bandwidth(tg_cfs_bandwidth(tg)), i.e. move it from
> >> free_fair_sched_group() to here, as I did in my patch. Otherwise the tg
> >> might still be messed with and we don't want that.
> >
> > Oh, duh, yes.
>
> Well, still slightly wrong: "prior to unregister_fair_sched_group()"
> means calling destroy_cfs_bandwidth() before not after ;)
>
> > For consistency's sake, I've also added an unregister_*
> > for the rt class, also destroying the bandwidth timer.
>
> Looks good to me. Not strictly needed by the code as of now, but
> shouldn't hurt either, to defer the final kfree() to the next RCU GP.
>
> >> Beside that, looks good to me. Will you create a new proper patch or
> >> should I do it?
> >
> > Something like so good?
> >
> > (I stripped the #PF splat, because I don't think it adds anything not
> > covered by the text).
>
> Well, me, personally, always searches for parts of a Oops splat first.
> It sometimes finds related discussions or, even better, commits fixing
> an issue. So I prefer keeping it. But, in this case, it should find this
> Email thread and, in turn, this patch. So I'm fine with dropping it.
>
> > ---
> > Subject: sched/fair: Prevent dead task groups from regaining cfs_rq's
> > From: Mathias Krause <minipli@xxxxxxxxxxxxxx>
> > Date: Wed, 3 Nov 2021 20:06:13 +0100
> >
> > From: Mathias Krause <minipli@xxxxxxxxxxxxxx>
> >
> > Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> > in update_blocked_averages(). Initial debugging revealed that we've
> > live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> > free_fair_sched_group(). However, it was unclear how that can happen.
> >
> > His kernel config happened to lead to a layout of struct sched_entity
> > that put the 'my_q' member directly into the middle of the object
> > which makes it incidentally overlap with SLUB's freelist pointer.
> > That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
> > mangling, leads to a reliable access violation in form of a #GP which
> > made the UAF fail fast.
> >
> > Michal seems to have run into the same issue[1]. He already correctly
> > diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
> > cfs_rq's to list on unthrottle") is causing the preconditions for the
> > UAF to happen by re-adding cfs_rq's also to task groups that have no
> > more running tasks, i.e. also to dead ones. His analysis, however,
> > misses the real root cause and it cannot be seen from the crash
> > backtrace only, as the real offender is tg_unthrottle_up() getting
> > called via sched_cfs_period_timer() via the timer interrupt at an
> > inconvenient time.
> >
> > When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
> > task group, it doesn't protect itself from getting interrupted. If the
> > timer interrupt triggers while we iterate over all CPUs or after
> > unregister_fair_sched_group() has finished but prior to unlinking the
> > task group, sched_cfs_period_timer() will execute and walk the list of
> > task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
> > dying task group. These will later -- in free_fair_sched_group() -- be
> > kfree()'ed while still being linked, leading to the fireworks Kevin
> > and Michal are seeing.
> >
> > To fix this race, ensure the dying task group gets unlinked first.
> > However, simply switching the order of unregistering and unlinking the
> > task group isn't sufficient, as concurrent RCU walkers might still see
> > it, as can be seen below:
> >
> > CPU1: CPU2:
> > : timer IRQ:
> > : do_sched_cfs_period_timer():
> > : :
> > : distribute_cfs_runtime():
> > : rcu_read_lock();
> > : :
> > : unthrottle_cfs_rq():
> > sched_offline_group(): :
> > : walk_tg_tree_from(…,tg_unthrottle_up,…):
> > list_del_rcu(&tg->list); :
> > (1) : list_for_each_entry_rcu(child, &parent->children, siblings)
> > : :
> > (2) list_del_rcu(&tg->siblings); :
> > : tg_unthrottle_up():
> > unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > : :
> > list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); :
> > : :
> > : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> > (3) : list_add_leaf_cfs_rq(cfs_rq);
> > : :
> > : :
> > : :
> > : :
> > : :
> > (4) : rcu_read_unlock();
> >
> > CPU 2 walks the task group list in parallel to sched_offline_group(),
> > specifically, it'll read the soon to be unlinked task group entry at
> > (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
> > still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
> > all cfs_rq's via list_del_leaf_cfs_rq() in
> > unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
> > these at (3), which is the cause of the UAF later on.
> >
> > To prevent this additional race from happening, we need to wait until
> > walk_tg_tree_from() has finished traversing the task groups, i.e.
> > after the RCU read critical section ends in (4). Afterwards we're safe
> > to call unregister_fair_sched_group(), as each new walk won't see the
> > dying task group any more.
> Replace the following paragraph, which is outdated by now,...:
>
> > Using synchronize_rcu() might be seen as a too heavy hammer to nail
> > this problem. However, the overall tear down sequence (e.g., as
> > documented in css_free_rwork_fn()) already relies on quite a few
> > assumptions regarding execution context and RCU grace periods from
> > passing. Looking at the autogroup code, which calls
> > sched_destroy_group() directly after sched_offline_group() and the
> > apparent need to have at least one RCU grace period expire after
> > unlinking the task group, prior to calling
> > unregister_fair_sched_group(), there seems to be no better
> > alternative. Calling unregister_fair_sched_group() via call_rcu()
> > will only lead to trouble in sched_offline_group() which also relies
> > on (yet another) expired RCU grace period.
>
> ...with something like this (already mentioned in the code, btw):
>
> On top of that, we need to wait yet another RCU grace period after
> unregister_fair_sched_group() to ensure print_cfs_stats(), which might
> run concurrently, always sees valid objects, i.e. not already free'd ones.
>
> >
> > This patch survives Michal's reproducer[2] for 8h+ now, which used to
> > trigger within minutes before.
> >
> > [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@xxxxxxxx/
> > [2] https://lore.kernel.org/lkml/20211102160228.GA57072@xxxxxxxxxxxxxxxxx/
> >
> > Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> > Reported-by: Kevin Tanguy <kevin.tanguy@xxxxxxxxxxxx>
> > Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx>
> > [peterz: shuffle code around a bit]
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> > ---
> > kernel/sched/autogroup.c | 2 +-
> > kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++++----------
> > kernel/sched/fair.c | 4 ++--
> > kernel/sched/rt.c | 12 +++++++++---
> > kernel/sched/sched.h | 3 ++-
> > 5 files changed, 50 insertions(+), 17 deletions(-)
> >
> > --- a/kernel/sched/autogroup.c
> > +++ b/kernel/sched/autogroup.c
> > @@ -31,7 +31,7 @@ static inline void autogroup_destroy(str
> > ag->tg->rt_se = NULL;
> > ag->tg->rt_rq = NULL;
> > #endif
> > - sched_offline_group(ag->tg);
> > + sched_release_group(ag->tg);
> > sched_destroy_group(ag->tg);
> > }
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9716,6 +9716,22 @@ static void sched_free_group(struct task
> > kmem_cache_free(task_group_cache, tg);
> > }
> >
> > +static void sched_free_group_rcu(struct rcu_head *rcu)
> > +{
> > + sched_free_group(container_of(rcu, struct task_group, rcu));
> > +}
> > +
> > +static void sched_unregister_group(struct task_group *tg)
> > +{
> > + unregister_fair_sched_group(tg);
> > + unregister_rt_sched_group(tg);
> > + /*
> > + * We have to wait for yet another RCU grace period to expire, as
> > + * print_cfs_stats() might run concurrently.
> > + */
> > + call_rcu(&tg->rcu, sched_free_group_rcu);
> > +}
> > +
> > /* allocate runqueue etc for a new task group */
> > struct task_group *sched_create_group(struct task_group *parent)
> > {
> > @@ -9736,7 +9752,7 @@ struct task_group *sched_create_group(st
> > return tg;
> >
> > err:
> > - sched_free_group(tg);
> > + sched_unregister_group(tg);
>
> This can stay sched_free_group() as neither have the bandwidth timers
> been started yet, nor was this tg made visible outside of this function.
> So omitting the calls to destroy_{cfs,rt}_bandwidth() isn't a problem --
> timers aren't running yet.
>
> > return ERR_PTR(-ENOMEM);
> > }
> >
> > @@ -9759,25 +9775,35 @@ void sched_online_group(struct task_grou
> > }
> >
> > /* rcu callback to free various structures associated with a task group */
> > -static void sched_free_group_rcu(struct rcu_head *rhp)
> > +static void sched_unregister_group_rcu(struct rcu_head *rhp)
> > {
> > /* Now it should be safe to free those cfs_rqs: */
> > - sched_free_group(container_of(rhp, struct task_group, rcu));
> > + sched_unregister_group(container_of(rhp, struct task_group, rcu));
> > }
> >
> > void sched_destroy_group(struct task_group *tg)
> > {
> > /* Wait for possible concurrent references to cfs_rqs complete: */
> > - call_rcu(&tg->rcu, sched_free_group_rcu);
> > + call_rcu(&tg->rcu, sched_unregister_group_rcu);
> > }
> >
> > -void sched_offline_group(struct task_group *tg)
> > +void sched_release_group(struct task_group *tg)
> > {
> > unsigned long flags;
> >
> > - /* End participation in shares distribution: */
> > - unregister_fair_sched_group(tg);
> > -
> > + /*
> > + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> > + * sched_cfs_period_timer()).
> > + *
> > + * For this to be effective, we have to wait for all pending users of
> > + * this task group to leave their RCU critical section to ensure no new
> > + * user will see our dying task group any more. Specifically ensure
> > + * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
> > + *
> > + * We therefore defer calling unregister_fair_sched_group() to
> > + * sched_unregister_group() which is guarantied to get called only after the
> > + * current RCU grace period has expired.
> > + */
> > spin_lock_irqsave(&task_group_lock, flags);
> > list_del_rcu(&tg->list);
> > list_del_rcu(&tg->siblings);
> > @@ -9896,7 +9922,7 @@ static void cpu_cgroup_css_released(stru
> > {
> > struct task_group *tg = css_tg(css);
> >
> > - sched_offline_group(tg);
> > + sched_release_group(tg);
> > }
> >
> > static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> > @@ -9906,7 +9932,7 @@ static void cpu_cgroup_css_free(struct c
> > /*
> > * Relies on the RCU grace period between css_released() and this.
> > */
> > - sched_free_group(tg);
> > + sched_unregister_group(tg);
> > }
> >
> > /*
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_g
> > {
> > int i;
> >
> > - destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> > -
> > for_each_possible_cpu(i) {
> > if (tg->cfs_rq)
> > kfree(tg->cfs_rq[i]);
> > @@ -11551,6 +11549,8 @@ void unregister_fair_sched_group(struct
> > list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
> > raw_spin_rq_unlock_irqrestore(rq, flags);
> > }
>
> > +
> > + destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
>
> Move that hunk to the beginning of unregister_fair_sched_group() and
> we're good.
With Mathias comments, your proposal looks ok for me as well. I don't
have any reproducer so it's hard to test it
>
> > }
> >
> > void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(str
> > return rt_rq->rq;
> > }
> >
> > -void free_rt_sched_group(struct task_group *tg)
> > +void unregister_rt_sched_group(struct task_group *tg)
> > {
> > - int i;
> > -
> > if (tg->rt_se)
> > destroy_rt_bandwidth(&tg->rt_bandwidth);
> >
> > +}
> > +
> > +void free_rt_sched_group(struct task_group *tg)
> > +{
> > + int i;
> > +
> > for_each_possible_cpu(i) {
> > if (tg->rt_rq)
> > kfree(tg->rt_rq[i]);
> > @@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(
> > return &rq->rt;
> > }
> >
> > +void unregister_rt_sched_group(struct task_group *tg) { }
> > +
> > void free_rt_sched_group(struct task_group *tg) { }
> >
> > int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runti
> > extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> > extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
> >
> > +extern void unregister_rt_sched_group(struct task_group *tg);
> > extern void free_rt_sched_group(struct task_group *tg);
> > extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
> > extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> > @@ -503,7 +504,7 @@ extern struct task_group *sched_create_g
> > extern void sched_online_group(struct task_group *tg,
> > struct task_group *parent);
> > extern void sched_destroy_group(struct task_group *tg);
> > -extern void sched_offline_group(struct task_group *tg);
> > +extern void sched_release_group(struct task_group *tg);
> >
> > extern void sched_move_task(struct task_struct *tsk);
> >
> >
>
> Thanks,
> Mathias