Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's
From: Peter Zijlstra
Date: Fri Nov 05 2021 - 11:04:09 EST
TJ, long email below, but the TL;DR is: is it OK to do synchronize_rcu()
from a css_released callback? We can't use the css_offline callback due to
commit 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init"),
but we do need a second GP, as per the below.
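
For context, the callbacks in question look roughly like this (paraphrased
from kernel/sched/core.c, details may differ between versions); the
synchronize_rcu() in question would end up running from the css_released
path:

  static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
  {
          struct task_group *tg = css_tg(css);

          /* the proposed synchronize_rcu() would sit inside here */
          sched_offline_group(tg);
  }

  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
  {
          struct task_group *tg = css_tg(css);

          /* relies on the RCU GP between css_released() and css_free() */
          sched_free_group(tg);
  }
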
On Wed, Nov 03, 2021 at 08:06:13PM +0100, Mathias Krause wrote:
> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> in update_blocked_averages(). Initial debugging revealed that we have
> live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
> free_fair_sched_group(). However, it was unclear how that could happen.
>
> His kernel config happened to lead to a layout of struct sched_entity
> that put the 'my_q' member directly into the middle of the object,
> which makes it overlap with SLUB's freelist pointer. That, in
> combination with SLAB_FREELIST_HARDENED's freelist pointer mangling,
> leads to a reliable access violation in the form of a #GP, which made
> the UAF fail fast, e.g. like this:
>
> general protection fault, probably for non-canonical address 0xff80f68888f69107: 0000 [#1] SMP
> CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.14.13-custom+ #3
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
> RIP: 0010:[<ffffffff81194af8>] update_blocked_averages+0x428/0xb90
> Code: 28 01 00 4c 8b 4c 24 10 41 8b 97 c4 00 00 00 85 d2 75 32 4c 89 fe 4c 89 cf e8 74 2c 01 00 49 8b 96 80 00 00 00 48 85 d2 74 0e <48> 83 ba 08 01 00 00 00 0f 85 45 01 00 00 85 c0 0f 84 34 fe ff ff
> RSP: 0018:ffffc9000002bf08 EFLAGS: 00010086
> RAX: 0000000000000001 RBX: ffff8880202eba00 RCX: 000000000000b686
> RDX: ff80f68888f68fff RSI: 0000000000000000 RDI: 000000000000000c
> RBP: ffff888006808a00 R08: ffff888006808a00 R09: 0000000000000000
> R10: 0000000000000008 R11: 0000000000000000 R12: ffff8880202ebb40
> R13: 0000000000000000 R14: ffff8880087e7f00 R15: ffff888006808a00
> FS: 0000000000000000(0000) GS:ffff888237d40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000315b147b8000 CR3: 0000000002d70000 CR4: 00000000001606f0 shadow CR4: 00000000001606f0
> Stack:
> 0100008237d58b80 0000000000000286 000003ae017023d5 00000000000000f0
> ffff888237d5d240 0000000000000028 ffff888237d5c980 ffff888237d5c900
> ffff888237d5c900 0000000000000001 0000000000000100 0000000000000007
> Call Trace:
> <IRQ>
> [<ffffffff8119952a>] run_rebalance_domains+0x3a/0x60
> [<ffffffff810030bf>] __do_softirq+0xbf/0x1fb
> [<ffffffff81162e7f>] irq_exit_rcu+0x7f/0x90
> [<ffffffff822d0d7e>] sysvec_apic_timer_interrupt+0x6e/0x90
> </IRQ>
> [<ffffffff81001d82>] asm_sysvec_apic_timer_interrupt+0x12/0x20
> RIP: 0010:[<ffffffff822debe9>] acpi_idle_do_entry+0x49/0x50
> Code: 8b 15 2f b3 75 01 ed c3 0f 0b e9 52 fd ff ff 65 48 8b 04 25 40 1c 01 00 48 8b 00 a8 08 75 e8 eb 07 0f 00 2d d9 e2 1d 00 fb f4 <fa> c3 0f 0b cc cc cc 41 55 41 89 d5 41 54 49 89 f4 55 53 48 89 fb
> RSP: 0000:ffffc900000bbe68 EFLAGS: 00000246
> RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff888237d40000
> RDX: 0000000000000001 RSI: ffffffff82cdd4c0 RDI: ffff888100b7bc64
> RBP: ffff888101c07000 R08: ffff888100b7bc00 R09: 00000000000000ac
> R10: 0000000000000005 R11: ffff888237d5b824 R12: 0000000000000001
> R13: ffffffff82cdd4c0 R14: ffffffff82cdd540 R15: 0000000000000000
> [<ffffffff8118dab9>] ? sched_clock_cpu+0x9/0xa0
> [<ffffffff818d26f8>] acpi_idle_enter+0x48/0xb0
> [<ffffffff81ec123c>] cpuidle_enter_state+0x7c/0x2c0
> [<ffffffff81ec14b4>] cpuidle_enter+0x24/0x40
> [<ffffffff8118e5d4>] do_idle+0x1c4/0x210
> [<ffffffff8118e79e>] cpu_startup_entry+0x1e/0x20
> [<ffffffff810a8a4a>] start_secondary+0x11a/0x120
> [<ffffffff81000103>] secondary_startup_64_no_verify+0xae/0xbb
> ---[ end trace aac4ad8b95ba31e5 ]---
>
> Michal seems to have run into the same issue[1]. He already correctly
> diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
> cfs_rq's to list on unthrottle") creates the preconditions for the
> UAF by re-adding cfs_rq's even to task groups that have no running
> tasks left, i.e. also to dead ones. His analysis, however, misses the
> real root cause, which cannot be seen from the crash backtrace alone:
> the real offender is tg_unthrottle_up() getting called via
> sched_cfs_period_timer() from the timer interrupt at an inconvenient
> time.
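
(For readers following along: tg_unthrottle_up() is roughly the following,
paraphrased with the throttled-clock accounting elided; the re-add at the
bottom is the one introduced by a7b359fc6a37:)

  static int tg_unthrottle_up(struct task_group *tg, void *data)
  {
          struct rq *rq = data;
          struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

          cfs_rq->throttle_count--;
          if (!cfs_rq->throttle_count) {
                  /* ... throttled clock accounting elided ... */

                  /* re-add the cfs_rq -- also for a dead task group */
                  if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
                          list_add_leaf_cfs_rq(cfs_rq);
          }

          return 0;
  }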
>
> When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
> task group, it doesn't protect itself from getting interrupted. If the
> timer interrupt fires while we iterate over all CPUs, or after
> unregister_fair_sched_group() has finished but before the task group
> has been unlinked, sched_cfs_period_timer() will run and walk the list
> of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
> dying task group. These will later -- in free_fair_sched_group() -- be
> kfree()'d while still being linked, leading to the fireworks Kevin and
> Michal are seeing.
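
(The unlinking side looks roughly like this -- paraphrased, and the
locking helpers differ between kernel versions; note that each per-CPU
unlink is done under the rq lock, but the group as a whole stays
reachable via the task group tree the entire time:)

  void unregister_fair_sched_group(struct task_group *tg)
  {
          unsigned long flags;
          struct rq *rq;
          int cpu;

          for_each_possible_cpu(cpu) {
                  if (tg->se[cpu])
                          remove_entity_load_avg(tg->se[cpu]);

                  /* nothing linked on this CPU, nothing to unlink */
                  if (!tg->cfs_rq[cpu]->on_list)
                          continue;

                  rq = cpu_rq(cpu);

                  raw_spin_rq_lock_irqsave(rq, flags);
                  list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
                  raw_spin_rq_unlock_irqrestore(rq, flags);
          }
  }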
>
> To fix this race, ensure the dying task group gets unlinked first.
> However, simply switching the order of unregistering and unlinking the
> task group isn't sufficient, as concurrent RCU walkers might still see
> it, as the following interleaving shows:
>
>     CPU1:                                     CPU2:
>       :                                       timer IRQ:
>       :                                         do_sched_cfs_period_timer():
>       :                                           :
>       :                                           distribute_cfs_runtime():
>       :                                             rcu_read_lock();
>       :                                             :
>       :                                             unthrottle_cfs_rq():
>     sched_offline_group():                            :
>       :                                               walk_tg_tree_from(…,tg_unthrottle_up,…):
>       list_del_rcu(&tg->list);                          :
> (1)   :                                                 list_for_each_entry_rcu(child, &parent->children, siblings)
>       :                                                   :
> (2)   list_del_rcu(&tg->siblings);                        :
>       :                                                   tg_unthrottle_up():
>     unregister_fair_sched_group():                          struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>       :                                                       :
>       list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);                  :
>       :                                                       :
>       :                                                       if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> (3)   :                                                         list_add_leaf_cfs_rq(cfs_rq);
>       :                                                 :
>       :                                                 :
>       :                                                 :
>       :                                                 :
>       :                                                 :
> (4)   :                                             rcu_read_unlock();
>
> CPU 2 walks the task group list in parallel to sched_offline_group();
> specifically, it will read the soon-to-be-unlinked task group entry at
> (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
> still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
> all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group().
> Meanwhile, CPU 2 re-adds some of these at (3), which is the cause of
> the UAF later on.
>
> To prevent this additional race from happening, we need to wait until
> walk_tg_tree_from() has finished traversing the task groups, i.e. until
> the RCU read-side critical section ends at (4). Afterwards it's safe to
> call unregister_fair_sched_group(), as any new walk won't see the dying
> task group any more.
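
(This is the canonical RCU removal sequence -- unpublish, wait for
pre-existing readers, then tear down -- which maps onto the hunk below
roughly as follows:)

  spin_lock_irqsave(&task_group_lock, flags);
  list_del_rcu(&tg->list);                /* 1) unpublish the group */
  list_del_rcu(&tg->siblings);
  spin_unlock_irqrestore(&task_group_lock, flags);

  synchronize_rcu();                      /* 2) wait for readers, i.e. until (4) */

  unregister_fair_sched_group(tg);        /* 3) nothing can re-add our cfs_rq's now */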
>
> Using synchronize_rcu() might be seen as too heavy a hammer to nail
> this problem. However, the overall teardown sequence (e.g., as
> documented in css_free_rwork_fn()) already relies on quite a few
> assumptions regarding execution context and RCU grace periods having
> passed. Given the autogroup code, which calls sched_destroy_group()
> directly after sched_offline_group(), and the apparent need to have at
> least one RCU grace period expire after unlinking the task group and
> before calling unregister_fair_sched_group(), there seems to be no
> better alternative. Calling unregister_fair_sched_group() via
> call_rcu() would only lead to trouble in sched_offline_group(), which
> also relies on (yet another) expired RCU grace period.
>
> This patch survives Michal's reproducer[2] for 8h+ now, whereas the
> crash used to trigger within minutes before.
>
> [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@xxxxxxxx/
> [2] https://lore.kernel.org/lkml/20211102160228.GA57072@xxxxxxxxxxxxxxxxx/
>
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> Cc: Odin Ugedal <odin@xxxxxxx>
> Cc: Michal Koutný <mkoutny@xxxxxxxx>
> Reported-by: Kevin Tanguy <kevin.tanguy@xxxxxxxxxxxx>
> Suggested-by: Brad Spengler <spender@xxxxxxxxxxxxxx>
> Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx>
> ---
> kernel/sched/core.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 978460f891a1..60125a6c9d1b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> - unregister_fair_sched_group(tg);
> -
> + /*
> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> + * sched_cfs_period_timer()).
> + */
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> spin_unlock_irqrestore(&task_group_lock, flags);
> +
> + /*
> + * Wait for all pending users of this task group to leave their RCU
> + * critical section to ensure no new user will see our dying task
> + * group any more. Specifically ensure that tg_unthrottle_up() won't
> + * add decayed cfs_rq's to it.
> + */
> + synchronize_rcu();
> +
> + /* End participation in shares distribution: */
> + unregister_fair_sched_group(tg);
> }
>
> static void sched_change_group(struct task_struct *tsk, int type)
> --
> 2.30.2
>