Re: [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs

From: Eugene Syromiatnikov
Date: Thu Aug 19 2021 - 12:19:38 EST


On Thu, Aug 19, 2021 at 01:09:17PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 18, 2021 at 01:17:34AM +0200, Eugene Syromiatnikov wrote:
> > On Tue, Aug 17, 2021 at 05:52:43PM +0200, Peter Zijlstra wrote:
> > > Urgh... lemme guess, your HP BIOS is funny and reports more possible
> > > CPUs than you actually have resulting in cpu_possible_mask !=
> > > cpu_online_mask. Alternatively, you booted with nr_cpus= or something
> > > daft like that.
> >
> > Yep, it seems to be the case:
> >
> > # cat /sys/devices/system/cpu/possible
> > 0-7
> > # cat /sys/devices/system/cpu/online
> > 0-3
> >
>
> I think the below should work... can you please verify?

Yes, it no longer oopses, thank you!

# cat /sys/devices/system/cpu/possible
0-7
# cat /sys/devices/system/cpu/online
0-3
# ./prctl-sched-core-oops-repro
Iteration 0 status: 0
Iteration 1 status: 0
# ../src/strace -fvq -eprctl,clone,setsid -esignal=none ./prctl-sched-core-oops-repro
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f510b1c7890) = 108328
[pid 108328] setsid() = 108328
[pid 108328] +++ exited with 0 +++
Iteration 0 status: 0
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 108324, 0x2 /* PIDTYPE_PGID */, NULL) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f510b1c7890) = 108329
[pid 108329] setsid() = 108329
[pid 108329] +++ exited with 0 +++
Iteration 1 status: 0
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 108324, 0x2 /* PIDTYPE_PGID */, NULL) = 0
+++ exited with 0 +++

> ---
> Subject: sched: Fix Core-wide rq->lock for uninitialized CPUs
>
> Eugene tripped over the case where rq_lock(), as called in a
> for_each_possible_cpu() loop, came apart because rq->core hadn't been
> set up yet.
>
> This is a somewhat unusual, but valid case.
>
> Rework things such that rq->core is initialized to point at itself. IOW
> initialize each CPU as a single threaded Core. CPU online will then join
> the new CPU (thread) to an existing Core where needed.
>
> For completeness' sake, have CPU offline fully undo the state so as to
> not presume the topology will match the next time it comes online.
>
> Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock")
> Reported-by: Eugene Syromiatnikov <esyr@xxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>

Tested-by: Eugene Syromiatnikov <esyr@xxxxxxxxxx>
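
For readers following along, the idea in the quoted changelog can be
illustrated with a small userspace model. This is only a sketch, not the
kernel code: every toy_* name is made up for illustration, the "sibling
is cpu ^ 1" topology is an assumption, and the lock is a plain counter
rather than a real spinlock. It shows each possible CPU starting out as
its own single-threaded core, so that resolving the lock through
rq->core is safe even for CPUs that never came online.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_POSSIBLE 8   /* mirrors "possible: 0-7" from the report */
#define NR_ONLINE   4   /* mirrors "online: 0-3" from the report */

/* Hypothetical stand-in for a spinlock: just counts holders. */
struct toy_lock { int held; };

/* Hypothetical, heavily reduced stand-in for struct rq. */
struct toy_rq {
    int cpu;
    bool online;
    struct toy_lock lock;
    struct toy_rq *core;    /* whose lock is the core-wide lock */
};

static struct toy_rq runqueues[NR_POSSIBLE];

/* Init: every possible CPU is its own single-threaded core. */
static void toy_sched_init(void)
{
    for (int cpu = 0; cpu < NR_POSSIBLE; cpu++) {
        struct toy_rq *rq = &runqueues[cpu];

        rq->cpu = cpu;
        rq->online = false;
        rq->lock.held = 0;
        rq->core = rq;      /* never NULL, even if never onlined */
    }
}

/* Online: join an already-online sibling's core (assumed: cpu ^ 1). */
static void toy_cpu_online(int cpu)
{
    struct toy_rq *rq = &runqueues[cpu];
    struct toy_rq *sibling = &runqueues[cpu ^ 1];

    rq->online = true;
    if (sibling->online)
        rq->core = sibling->core;
}

/* Offline: fully undo the state, presuming nothing about next time. */
static void toy_cpu_offline(int cpu)
{
    struct toy_rq *rq = &runqueues[cpu];
    struct toy_rq *sibling = &runqueues[cpu ^ 1];

    rq->online = false;
    if (sibling->core == rq)    /* we were the leader: release sibling */
        sibling->core = sibling;
    rq->core = rq;
}

/* The lock to take for a runqueue is always reached via rq->core. */
static struct toy_lock *toy_rq_lockp(struct toy_rq *rq)
{
    return &rq->core->lock;
}

/* Walk all possible CPUs, as a for_each_possible_cpu() loop would. */
static int toy_visit_all_possible(void)
{
    int visited = 0;

    for (int cpu = 0; cpu < NR_POSSIBLE; cpu++) {
        struct toy_lock *l = toy_rq_lockp(&runqueues[cpu]);

        l->held++;          /* "lock" */
        l->held--;          /* "unlock" */
        visited++;
    }
    return visited;
}
```

To try it, call toy_sched_init(), bring CPUs 0..NR_ONLINE-1 up with
toy_cpu_online(), and then run toy_visit_all_possible(). If toy_rq.core
were left NULL for CPUs 4-7, that walk would dereference a NULL pointer,
which is the model's analogue of the failure described in the changelog.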