Re: [RFC PATCH 06/16] sched: Add core wide task selection and scheduling.

From: Vineeth Remanan Pillai
Date: Fri Jul 03 2020 - 16:22:02 EST


On Wed, Jul 1, 2020 at 7:28 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
> Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
>
> The selection logic does not run correctly if the current CPU is not in the
> cpu_smt_mask (which it is not because the CPU is offlined when the stopper
> finishes running and needs to switch to idle). There are also other issues
> fixed by the patch I think such as: if some other sibling set core_pick to
> something, however the selection logic on current cpu resets it before
> selecting. In this case, we need to run the task selection logic again to
> make sure it picks something if there is something to run. It might end up
> picking the wrong task.
>
I am not sure if this can happen. If the other sibling sets core_pick, it
will be under the core wide lock and it should set the core_sched_seq also
before releasing the lock. So when this cpu tries, it would see the core_pick
before resetting it. Is this the same case you were mentioning? Sorry if I
misunderstood the case you mentioned..

> Yet another issue was, if the stopper thread is an
> unconstrained pick, then rq->core_pick is set. The next time task selection
> logic runs when stopper needs to switch to idle, the current CPU is not in
> the smt_mask. This causes the previous ->core_pick to be picked again which
> happens to be the unconstrained task! so the stopper keeps getting selected
> forever.
>
I did not clearly understand this. During an unconstrained pick, current
cpu's core_pick is not set and tasks are not picked for siblings as well.
If it is observed being set in the v6 code, I think it should be a bug.

> That and there are a few more safe guards and checks around checking/setting
> rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> issue. Now it runs for an hour or so without issue. (Torture testing debug
> changes: https://bit.ly/38htfqK ).
>
> Various fixes were tried causing varying degrees of crashes. Finally I found
> that it is easiest to just add current CPU to the smt_mask's copy always.
> This is so that task selection logic always runs on the current CPU which
> called schedule().
>
> [...]
> cpu = cpu_of(rq);
> - smt_mask = cpu_smt_mask(cpu);
> + /* Make a copy of cpu_smt_mask as we should not set that. */
> + cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> +
> + /*
> + * Always make sure current CPU is added to smt_mask so that below
> + * selection logic runs on it.
> + */
> + cpumask_set_cpu(cpu, &select_mask);
>
I like this idea. Probably we can optimize it a bit. We get here with cpu
not in smt_mask only during an offline and online(including the boot time
online) phase. So we could probably wrap it in an "if (unlikely())". Also,
during this time, it would be idle thread or some hotplug online thread that
would be runnable and no other tasks should be runnable on this cpu. So, I
think it makes sense to do an unconstrained pick rather than a costly sync
of all siblings. Probably something like:

cpumask_copy(&select_mask, cpu_smt_mask(cpu));
if (unlikely(cpumask_test_cpu(cpu, &select_mask))) {
cpumask_set_cpu(cpu, &select_mask);
need_sync = false;
}

By setting need_sync to false, we will do an unconstrained pick and will
not sync with other siblings. I guess we need to reset need_sync after
or in the following for_each_cpu loop, because the loop may set it.

> /*
> * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
> @@ -4351,7 +4358,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

> if (i == cpu && !need_sync && !p->core_cookie) {
> next = p;
> + rq_i->core_pick = next;
> + rq_i->core_sched_seq = rq_i->core->core_pick_seq;
>
I think we would not need these here. core_pick needs to be set only
for siblings if we are picking a task for them. For unconstrained pick,
we pick only for ourselves. Also, core_sched_seq need not be synced here.
We might already be synced with the existing core->core_pick_seq. Even
if it is not synced, I don't think it will cause an issue in subsequent
schedule events.


> done:
> + /*
> + * If we reset a sibling's core_pick, make sure that we picked a task
> + * for it, this is because we might have reset it though it was set to
> + * something by another selector. In this case we cannot leave it as
> + * NULL and should have found something for it.
> + */
> + for_each_cpu(i, &select_mask) {
> + WARN_ON_ONCE(!cpu_rq(i)->core_pick);
> + }
> +
I think this check will not be true always. For unconstrained pick, we
do not pick tasks for siblings and hence do not set core_pick for them.
So this WARN_ON will fire for unconstrained pick. Easily reproducible
by creating an empty cgroup and tagging it. Then only unconstrained
picks will happen and this WARN_ON fires. I guess this check after the
done label does not hold and could be removed.

Thanks,
Vineeth