Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

From: Vineeth Pillai
Date: Mon Aug 31 2020 - 09:02:07 EST




On 8/29/20 3:47 AM, peterz@xxxxxxxxxxxxx wrote:
During hotplug stress test, we have noticed that while a sibling is in
pick_next_task, another sibling can go offline or come online. What
we have observed is smt_mask get updated underneath us even if
we hold the lock. From reading the code, looks like we don't hold the
rq lock when the mask is updated. This extra logic was to take care of that.
Sure, the mask is updated async, but _where_ is the actual problem with
that?
One issue that we observed was with the inconsistent view of smt_mask
in the three loops in pick_next_task can lead to corner cases. The first loop
resets all coresched fields, second loop picks the task for each siblings and
then if the sibling came online in the third loop, we IPI it and it crashes for a
NULL core_pick. Similarly I think we might have issues if the sibling goes
offline in the last loop or sibling is offline/online in the pick loop.

It might be possible to do specific checks for core_pick in the loops to fix the
corner cases above. But I was not sure if there were more and hence took this
approach. I understand that cpumask_weight is expensive and will try to fix it
using core_pick to fix the corner cases.


On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote:
Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6)
detail the individual hotplug changes squashed into this patch:
https://lore.kernel.org/lkml/20200815031908.1015049-9-joel@xxxxxxxxxxxxxxxxx/
https://lore.kernel.org/lkml/20200815031908.1015049-11-joel@xxxxxxxxxxxxxxxxx/
That one looks fishy, the pick is core wide, making that pick_seq per rq
just doesn't make sense.
I think there are couple of scenarios where pick_seq per sibling will be
useful. One is this hotplug case, where you need to pick only for the online
siblings and the offline siblings can come online async. Second case that
we have seen is, we don't need to mark pick for siblings when we pick a task
which is currently running. Marking the pick core wide will make the sibling
take the fast path and re-select the same task during a schedule event
due to preemption or its time slice expiry.

https://lore.kernel.org/lkml/20200815031908.1015049-12-joel@xxxxxxxxxxxxxxxxx/
This one reads like tinkering, there is no description of the actual
problem just some code that makes a symptom go away.

Sure, on hotplug the smt mask can change, but only for a CPU that isn't
actually scheduling, so who cares.

/me re-reads the hotplug code...

..ooOO is the problem that we clear the cpumasks on take_cpu_down()
instead of play_dead() ?! That should be fixable.
Yes, we get called into schedule(for going idle) for an offline cpu and it gets
confused. Also I think there could be problems while it comes online as well,
like I mentioned above. We might IPI a sibling which just came online with a
NULL core_pick. But I think we can fix it with specific checks for core_pick.

I shall look into the smt mask update side of the hotplug again and see
if corner cases could be better handled there.

Thanks,
Vineeth