Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

From: Valentin Schneider
Date: Tue Apr 14 2020 - 14:59:31 EST


Hi,

On 14/04/20 16:05, Qais Yousef wrote:
> Now that we have a proper function that returns a 'random' CPU in a mask [1]
> utilize that in find_lowest_rq() to solve the thundering herd issue described
> in this thread
>
> https://lore.kernel.org/lkml/20200219140243.wfljmupcrwm2jelo@e107158-lin/
>
> But as a pre-amble, I noticed that the new cpumask_any_and_distribute() is
> actually an alias for cpumask_any_and() which is documented as returning
> a 'random' cpu but actually just does cpumask_first_and().
>
> The first 3 patches cleanup the API so that the whole family of
> cpumask_any*() take advantage of the new 'random' behavior

I'm a bit wary about such blanket changes. I feel like most places impacted
by this change don't gain anything from random selection. In sched land
that would be:

- The single cpumask_any() in core.c::select_task_rq()
- Pretty much any function that wants a CPU id to dereference a
  root_domain; there are some of them in deadline.c and topology.c

Looking some more into it, there are shadier things:

- cpufreq_offline() uses cpumask_any() to figure out the new policy
  leader... That one should be cpumask_first()
- gic_set_affinity() uses cpumask_any_and() (in the common case). If this
  starts using randomness, you will stop affining e.g. all SPIs to CPU0
  by default (!!!)
- ... and there might be more

I think people picked cpumask_any_*() mostly because there is just
cpumask_first() while there are more cpumask_any_*() variants, and since
those have been returning the first set CPU for over a decade, people just
went with it.

To move this forward, I would suggest renaming the current cpumask_any_*()
to cpumask_first_*(), and only THEN introducing the new pseudo-random
ones. People are then free to hand-fix specific locations if it makes sense
there, like you're doing for RT.
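
To make the distinction concrete, here is a minimal userspace sketch, not
the kernel implementation. mask_t, NR_CPUS as defined here,
mask_first_and() and mask_any_and_distribute() are hypothetical stand-ins
for the cpumask API, and resuming the search after the previous pick is
only one assumed way of getting the 'distribute' behavior:

```c
#include <stdint.h>

#define NR_CPUS 8	/* assumption: tiny system, for illustration only */

/* Userspace stand-in for a cpumask: bit n set means CPU n is in the mask. */
typedef uint64_t mask_t;

/*
 * Deterministic pick: the lowest CPU set in both masks, or NR_CPUS if the
 * intersection is empty. Callers that need a stable answer (policy leader,
 * default IRQ affinity) want this flavor.
 */
static int mask_first_and(mask_t a, mask_t b)
{
	mask_t m = a & b;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (m & ((mask_t)1 << cpu))
			return cpu;

	return NR_CPUS;
}

/*
 * Distributing pick: resume the search just after the previous pick and
 * wrap around, so back-to-back callers land on different CPUs when the
 * intersection holds more than one. *prev must be in [0, NR_CPUS) and is
 * updated on success; returns NR_CPUS if the intersection is empty.
 */
static int mask_any_and_distribute(mask_t a, mask_t b, int *prev)
{
	mask_t m = a & b;

	for (int i = 1; i <= NR_CPUS; i++) {
		int cpu = (*prev + i) % NR_CPUS;

		if (m & ((mask_t)1 << cpu)) {
			*prev = cpu;
			return cpu;
		}
	}

	return NR_CPUS;
}
```

With masks 0xF0 and 0x3C (intersection: CPUs 4 and 5) and prev starting at
NR_CPUS - 1, mask_first_and() returns 4 on every call, while successive
mask_any_and_distribute() calls sharing the same prev return 4, 5, 4, ...
Only the callers that want the second behavior would need converting.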

I think it's safe to say the vast majority of the current callers do not
require randomness - the exceptions should mainly be scheduler / workqueues
and the like.

> and in patch
> 4 I convert the cpumask_first_and() --> cpumask_any_and() in find_lowest_rq()
> to allow to better distribute the RT tasks that wake up simultaneously.
>
> [1] https://lore.kernel.org/lkml/20200311010113.136465-1-joshdon@xxxxxxxxxx/
>
> CC: Juri Lelli <juri.lelli@xxxxxxxxxx>
> CC: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> CC: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
> CC: Ben Segall <bsegall@xxxxxxxxxx>
> CC: Mel Gorman <mgorman@xxxxxxx>
> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> CC: Yury Norov <yury.norov@xxxxxxxxx>
> CC: Paul Turner <pjt@xxxxxxxxxx>
> CC: Alexey Dobriyan <adobriyan@xxxxxxxxx>
> CC: Josh Don <joshdon@xxxxxxxxxx>
> CC: Pavan Kondeti <pkondeti@xxxxxxxxxxxxxx>
> CC: linux-kernel@xxxxxxxxxxxxxxx
>
> Qais Yousef (4):
>   cpumask: Rename cpumask_any_and_distribute
>   cpumask: Make cpumask_any() truly random
>   cpumask: Convert cpumask_any_but() to the new random function
>   sched/rt: Better distribute tasks that wakeup simultaneously
>
>  include/linux/cpumask.h | 33 ++++++-----------
>  kernel/sched/core.c     |  2 +-
>  kernel/sched/rt.c       |  4 +-
>  lib/cpumask.c           | 82 +++++++++++++++++++++++++++--------------
>  4 files changed, 68 insertions(+), 53 deletions(-)