Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

From: Qais Yousef
Date: Tue Apr 21 2020 - 08:13:14 EST


+Rafael +Marc

(There's a question about a function that you maintain below, the context is
that cpumask_any() now truly returns a random cpu in the mask and there were
concern it might break something you maintain)

On 04/14/20 19:58, Valentin Schneider wrote:
> Hi,
>
> On 14/04/20 16:05, Qais Yousef wrote:
> > Now that we have a proper function that returns a 'random' CPU in a mask [1]
> > utilize that in find_lowest_rq() to solve the thundering herd issue described
> > in this thread
> >
> > https://lore.kernel.org/lkml/20200219140243.wfljmupcrwm2jelo@e107158-lin/
> >
> > But as a pre-amble, I noticed that the new cpumask_any_and_distribute() is
> > actually an alias for cpumask_any_and() which is documented as returning
> > a 'random' cpu but actually just does cpumask_first_and().
> >
> > The first 3 patches cleanup the API so that the whole family of
> > cpumask_any*() take advantage of the new 'random' behavior
>
> I'm a bit wary about such blanket changes. I feel like most places impacted
> by this change don't gain anything by using the random thing. In sched land
> that would be:

The API has always been clear that cpumask_any return a random cpu within the
mask. And the fact it's a one liner with cpumask_first() directly visible,
a user made the choice to stick to cpumask_any() indicates that that's what
they wanted.

Probably a lot of them they don't care what cpu is returned and happy with the
random value. I don't see why it has to have an effect. Some could benefit,
like my use case here. Or others truly don't care, then it's fine to return
anything, as requested.

>
> - The single cpumask_any() in core.c::select_task_rq()
> - Pretty much any function that wants a CPU id to dereference a
> root_domain; there's some of them in deadline.c, topology.c
>
> Looking some more into it, there's shadier things:
>
> - cpufreq_offline() uses cpumask_any() to figure out the new policy
> leader... That one should be cpumask_first()

I CCed Rafael who's the maintainer of this file who can speak better of what
should be expected.

I don't see why it _should_ be cpumask_first(). AFAICT any cpu can be the new
leader. So nothing breaks here.

> - gic_set_affinity() uses cpumask_any_and() (in the common case). If this
> starts using randomness, you will stop affining e.g. all SPIs to CPU0
> by default (!!!)

I CCed Marc who's the maintainer of this file who can clarify better if this
really breaks anything.

If any interrupt expects to be affined to a specific CPU then this must be
described in DT/driver. I think the GIC controller is free to distribute them
to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
in userspace, IIUC.

I don't see how cpumask_any_and() break anything here too. I actually think it
improves on things by better distribute the irqs on the system by default.

Marc will know better for sure. It's been a long time I looked at irqchip code.

> - ... and there might be more

I can appreciate you have some concerns this might break something. But IMO
this will be isolated cases of bad copy/paste. I think the blanket conversion
of cpumask_any() to cpumask_first() is the bad one because it dishonours the
choice developers made to pick one over the other. As I mention above the doc
was clear it will return a random value and the oneliner gave a clear
alternative to the user if they missed the cpumask_first() API.

The name of the API is so descriptive anyway to think that most users got it
wrong. And I think it acts as a good natural documentation of the code.

>
> I think people went with cpumask_any_* mostly because there is just
> cpumask_first() while there are more cpumask_any_* variants, and since
> those have been returning the first set CPU for over a decade people just
> went with it.

This is speculation. Unless we ask every developer we can't tell. Given the
clear documentation of cpumask_any() returns a random value, the natural
conclusion is that people want that or just truly don't care.

>
> To move this forward, I would suggest renaming the current cpumask_any_*()
> into cpumask_first_*(), and THEN introduce the new pseudo-random
> ones. People are then free to hand-fix specific locations if it makes sense
> there, like you're doing for RT.

I strongly disagree with this. I'd rather keep the status-quo instead and just
use the new function in rt.c as it's currently named.

>
> I think it's safe to say the vast majority of the current callers do not
> require randomness - the exceptions should mainly be scheduler / workqueues
> and the like.

It's hard to judge. As the API clearly says this returns a random value,
I think assuming that most users got it wrong is the unnatural way of
thinking. If there happened to be bad users, that's an isolated cases and bugs
that we can certainly be easily fixed.

Thanks

--
Qais Yousef

>
> > and in patch
> > 4 I convert the cpumask_first_and() --> cpumask_any_and() in find_lowest_rq()
> > to allow to better distribute the RT tasks that wake up simultaneously.
> >
> > [1] https://lore.kernel.org/lkml/20200311010113.136465-1-joshdon@xxxxxxxxxx/
> >
> > CC: Juri Lelli <juri.lelli@xxxxxxxxxx>
> > CC: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > CC: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> > CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
> > CC: Ben Segall <bsegall@xxxxxxxxxx>
> > CC: Mel Gorman <mgorman@xxxxxxx>
> > CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > CC: Yury Norov <yury.norov@xxxxxxxxx>
> > CC: Paul Turner <pjt@xxxxxxxxxx>
> > CC: Alexey Dobriyan <adobriyan@xxxxxxxxx>
> > CC: Josh Don <joshdon@xxxxxxxxxx>
> > CC: Pavan Kondeti <pkondeti@xxxxxxxxxxxxxx>
> > CC: linux-kernel@xxxxxxxxxxxxxxx
> >
> > Qais Yousef (4):
> > cpumask: Rename cpumask_any_and_distribute
> > cpumask: Make cpumask_any() truly random
> > cpumask: Convert cpumask_any_but() to the new random function
> > sched/rt: Better distribute tasks that wakeup simultaneously
> >
> > include/linux/cpumask.h | 33 ++++++-----------
> > kernel/sched/core.c | 2 +-
> > kernel/sched/rt.c | 4 +-
> > lib/cpumask.c | 82 +++++++++++++++++++++++++++--------------
> > 4 files changed, 68 insertions(+), 53 deletions(-)