Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()
From: Yury Norov
Date: Tue Apr 21 2020 - 10:22:58 EST
On Tue, Apr 21, 2020 at 03:28:14PM +0200, Vincent Guittot wrote:
> On Tue, 21 Apr 2020 at 15:18, Valentin Schneider
> <valentin.schneider@xxxxxxx> wrote:
> >
> >
> > On 21/04/20 13:13, Qais Yousef wrote:
> > > On 04/14/20 19:58, Valentin Schneider wrote:
> > >>
> > >> I'm a bit wary about such blanket changes. I feel like most places impacted
> > >> by this change don't gain anything by using the random thing. In sched land
> > >> that would be:
> > >
> > > The API has always been clear that cpumask_any() returns a random cpu within
> > > the mask. And given that it's a one-liner with cpumask_first() directly
> > > visible, the fact that a user chose to stick with cpumask_any() indicates
> > > that that's what they wanted.
> > >
> > > Probably a lot of them don't care what cpu is returned and are happy with the
> > > random value. I don't see why it has to have an effect. Some could benefit,
> > > like my use case here. Others truly don't care, and then it's fine to return
> > > anything, as requested.
> > >
> >
> > Exactly, *some* (which AFAICT is a minority) might benefit. So why should
> > all the others pay the price for a functionality they do not need?
> >
> > I don't think your change would actually cause a splat somewhere; my point
> > is about changing existing behaviour without having a story for it. The
> > thing said 'pick a "random" cpu', sure, but it never did that, it always
> > picked the first.
> >
> > I've pointed out two examples that want to be cpumask_first(), and I'm
> > absolutely certain there are more than these two out there. What if folks
> > ran some performance test and were completely fine with the _first()
> > behaviour? What tells you randomness won't degrade some cases?
>
> I tend to agree that any doesn't mean random and using a random cpu
> will create strange behavior
>
> One example is the irq affinity on a b.L system. Right now, the irqs are
> always pinned to the same CPU (the 1st one, which is most probably a
> Little), but with your change we can imagine that this will change and
> might even change across 2 consecutive boots if, for whatever reason (and
> this happens), the drivers are not probed in the same order. In the end
> you will run some tests with irqs on a little and others with irqs on a big.
> More generally, any SMP system can be impacted because an irq will no
> longer be pinned to the same CPU, alongside the same other irqs, every
> time.
>
> >
> > IMO the correct procedure is to keep everything as it is and improve the
> > specific callsites that benefit from randomness. I get your point that
>
> I agree with this point
> > using cpumask_any() should be a good enough indicator of the latter, but I
> > don't think it can realistically be followed. To give my PoV, if in the
> > past someone had used a cpumask_any() where a cpumask_first() could do, I
> > would've acked it (disclaimer: super representative population of sample
> > size = 1).
> >
> > Flipping the switch on everyone to then have a series of patches "oh this
> > one didn't need it", "this one neither", "I actually need this to be the
> > first" just feels sloppy.
> >
> > > I CCed Marc who's the maintainer of this file who can clarify better if this
> > > really breaks anything.
> > >
> > > If any interrupt expects to be affined to a specific CPU then this must be
> > > described in DT/driver. I think the GIC controller is free to distribute them
> > > to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
> > > in userspace, IIUC.
> > >
> > > I don't see how cpumask_any_and() breaks anything here either. I actually think
> > > it improves things by better distributing the irqs on the system by default.
> > >
> >
> > As you say, if someone wants smarter IRQ affinity they can do irq_balancer
> > and whatnot. The default kernel policy for now has been to shove everything
> > on the lowest-numbered CPU, and I see no valid reason to change that.
My 5 cents: I was also surprised to find that cpumask_any() is nailed to the
first CPU, but for my use case it turned out to be beneficial.
Namely, all system IRQs and other events are targeted at the first CPU, which is
treated as the system maintenance unit. The other CPUs are dedicated to
user-specific payloads using task isolation. This approach improves latency a lot.
Systems with many cores in idle/powersave mode probably benefit from the current
behaviour as well: they don't wake up sleeping cores, and therefore save power
and improve IRQ turnaround.
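To make the difference concrete, here is a minimal userspace sketch, not kernel
code. It assumes a 64-bit integer stands in for a cpumask; the names
mask_first(), mask_any() and mask_any_round_robin() are hypothetical and only
model the two policies discussed in this thread: "any" as an alias for "first"
(what the kernel has effectively done so far) versus an "any" that spreads its
picks across the mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 64u	/* model limit: one bit per CPU in a uint64_t */

/* Lowest set bit, or NR_CPUS if the mask is empty (models cpumask_first()). */
static unsigned int mask_first(uint64_t mask)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return NR_CPUS;
}

/* Models the existing behaviour: "any" always resolves to the first CPU. */
static unsigned int mask_any(uint64_t mask)
{
	return mask_first(mask);
}

/*
 * Hypothetical distributing variant: pick the next set bit after *prev,
 * wrapping around, and remember the pick for the next call.
 */
static unsigned int mask_any_round_robin(uint64_t mask, unsigned int *prev)
{
	assert(prev != NULL);

	if (!mask)
		return NR_CPUS;

	for (unsigned int i = 1; i <= NR_CPUS; i++) {
		unsigned int cpu = (*prev + i) % NR_CPUS;

		if (mask & (UINT64_C(1) << cpu)) {
			*prev = cpu;
			return cpu;
		}
	}
	return NR_CPUS;	/* not reached for a non-empty mask */
}
```

With a mask covering CPUs 1, 2, 4 and 5, mask_any() keeps returning 1, so
everything lands on the "maintenance" CPU and the others stay undisturbed. The
round-robin variant walks 1, 2, 4, 5 and wraps, which is exactly what would wake
up the isolated or sleeping cores in the setups described above.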
Thanks,
Yury