Re: [RFC PATCH 0/3] sched: delayed thread migration

From: Peter Zijlstra
Date: Wed Oct 21 2020 - 07:49:51 EST


On Wed, Oct 21, 2020 at 12:40:44PM +0200, Redha wrote:

> >> The main idea behind this patch series is to bring to light the frequency
> >> inversion problem that will become more and more prominent with new CPUs
> >> that feature per-core DVFS. The solution proposed is a first idea for
> >> solving this problem that still needs to be tested across more CPUs and
> >> with more applications.
> > Which is why schedutil (the only cpufreq gov anybody should be using) is
> > integrated with the scheduler and closes the loop and tells the CPU
> > about the expected load.
> >
> While I agree that schedutil is probably a good option, I'm not sure we're
> addressing exactly the same problem. schedutil aims at mapping the frequency of
> the CPU to the actual load. What I'm saying is that since it takes some
> time for the frequency to match the load, why not account for the frequency
> when making placement/migration decisions.

Because overhead, mostly :/ EAS does some of that. Wakeup CPU selection
is already a bottleneck for some applications (see the fight over
select_idle_sibling()).

Programming a timer is out of budget for high rate wakeup workloads.
Worse, you also don't prime the CPU to ramp up during the enforced
delay.

Also, two new config knobs :-(

> I know that with the frequency invariance code, capacity accounts for
> frequency, which means that thread placement decisions do account for
> frequency indirectly. However, we still see performance improvements
> with our patch for workloads with fork/wait patterns. I really
> believe that we can still gain performance if we make decisions while
> accounting for the frequency more directly.

So I don't think that's fundamentally a DVFS problem though, just
something that's exacerbated by it. There's a long history with trying
to detect this pattern, see for example WF_SYNC and wake_affine().

(we even had some code in the early CFS days that measured overlap
between tasks, to account for the period between waking up the recipient
and blocking on the answer, but that never worked reliably either, so
that went the way of the dodo)

The classical micro-benchmark is pipe-bench, which ping-pongs a single
byte between two tasks over a pipe. If you run that on a single CPU it
is _much_ faster than when the tasks get split up. DVFS is just like
caches here, yet another reason to keep them together.
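
A minimal sketch in the spirit of pipe-bench (not the actual benchmark;
the iteration count is arbitrary) looks like this. Run it pinned to one
CPU vs. two, e.g. "taskset -c 0 ./pingpong" vs. "taskset -c 0,2 ./pingpong"
(CPU numbers are machine-specific), to see the difference:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define ITERS 100000

int main(void)
{
	int ab[2], ba[2];	/* parent->child and child->parent pipes */
	char byte = 'x';

	if (pipe(ab) || pipe(ba)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {		/* child: echo everything back */
		for (int i = 0; i < ITERS; i++) {
			if (read(ab[0], &byte, 1) != 1 ||
			    write(ba[1], &byte, 1) != 1)
				exit(1);
		}
		exit(0);
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < ITERS; i++) {	/* parent: send, wait for echo */
		if (write(ab[1], &byte, 1) != 1 ||
		    read(ba[0], &byte, 1) != 1)
			return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	wait(NULL);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.0f ns per round trip\n", ns / ITERS);
	return 0;
}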

schedutil does solve the problem where, when we migrate a task, the
destination CPU would otherwise have to individually re-learn the DVFS
state. By using the scheduler statistics we can program the DVFS state
up-front, on migration, instead of waiting for it to ramp up on its own.
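
For reference, the mapping is roughly the following (a simplified sketch
of schedutil's get_next_freq()/map_util_freq(); the 1.25 headroom factor
matches the kernel, but the helper and the numbers in main() are only for
illustration). Because the task's utilization moves with it, the
destination CPU can be asked for approximately the right frequency
immediately:

#include <stdio.h>

/* target ~= 1.25 * max_freq * util / max_capacity */
static unsigned long next_freq(unsigned long util, unsigned long max_cap,
			       unsigned long max_freq)
{
	/* the extra 25% gives some headroom above the measured utilization */
	return (max_freq + (max_freq >> 2)) * util / max_cap;
}

int main(void)
{
	/* e.g. a task worth ~512/1024 of a CPU landing on a 3 GHz-max core */
	printf("%lu kHz\n", next_freq(512, 1024, 3000000));
	return 0;
}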

So from that PoV, schedutil fundamentally solves the individual DVFS
problem as well as possible. It closes the control loop; we no longer
have individually operating control loops that are unaware of one
another.