Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running

From: Peter Zijlstra
Date: Tue Oct 05 2021 - 06:24:06 EST


On Wed, Sep 22, 2021 at 06:38:53PM +0100, Mel Gorman wrote:

>
> I'm not seeing an alternative suggestion that could be turned into
> an implementation. The current value for sched_wakeup_granularity
> was set 12 years ago was exposed for tuning which is no longer

(it has always been SCHED_DEBUG)

> the case. The intent was to allow some dynamic adjustment between
> sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> over-scheduling in the worst case without disabling preemption entirely
> (which the first version did).
>
> Should we just ignore this problem and hope it goes away or just let
> people keep poking silly values into debugfs via tuned?

People are going to do stupid no matter what (and tuned is very much
included there -- poking random numbers in debug interfaces without
doing analysis on what the workload does and needs it just wasting
everybodies time. RHT really should know better than that).

I'm also thinking that adding more heuristics isn't going to improve the
situation lots.

For the past # of years people have been talking about extendng the task
model for SCHED_NORMAL, latency_nice was one such proposal.

If we really want to fix this proper, and not make a bigger mess of
things, we should look at all these various workloads and identify
*what* specifically they want and *why*.

Once we have this enumerated, we can look at what exactly we can provide
and how to structure the interface.

The extention must be hint only, we should be free to completely ignore
it.

The things I can think of off the top of my head are:

- tail latency; prepared to waste time to increase the odds of running
sooner. Possible effect: have this task always do a full
select_idle_sibling() scan.

(there's also the anti case, which I'm not sure how to enumerate,
basically they don't want select_idle_sibling(), just place the task
wherever)

- non-interactive; doesn't much care about wakeup latency; can suffer
packing?

- background; (implies non-interactive?) doesn't much care about
completion time either, just cares about efficiency

- interactive; cares much about wakeup-latency; cares less about
throughput.

- (energy) efficient; cares more about energy usage than performance


Now, we already have SCHED_BATCH that completely kills wakeup preemption
and SCHED_IDLE lets everything preempt. I know SCHED_IDLE is in active
use (because we see patches for that), I don't think people use
SCHED_BATCH much, should they?


Now, I think a number of contraints are going to be mutually exclusive,
and if we get such tasks combined there's just nothing much we can do
about it (and since it's all hints, we good).

We should also not make the load-balance condition too complicated, we
don't want it to try and (badly) solve a large set of constraints.
Again, if possible respect the hints, otherwise tough luck.