Re: [SchedulerWakeupLatency] Skipping Idle Cores and CPU Search

From: chris hyser
Date: Wed Jul 22 2020 - 14:58:50 EST


On 7/20/20 4:47 AM, Dietmar Eggemann wrote:
On 10/07/2020 01:08, chris hyser wrote:

[...]

D) Desired behavior:

Reduce the maximum wake-up latency of designated CFS tasks by skipping
some or all of the idle CPU and core searches by setting a maximum idle
CPU search value (maximum loop iterations).

Searching 'ALL' as the maximum would be the default and implies the
current code path which may or may not search up to ALL. Searching 0
would result in the least latency (shown with experimental results to be
included if/when patchset goes up). One of the considerations is that
the maximum length of the search is a function of the size of the LLC
scheduling domain and this is platform dependent. Whether 'some', i.e. a
numerical value limiting the search can be used to "normalize" this
latency across differing scheduling domain sizes is under investigation.
Clearly differing hardware will have many other significant differences,
but in different sized and dynamically sized VMs running on fleets of
common HW this may be interesting.

I assume that this task-specific feature could coexist in
select_idle_core() and select_idle_cpu() with the already existing
runtime heuristics (test_idle_cores() and the two sched features
mentioned under E/F) to reduce the idle CPU search space on a busy system.

Yes, so perhaps a more generalized summary of the feature is that it simply places a per-task maximum number of iterations on the various 'for_each_cpu' loops (whose max is platform dependent) in this path. Any other technique that short-circuits the loop below this max would be fine, including the fact that the very first 'idle' check in a loop may succeed, which is perfectly ok in terms of minimizing the search latency. This really only kicks in on busy systems. While system-wide or scheduling-domain-wide heuristics can reduce the cost to tasks of not having a per-task limit like this, they can't drive the loop iteration count to 0, because that is BAD policy when applied to the wrong tasks or to too many tasks.
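
To make the "per-task maximum number of iterations" idea concrete, here is a minimal userspace sketch, not kernel code. The struct task_model, its max_idle_search field and capped_idle_search() are hypothetical names invented for illustration; the real select_idle_core()/select_idle_cpu() loops walk cpumasks and are considerably more involved.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-task attribute: INT_MAX stands for 'ALL' (no cap from this knob). */
struct task_model {
	int max_idle_search;
};

/*
 * Userspace model of a capped idle-CPU scan. Returns the index of the
 * first idle CPU found within the task's iteration budget, or -1 if the
 * budget runs out or nothing is idle (a caller would then fall back to
 * the target CPU). An early hit on the very first check is fine; the
 * budget is only an upper bound on the loop.
 */
static int capped_idle_search(const struct task_model *p,
			      const bool *cpu_idle, int nr_cpus)
{
	int budget = p->max_idle_search;

	assert(nr_cpus >= 0 && budget >= 0);

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (budget-- <= 0)
			return -1;	/* per-task cap reached, stop searching */
		if (cpu_idle[cpu])
			return cpu;	/* found an idle CPU within budget */
	}
	return -1;
}
```

With a budget of 0 the loop body never inspects a CPU, which is the "least latency" case; with INT_MAX the scan is bounded only by nr_cpus, i.e. by whatever the existing heuristics would have allowed anyway.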



E/F) Existing knobs (and limitations):

There are existing sched_feat: SIS_AVG_CPU, SIS_PROP that attempt to
short circuit the idle cpu search path in select_idle_cpu() based on
estimations of the current costs of searching. Neither provides a means

[...]

H) Range Analysis:

The knob is a non-negative integer representing "max number of CPUs to
search". The default would be 'ALL' which could be translated as
INT_MAX. '0 searches' translates to 0. Other values represent a max
limit on the search, in this case iterations of a for loop.
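
As a rough illustration of that range, the sketch below maps a knob value onto an effective loop bound for a given LLC domain size. IDLE_SEARCH_ALL and effective_search_limit() are hypothetical names, and rejecting negative values is an assumption about how an interface might validate the knob, not something specified above.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical name for the 'ALL' default described above. */
#define IDLE_SEARCH_ALL INT_MAX

/*
 * Map a knob value to the number of loop iterations actually allowed
 * in an LLC domain of llc_size CPUs:
 *   IDLE_SEARCH_ALL -> llc_size (the knob imposes no cap of its own)
 *   0               -> 0        (skip the search entirely)
 *   n               -> min(n, llc_size)
 * Negative values are treated as invalid and return -1 (assumption).
 */
static int effective_search_limit(int knob, int llc_size)
{
	assert(llc_size >= 0);

	if (knob < 0)
		return -1;
	return knob < llc_size ? knob : llc_size;
}
```

The same small value of n yields the same bound on a 16-CPU and a 64-CPU LLC, which is the sense in which a numerical limit might "normalize" search latency across domain sizes.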

IMHO the opposite use case for this feature (favour high throughput over
short wakeup latency (Facebook)) is already cured by the changes
introduced by commit 10e2f1acd010 ("sched/core: Rewrite and improve
select_idle_siblings()"), i.e. with the current implementation of sis().

It seems that they don't need an additional per-task feature on top of
the default system-wide runtime heuristics.

Agreed, and I hope I've clarified that the attribute in question should not affect that: its default is basically "no short cut because of this", and other heuristics may still apply.

-chrish