[RFC PATCH 0/5] sched: Userspace Hinting for Task Placement

From: K Prateek Nayak
Date: Sat Sep 10 2022 - 06:54:28 EST


This series implements one possible mechanism of Userspace Hinting
for scheduler aimed at influencing task placement. This work will be
discussed as a part of the talk "Linux Kernel Scheduling and split-LLC
architectures: Overview, Challenges and Opportunities"
(https://lpc.events/event/16/contributions/1274/)
in the "Real-time and Scheduling MC" at Linux Plumbers Conference 2022.

Note: This is an experimental patchset that exposes some low-level knobs
to influence task placement decision at various decision points in
the scheduler. The current API design is experimental and is only
capable of setting low-level hints. This API is not meant for public
consumption and only serves as a means to test and demonstrate the
efficacy of hints in helping the scheduler make optimal placement
decisions based on the requirements provided by the applications.
Scheduler is free to ignore the hints set by the user if it believes
that following the hints will put the system in a suboptimal state.

- Motivation

The heuristics used by the scheduler today, such as the WF_SYNC flag,
wake_wide() logic, etc., fall short at accurately inferring the nature
of the workload in terms of whether it is preferable to consolidate a
group of threads close together or if they should be spread apart. The
inability to infer the nature of the workload can lead to a series of
incorrect placement decisions that can be detrimental to the workload
performance. The penalty seems to be severe on systems with split-LLC
such as AMD EPYC.

Consider a workload like schbench. It is observed that the 99th
percentile latency reported by schebench improves drastically if
the messenger and worker are on the same LLC. However, the signaling
between the messenger and the workers happens via a futex which does
not set the WF_SYNC flag when waking up the waiting tasks. In such a
scenario, scheduler will not pull the worker thread towards the
messenger, leaving it in a suboptimal state, on an external LLC even
though there is room on the messenger's LLC to accommodate the worker.

Another example is that of tbench where tasks sleep soon after the
initial wakeup. In such a scenario, with only few running tasks in the
system, the NUMA imbalance threshold is never crossed and all the
following placements of new tasks exhibiting similar behavior will
happen on the same NUMA node and more often on a select few LLCs
within it. When all the tbench threads are forked, a wakeup signal will
lead to all the waiting tasks waking up at once, storming the LLCs,
overloading them. The imbalance at subsequent wakeup is due to the
non-uniform distribution of task across the system during the initial
placement. tbench could have benefited from spreading the tasks early
on, reducing the many migrations it would require later to reach an
optimal state however the current heuristics do not allow for this to
happen.

Peter Zijlstra, almost a year ago, had indicated that a high-level
hinting framework is needed to prevent addition of more low-level
heuristics, complicating the scheduler logic.
(https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/)
Peter suggested that the high-level hints can describe the type of
workload the user is running and internally, the scheduler can take
care of modifying few parameters and heuristics to make the workload
run optimally on a best effort basis.

Following are, and I quote, the hints described by Peter in his
response:

```
- tail latency; prepared to waste time to increase the odds of running
sooner. Possible effect: have this task always do a full
select_idle_sibling() scan.

(there's also the anti case, which I'm not sure how to enumerate,
basically they don't want select_idle_sibling(), just place the task
wherever)

- non-interactive; doesn't much care about wakeup latency; can suffer
packing?

- background; (implies non-interactive?) doesn't much care about
completion time either, just cares about efficiency

- interactive; cares much about wakeup-latency; cares less about
throughput.

- (energy) efficient; cares more about energy usage than performance
```

Although the focus is on how to bias CPU time based on the workload
characteristics and modifying the idle CPU search space for the same,
the choice of an LLC to place the task on is equally important,
especially in split-LLC architectures.

In this experiment, we show the importance of task placement, its
effect on workload performance, and how influencing the task placement
in the right direction using the low-level knobs can have significant
performance impact on a split-LLC systems such as the AMD EPYC, where
the performance penalty of getting a task placement decision wrong can
be very high.

- Design

The patch series exposes low-level hints, that instructs the scheduler
of the requested behavior at each decision points in the scheduler.
These hints can be categorized into 3 types:

- Hints for initial wakeup

These hints include:
o PR_SCHED_HINT_FORK_AFFINE: Place the task close to the parent as
long as there is an equivalent of an idle core in the local group.
o PR_SCHED_HINT_FORK_SPREAD: When there is a tie in number of idle
CPUs between the local group and the idlest group during initial
task placement, go with the group with least utilization.

The implementation currently overlooks any NUMA bias at the time of
initial placement.

- Hints for subsequent wakeup

These hints include:
o PR_SCHED_HINT_WAKE_AFFINE: Bias task placement towards the
MC Domain of the waker task.
o PR_SCHED_HINT_WAKE_HOLD: Bias task placement towards the
MC Domain where the task previously ran.

These hints are followed on a best effort basis. If the targeted MC
Domain advertises it has an idle core through the sched_domain_shared
object, the wakeup hints are followed.

Only one hint from each category can be set at once to ensure a defined
behavior. Hits from different category can be combined at once as they
are independent of each other.

Load balancer is aware of the hints and will try to not migrate a task as
long as the LLC is not overloaded.

Children will inherit hints from parent. This was done so that hints can be
set from a wrapper script and did not need any modification of the
workload.

- Potential Hints for the Future

o PR_SCHED_HINT_WAKE_WIDE

Setting PR_SCHED_HINT_WAKE_WIDE will lead to the scheduler searching
the CPUs outside the targeted MC Domain (and within the parent
domain) if the MC Domain is fully loaded and no idle CPU can be found.

This can be realised as a generic extension to Yicong Yang's efforts to
optimize task placement in a clustered system where the scheduler
searches the CPU cluster for an idle CPU before searching the rest of the
CPUs in the MC domain.
(https://lore.kernel.org/lkml/20220822073610.27205-1-yangyicong@xxxxxxxxxx/)
For non-clustered system, this can be extended to first search the
CPUs of the MC domain and then look for a CPU in the higher domain,
if necessary, with the feature being gated behind a hint.

Although unimplemented, this hint has been mentioned here to promote
discussion on the concept and possible design.

- Workflow Illustration

o For Initial Task Placement:
Yes
Parent Task ----------> Group has an equivalent of idle core? ------------> Local Group
| ^ |
| fork() | | No
v | |---------> Default Behavior
New Task --------------> PR_SCHED_HINT_FORK_AFFINE
(Inherits |
Hints) |------> tie in number of idle CPUs Yes
between local and idlest group? --> PR_SCHED_HINT_FORK_SPREAD set? ------> Group with lowest
| | utilization
No | | No
v |
Default Behavior <-----------------------

o For Subsequent Wakeup:

Yes
Waker Task ------------------------> MC has idle core? ------------> Wake up on waker's MC Domain
| ^ |
| ttwu | | No
v | |---------> Default Behavior
Wakee Task -- prev_cpu and --------> PR_SCHED_HINT_WAKE_AFFINE ^
current CPU | |
do not share | |
L3 cache | |
and wakeup hints | |
are set. |--------> PR_SCHED_HINT_WAKE_HOLD |
| |
| | No
v |
Previous MC has idle core? -------------
|
v
Wake on MC Domain where
task previously ran

- API

Note: This is a highly experimental and barebone API designed to
prototype hinting. This is not intended for public consumption.

The hint can be set from the prctl() interface as follows:

#include <sys/prctl.h>

prctl(PR_SCHED_HINT /* prctl cmd */,
PR_SCHED_HINT_FORK_AFFINE | PR_SCHED_HINT_WAKE_AFFINE /* Combination of hints */,
0 /* pid of task to set hint for. 0 is current task */,
0, 0);

Once the hint is set, the scheduler tries its best to follow the hints
when it hits a decision point.

- Results and Explanation on AMD EPYC

Changes are based on tip:sched/core at
commit: 5531ecffa4b9 "sched: Add update_current_exec_runtime helper"

For ease of the readers, following short forms are used to refer to the
hints:

fork_affine: PR_SCHED_HINT_FORK_AFFINE
fork_spread: PR_SCHED_HINT_FORK_SPREAD
wake_affine: PR_SCHED_HINT_WAKE_AFFINE
wake_hold: PR_SCHED_HINT_WAKE_HOLD

Following are results from tests carried out on a dual socket Zen3
system (2 x 64C/128T) in NPS1 mode:

- Hackbench

o Fork Time Hint

Test: tip no-hint fork_affine fork_spread
1-groups: 4.31 (0.00 pct) 4.46 (-3.48 pct) 4.27 (0.92 pct) 4.28 (0.69 pct)
2-groups: 4.93 (0.00 pct) 4.85 (1.62 pct) 4.91 (0.40 pct) 5.15 (-4.46 pct)
4-groups: 5.38 (0.00 pct) 5.35 (0.55 pct) 5.36 (0.37 pct) 5.31 (1.30 pct)
8-groups: 5.59 (0.00 pct) 5.49 (1.78 pct) 5.51 (1.43 pct) 5.51 (1.43 pct)
16-groups: 7.18 (0.00 pct) 7.38 (-2.78 pct) 7.31 (-1.81 pct) 7.25 (-0.97 pct)

Due to the communicating nature of the hackbench threads, they
prefer to placed on the same LLC to maximize cache efficiency.
It does not prefer spreading at fork time.

o Wakeup Hint

Test: tip no-hint wake_affine wake_hold
1-groups: 4.31 (0.00 pct) 4.46 (-3.48 pct) 4.20 (2.55 pct) 4.11 (4.64 pct)
2-groups: 4.93 (0.00 pct) 4.85 (1.62 pct) 4.74 (3.85 pct) 5.15 (-4.46 pct)
4-groups: 5.38 (0.00 pct) 5.35 (0.55 pct) 5.04 (6.31 pct) 4.54 (15.61 pct)
8-groups: 5.59 (0.00 pct) 5.49 (1.78 pct) 5.39 (3.57 pct) 5.71 (-2.14 pct)
16-groups: 7.18 (0.00 pct) 7.38 (-2.78 pct) 7.24 (-0.83 pct) 7.76 (-8.07 pct)

As the tasks within a groups have a producer consumer behavior,
hackbench threads benefit from following the waker task. Hence an
affine wakeup leads to better performance.

o Combination of Hints

Test: tip fork_affine + wake_affine fork_spread + wake_hold
1-groups: 4.31 (0.00 pct) 4.20 (2.55 pct) 4.81 (-11.60 pct)
2-groups: 4.93 (0.00 pct) 4.74 (3.85 pct) 5.09 (-3.24 pct)
4-groups: 5.38 (0.00 pct) 5.01 (6.87 pct) 5.62 (-4.46 pct)
8-groups: 5.59 (0.00 pct) 5.38 (3.75 pct) 5.69 (-1.78 pct)
16-groups: 7.18 (0.00 pct) 7.25 (-0.97 pct) 7.97 (-11.00 pct)

A combination of the hints showing good result from above continue
to benefit the hackbench threads.

*Hackbench works best with the combination of relevant hints*

- schbench

o Fork Time Hint

#workers: tip no-hint fork_affine fork_spread
1: 37.00 (0.00 pct) 38.00 (-2.70 pct) 17.00 (54.05 pct) 34.00 (8.10 pct)
2: 39.00 (0.00 pct) 36.00 (7.69 pct) 21.00 (46.15 pct) 39.00 (0.00 pct)
4: 41.00 (0.00 pct) 41.00 (0.00 pct) 28.00 (31.70 pct) 40.00 (2.43 pct)
8: 53.00 (0.00 pct) 54.00 (-1.88 pct) 39.00 (26.41 pct) 53.00 (0.00 pct)
16: 73.00 (0.00 pct) 74.00 (-1.36 pct) 68.00 (6.84 pct) 73.00 (0.00 pct)
32: 116.00 (0.00 pct) 124.00 (-6.89 pct) 113.00 (2.58 pct) 119.00 (-2.58 pct)
64: 217.00 (0.00 pct) 215.00 (0.92 pct) 205.00 (5.52 pct) 222.00 (-2.30 pct)
128: 477.00 (0.00 pct) 440.00 (7.75 pct) 445.00 (6.70 pct) 453.00 (5.03 pct)
256: 1062.00 (0.00 pct) 1026.00 (3.38 pct) 1007.00 (5.17 pct) 1021.00 (3.86 pct)
512: 47552.00 (0.00 pct) 47168.00 (0.80 pct) 47296.00 (0.53 pct) 47552.00 (0.00 pct)

schbench tasks prefer an affine fork keeping most worker task on the
same LLC and only migrating necessary tasks out later. This reduces
the p99 latency significantly.

o Wakeup Hint

#workers: tip no-hint wake_affine wake_hold
1: 37.00 (0.00 pct) 38.00 (-2.70 pct) 18.00 (51.35 pct) 32.00 (13.51 pct)
2: 39.00 (0.00 pct) 36.00 (7.69 pct) 18.00 (53.84 pct) 36.00 (7.69 pct)
4: 41.00 (0.00 pct) 41.00 (0.00 pct) 21.00 (48.78 pct) 33.00 (19.51 pct)
8: 53.00 (0.00 pct) 54.00 (-1.88 pct) 31.00 (41.50 pct) 51.00 (3.77 pct)
16: 73.00 (0.00 pct) 74.00 (-1.36 pct) 2636.00 (-3510.95 pct) 75.00 (-2.73 pct)
32: 116.00 (0.00 pct) 124.00 (-6.89 pct) 15696.00 (-13431.03 pct) 124.00 (-6.89 pct)
64: 217.00 (0.00 pct) 215.00 (0.92 pct) 15280.00 (-6941.47 pct) 224.00 (-3.22 pct)
128: 477.00 (0.00 pct) 440.00 (7.75 pct) 14800.00 (-3002.72 pct) 493.00 (-3.35 pct)
256: 1062.00 (0.00 pct) 1026.00 (3.38 pct) 15696.00 (-1377.96 pct) 1026.00 (3.38 pct)
512: 47552.00 (0.00 pct) 47168.00 (0.80 pct) 60736.00 (-27.72 pct) 49856.00 (-4.84 pct)

"wake_affine" hint is beneficial as long as the number of messengers
and the workers is fewer than the number of CPUs in the LLC, beyond
that point it causes overloading and thus causing the workers to wait
behind one another which makes the tail-latency worse. Wake hold
shows some improvements and minimal regression by avoiding cross LLC
migration as a result of the hint.

o Combination of Hints

#workers: tip no-hint fork_affine + wake_affine fork_spread + wake_hold
1: 37.00 (0.00 pct) 38.00 (-2.70 pct) 17.00 (54.05 pct) 40.00 (-8.10 pct)
2: 39.00 (0.00 pct) 36.00 (7.69 pct) 18.00 (53.84 pct) 40.00 (-2.56 pct)
4: 41.00 (0.00 pct) 41.00 (0.00 pct) 20.00 (51.21 pct) 40.00 (2.43 pct)
8: 53.00 (0.00 pct) 54.00 (-1.88 pct) 31.00 (41.50 pct) 54.00 (-1.88 pct)
16: 73.00 (0.00 pct) 74.00 (-1.36 pct) 2884.00 (-3850.68 pct) 80.00 (-9.58 pct)
32: 116.00 (0.00 pct) 124.00 (-6.89 pct) 15408.00 (-13182.75 pct) 123.00 (-6.03 pct)
64: 217.00 (0.00 pct) 215.00 (0.92 pct) 15344.00 (-6970.96 pct) 224.00 (-3.22 pct)
128: 477.00 (0.00 pct) 440.00 (7.75 pct) 14896.00 (-3022.85 pct) 450.00 (5.66 pct)
256: 1062.00 (0.00 pct) 1026.00 (3.38 pct) 15664.00 (-1374.95 pct) 997.00 (6.12 pct)
512: 47552.00 (0.00 pct) 47168.00 (0.80 pct) 60992.00 (-28.26 pct) 55232.00 (-16.15 pct)

Combination of hints bring out worst of the scenarios.

*schbench works best with only fork hint set*

- tbench

o Fork Time Hints

Clients: tip no-hint fork_affine fork_spread
1 573.26 (0.00 pct) 572.29 (-0.16 pct) 572.70 (-0.09 pct) 569.64 (-0.63 pct)
2 1131.19 (0.00 pct) 1119.57 (-1.02 pct) 1131.97 (0.06 pct) 1101.03 (-2.66 pct)
4 2100.07 (0.00 pct) 2070.66 (-1.40 pct) 2094.80 (-0.25 pct) 2011.64 (-4.21 pct)
8 3809.88 (0.00 pct) 3784.16 (-0.67 pct) 3458.94 (-9.21 pct) 3867.70 (1.51 pct)
16 6560.72 (0.00 pct) 6449.64 (-1.69 pct) 6342.78 (-3.32 pct) 6700.50 (2.13 pct)
32 12203.23 (0.00 pct) 12180.02 (-0.19 pct) 10411.44 (-14.68 pct) 13104.29 (7.38 pct)
64 22389.81 (0.00 pct) 23084.51 (3.10 pct) 16614.14 (-25.79 pct) 24353.76 (8.77 pct)
128 32449.37 (0.00 pct) 33561.28 (3.42 pct) 19971.67 (-38.45 pct) 36201.16 (11.56 pct)
256 58962.40 (0.00 pct) 59118.43 (0.26 pct) 26836.13 (-54.48 pct) 61721.06 (4.67 pct)
512 59608.71 (0.00 pct) 60246.78 (1.07 pct) 36889.55 (-38.11 pct) 59696.57 (0.14 pct)
1024 58037.02 (0.00 pct) 58532.41 (0.85 pct) 39936.06 (-31.18 pct) 57445.62 (-1.01 pct)

tbench showcases the thundering herd problem where the task after
initial wakeup quickly go to sleep. For such a case, finding an
idle group based on utilization using the fork spread allows for a
better distribution of tasks early on and lesser migrations later to
reach a stable state.

*tbench can gain some more performance with fork spread hint on a split-LLC system such as AMD EPYC*

- Results from Unified LLC

Changes are based on tip:sched/core at
commit: 5531ecffa4b9 "sched: Add update_current_exec_runtime helper"

For ease of the readers, same short forms are followed.

Following are results from tests carried out on a dual socket
system based on 3rd Generation Intel Xeon Scalable Processors
(2 x 32C/64T):

- Hackbench

0 Fork time hint

Test: tip no-hint fork_affine fork_spread
1-groups: 2.38 (0.00 pct) 2.33 (2.10 pct) 2.25 (5.46 pct) 2.49 (-4.62 pct)
2-groups: 3.84 (0.00 pct) 3.64 (5.20 pct) 3.80 (1.04 pct) 3.77 (1.82 pct)
4-groups: 4.50 (0.00 pct) 4.39 (2.44 pct) 4.38 (2.66 pct) 4.40 (2.22 pct)
8-groups: 5.84 (0.00 pct) 5.94 (-1.71 pct) 5.96 (-2.05 pct) 5.92 (-1.36 pct)
16-groups: 11.91 (0.00 pct) 12.37 (-3.86 pct) 12.80 (-7.47 pct) 12.68 (-6.46 pct)

For lower number of groups, the hints show benefit for the test
system, however, the performance degrades for larger number of groups
for both the type of fork time hints.

o Wakeup Hint

Test: tip no-hint wake_affine wake_hold
1-groups: 2.38 (0.00 pct) 2.33 (2.10 pct) 1.98 (16.80 pct) 2.01 (15.54 pct)
2-groups: 3.84 (0.00 pct) 3.64 (5.20 pct) 3.39 (11.71 pct) 3.96 (-3.12 pct)
4-groups: 4.50 (0.00 pct) 4.39 (2.44 pct) 4.13 (8.22 pct) 4.83 (-7.33 pct)
8-groups: 5.84 (0.00 pct) 5.94 (-1.71 pct) 6.03 (-3.25 pct) 6.11 (-4.62 pct)
16-groups: 11.91 (0.00 pct) 12.37 (-3.86 pct) 9.47 (20.48 pct) 10.78 (9.48 pct)

Affine wakeup seems to benefit a unified LLC system too. However,
wake_hold has inconsistent behavior here too like in case of split-LLC
systems.

o Combination of hint

Test: tip no-hint fork_affine+wake_affine fork_spread+wake_hold
1-groups: 2.38 (0.00 pct) 2.33 (2.10 pct) 1.98 (16.80 pct) 3.44 (-44.53 pct)
2-groups: 3.84 (0.00 pct) 3.64 (5.20 pct) 3.43 (10.67 pct) 4.39 (-14.32 pct)
4-groups: 4.50 (0.00 pct) 4.39 (2.44 pct) 4.16 (7.55 pct) 4.52 (-0.44 pct)
8-groups: 5.84 (0.00 pct) 5.94 (-1.71 pct) 6.01 (-2.91 pct) 5.99 (-2.56 pct)
16-groups: 11.91 (0.00 pct) 12.37 (-3.86 pct) 9.65 (18.97 pct) 10.63 (10.74 pct)

With fork affine and wake affine, the benefit compounds based on the
test results.

- schbench

o fork time hints

#workers: tip no-hint fork_affine fork_spread
1: 18.00 (0.00 pct) 14.00 (22.22 pct) 16.00 (11.11 pct) 15.00 (16.66 pct)
2: 17.00 (0.00 pct) 18.00 (-5.88 pct) 16.00 (5.88 pct) 19.00 (-11.76 pct)
4: 20.00 (0.00 pct) 21.00 (-5.00 pct) 19.00 (5.00 pct) 21.00 (-5.00 pct)
8: 28.00 (0.00 pct) 29.00 (-3.57 pct) 27.00 (3.57 pct) 28.00 (0.00 pct)
16: 45.00 (0.00 pct) 46.00 (-2.22 pct) 42.00 (6.66 pct) 45.00 (0.00 pct)
32: 79.00 (0.00 pct) 79.00 (0.00 pct) 74.00 (6.32 pct) 79.00 (0.00 pct)
64: 166.00 (0.00 pct) 167.00 (-0.60 pct) 163.00 (1.80 pct) 165.00 (0.60 pct)
128: 270.00 (0.00 pct) 291.00 (-7.77 pct) 276.00 (-2.22 pct) 297.00 (-10.00 pct)
256: 42432.00 (0.00 pct) 42560.00 (-0.30 pct) 42816.00 (-0.90 pct) 42304.00 (0.30 pct)
512: 90240.00 (0.00 pct) 90240.00 (0.00 pct) 90240.00 (0.00 pct) 90496.00 (-0.28 pct)

An affine fork benefits the test system however the spread at fork is
not ideal for it.

o Wakeup hints

#workers: tip no-hint wake_affine wake_hold
1: 18.00 (0.00 pct) 14.00 (22.22 pct) 15.00 (16.66 pct) 17.00 (5.55 pct)
2: 17.00 (0.00 pct) 18.00 (-5.88 pct) 15.00 (11.76 pct) 18.00 (-5.88 pct)
4: 20.00 (0.00 pct) 21.00 (-5.00 pct) 19.00 (5.00 pct) 21.00 (-5.00 pct)
8: 28.00 (0.00 pct) 29.00 (-3.57 pct) 27.00 (3.57 pct) 29.00 (-3.57 pct)
16: 45.00 (0.00 pct) 46.00 (-2.22 pct) 45.00 (0.00 pct) 48.00 (-6.66 pct)
32: 79.00 (0.00 pct) 79.00 (0.00 pct) 80.00 (-1.26 pct) 82.00 (-3.79 pct)
64: 166.00 (0.00 pct) 167.00 (-0.60 pct) 144.00 (13.25 pct) 165.00 (0.60 pct)
128: 270.00 (0.00 pct) 291.00 (-7.77 pct) 28384.00 (-10412.59 pct) 256.00 (5.18 pct)
256: 42432.00 (0.00 pct) 42560.00 (-0.30 pct) 60352.00 (-42.23 pct) 43584.00 (-2.71 pct)
512: 90240.00 (0.00 pct) 90240.00 (0.00 pct) 116608.00 (-29.21 pct) 91008.00 (-0.85 pct)

Affine wakeup is beneficial as long as the number of workers is less
than the CPUs in each socket. Beyond that, the socket gets overloaded
and leads to massive increase in tail latency like what was seen
in the split-LLC case. Staying on same LLC benefits from initial
distribution and having an idle CPU in the same LLC unless system is
heavily overloaded.

o Combination of hints

#workers: tip no-hint fork_affine+wake_affine fork_spread+wake_hold
1: 18.00 (0.00 pct) 14.00 (22.22 pct) 14.00 (22.22 pct) 14.00 (22.22 pct)
2: 17.00 (0.00 pct) 18.00 (-5.88 pct) 14.00 (17.64 pct) 16.00 (5.88 pct)
4: 20.00 (0.00 pct) 21.00 (-5.00 pct) 19.00 (5.00 pct) 21.00 (-5.00 pct)
8: 28.00 (0.00 pct) 29.00 (-3.57 pct) 26.00 (7.14 pct) 28.00 (0.00 pct)
16: 45.00 (0.00 pct) 46.00 (-2.22 pct) 40.00 (11.11 pct) 46.00 (-2.22 pct)
32: 79.00 (0.00 pct) 79.00 (0.00 pct) 81.00 (-2.53 pct) 80.00 (-1.26 pct)
64: 166.00 (0.00 pct) 167.00 (-0.60 pct) 142.00 (14.45 pct) 166.00 (0.00 pct)
128: 270.00 (0.00 pct) 291.00 (-7.77 pct) 28192.00 (-10341.48 pct) 251.00 (7.03 pct)
256: 42432.00 (0.00 pct) 42560.00 (-0.30 pct) 62016.00 (-46.15 pct) 42816.00 (-0.90 pct)
512: 90240.00 (0.00 pct) 90240.00 (0.00 pct) 108672.00 (-20.42 pct) 93056.00 (-3.12 pct)

Combination the right hints compounds the benefit. Affine hints show
good improvements until the LLC threshold is crossed. A good spread
and holding the position helps in reducing expensive cross socket
migration benefiting the benchmark in some cases.

- tbench

o Fork time hints

Clients: tip no-hint fork_affine fork_spread
1 122.36 (0.00 pct) 121.33 (-0.84 pct) 121.49 (-0.71 pct) 134.69 (10.07 pct)
2 247.13 (0.00 pct) 266.78 (7.95 pct) 271.97 (10.05 pct) 245.39 (-0.70 pct)
4 516.31 (0.00 pct) 537.59 (4.12 pct) 511.37 (-0.95 pct) 508.46 (-1.52 pct)
8 1033.53 (0.00 pct) 1106.22 (7.03 pct) 1010.87 (-2.19 pct) 1175.12 (13.69 pct)
16 3094.23 (0.00 pct) 2509.74 (-18.88 pct) 3410.97 (10.23 pct) 2588.06 (-16.35 pct)
32 10757.40 (0.00 pct) 10585.54 (-1.59 pct) 10481.10 (-2.56 pct) 10711.71 (-0.42 pct)
64 16170.10 (0.00 pct) 15864.99 (-1.88 pct) 15856.51 (-1.93 pct) 16009.95 (-0.99 pct)
128 15168.29 (0.00 pct) 16547.25 (9.09 pct) 23796.79 (56.88 pct) 14229.63 (-6.18 pct)
256 32474.25 (0.00 pct) 32100.39 (-1.15 pct) 31633.93 (-2.58 pct) 32513.67 (0.12 pct)
512 29543.03 (0.00 pct) 29537.67 (-0.01 pct) 29282.98 (-0.88 pct) 29894.14 (1.18 pct)

An affine fork seems to benefit tbench when the system is fully
loaded in case of workloads such as tbench on the test system. This
is quite opposite of the unified LLC system which benefitted from
spreading on fork.

- Stream

- 10 runs

o 1 thread

Test: tip no_hint fork_spread
Copy: 28127.35 (0.00 pct) 28367.72 (0.85 pct) 27293.80 (-2.96 pct)
Scale: 16248.13 (0.00 pct) 16332.26 (0.51 pct) 15901.45 (-2.13 pct)
Add: 17146.86 (0.00 pct) 17491.69 (2.01 pct) 17062.82 (-0.49 pct)
Triad: 17552.85 (0.00 pct) 17723.45 (0.97 pct) 17301.79 (-1.43 pct)

o 2 threads

Test: tip no_hint fork_spread
Copy: 48920.60 (0.00 pct) 49831.92 (1.86 pct) 53687.16 (9.74 pct)
Scale: 27370.11 (0.00 pct) 27560.13 (0.69 pct) 32197.46 (17.63 pct)
Add: 29909.84 (0.00 pct) 30363.41 (1.51 pct) 34311.99 (14.71 pct)
Triad: 30225.18 (0.00 pct) 30520.83 (0.97 pct) 34982.68 (15.74 pct)

o 4 threads

Test: tip no_hint fork_spread
Copy: 79603.85 (0.00 pct) 80394.36 (0.99 pct) 89379.42 (12.28 pct)
Scale: 46870.97 (0.00 pct) 47417.41 (1.16 pct) 54352.83 (15.96 pct)
Add: 52066.71 (0.00 pct) 52623.44 (1.06 pct) 59269.40 (13.83 pct)
Triad: 52069.18 (0.00 pct) 52646.49 (1.10 pct) 59830.77 (14.90 pct)

o 8 threads

Test: tip no_hint fork_spread
Copy: 113566.60 (0.00 pct) 113613.06 (0.04 pct) 128656.04 (13.28 pct)
Scale: 71443.07 (0.00 pct) 71722.27 (0.39 pct) 79848.75 (11.76 pct)
Add: 81408.48 (0.00 pct) 81491.98 (0.10 pct) 90695.65 (11.40 pct)
Triad: 81987.96 (0.00 pct) 81992.01 (0.00 pct) 92047.47 (12.26 pct)

o 16 threads

Test: tip no_hint fork_spread
Copy: 152139.80 (0.00 pct) 151279.55 (-0.56 pct) 148607.24 (-2.32 pct)
Scale: 103163.08 (0.00 pct) 103628.35 (0.45 pct) 102184.90 (-0.94 pct)
Add: 113404.79 (0.00 pct) 113284.87 (-0.10 pct) 111183.33 (-1.95 pct)
Triad: 115388.28 (0.00 pct) 116244.97 (0.74 pct) 114411.79 (-0.84 pct)

o 32 threads

Test: tip no_hint fork_spread
Copy: 161240.24 (0.00 pct) 161470.60 (0.14 pct) 162977.90 (1.07 pct)
Scale: 124507.68 (0.00 pct) 126147.95 (1.31 pct) 126678.26 (1.74 pct)
Add: 130164.90 (0.00 pct) 130225.95 (0.04 pct) 131567.75 (1.07 pct)
Triad: 132933.44 (0.00 pct) 134431.69 (1.12 pct) 135044.50 (1.58 pct)

- 100 runs

o 1 thread

Test: tip no_hint fork_spread
Copy: 28087.03 (0.00 pct) 28391.25 (1.08 pct) 27232.70 (-3.04 pct)
Scale: 16187.98 (0.00 pct) 16295.17 (0.66 pct) 15827.29 (-2.22 pct)
Add: 17122.77 (0.00 pct) 17497.03 (2.18 pct) 16987.19 (-0.79 pct)
Triad: 17524.57 (0.00 pct) 17770.79 (1.40 pct) 17220.50 (-1.73 pct)

o 2 threads

Test: tip no_hint fork_spread
Copy: 49153.60 (0.00 pct) 49578.59 (0.86 pct) 52587.24 (6.98 pct)
Scale: 27712.34 (0.00 pct) 27634.70 (-0.28 pct) 30989.75 (11.82 pct)
Add: 30252.12 (0.00 pct) 30473.89 (0.73 pct) 33258.64 (9.93 pct)
Triad: 30526.42 (0.00 pct) 30643.25 (0.38 pct) 33795.72 (10.70 pct)
o 4 threads

Test: tip no_hint fork_spread
Copy: 81401.84 (0.00 pct) 81154.71 (-0.30 pct) 92914.24 (14.14 pct)
Scale: 47690.70 (0.00 pct) 47356.46 (-0.70 pct) 54671.55 (14.63 pct)
Add: 52859.84 (0.00 pct) 52586.28 (-0.51 pct) 59958.56 (13.42 pct)
Triad: 52931.46 (0.00 pct) 52647.24 (-0.53 pct) 60669.41 (14.61 pct)

o 8 threads

Test: tip no_hint fork_spread
Copy: 115285.88 (0.00 pct) 116381.75 (0.95 pct) 132874.03 (15.25 pct)
Scale: 72189.01 (0.00 pct) 72751.02 (0.77 pct) 82349.53 (14.07 pct)
Add: 82438.52 (0.00 pct) 82758.78 (0.38 pct) 93489.17 (13.40 pct)
Triad: 82791.24 (0.00 pct) 83446.12 (0.79 pct) 95010.20 (14.75 pct)

o 16 threads

Test: tip no_hint fork_spread
Copy: 152361.42 (0.00 pct) 151922.38 (-0.28 pct) 152300.21 (-0.04 pct)
Scale: 103042.94 (0.00 pct) 103833.81 (0.76 pct) 104310.93 (1.23 pct)
Add: 114032.41 (0.00 pct) 113004.05 (-0.90 pct) 113682.56 (-0.30 pct)
Triad: 115319.73 (0.00 pct) 116174.81 (0.74 pct) 116782.15 (1.26 pct)

o 32 threads

Test: tip no_hint fork_spread
Copy: 161485.98 (0.00 pct) 162183.58 (0.43 pct) 163371.33 (1.16 pct)
Scale: 125366.83 (0.00 pct) 126356.26 (0.78 pct) 126856.35 (1.18 pct)
Add: 129846.24 (0.00 pct) 130469.97 (0.48 pct) 131863.96 (1.55 pct)
Triad: 133523.78 (0.00 pct) 134498.68 (0.73 pct) 134996.96 (1.10 pct)

In case of Stream, the benchmark is benefiting from spreading early as
it reduces cache-contention and has an added advantage of the higher
cross-socket bandwidth. After Mel Gorman's NUMA imbalance rework for
split-LLC systems, (commit: e496132ebedd "sched/fair: Adjust the
allowed NUMA imbalance when SD_NUMA spans multiple LLCs") systems such
as AMD EPYC spread tasks by default once the number of tasks in NUMA
domain crosses the number of LLCs in it thus not needing a hint anymore
to spread the Stream threads optimally.

- Limitations in current implementation

o The wakeup path uses sd_shared->has_idle_core as a bailout metric for
hint in the subsequent wakeup path. This metric is not accurate and
can lead to tasks piling on the same CPU. There exist other
alternatives such as atomically reading sd_shared->nr_busy_cpus that
can lead to higher scheduling latency.

o Considerations are not made for task pinning and asymmetric CPU
capacity. The logic today assumes all CPUs in the system as equal.

o Parts of implementation assume a SMT-2 system to calculate the number
of cores in sched group. This can be improved by adding a generic
logic that counts number of online cores when the topology is
generated to account for systems with different SMT configurations
and be accurate after a CPU hotplug.

o The API exposes low-level decision points which might not be ideal
for user with little to no knowledge of scheduler internals.
As such, an abstract user hint should map to an appropriate
combination of these low-level hints.

# Future Work

o Fix the shortcomings exposed by the prototype - mainly the bailout
point in the wakeup path.
o Explore other possible hints to influence task placement that can
benefit other workloads.
o Test hinting on larger workloads containing tasks exhibiting different
behavior.
o Make hinting a usable feature based on community feedback.
o Deciding on an interface through which users can provide hints to the
scheduler.
o Enumerating the hints that are sensible to the user.

Any feedback on the prototype, approach, and design is
highly appreciated.

--
K Prateek Nayak (5):
task_struct: Add field for per-task userspace scheduler hint
prctl: Add interface and helper functions to set hints
sched/fair: Add support for hints in the subsequent wakeup path
sched/fair: Consider hints in the initial task wakeup path
sched/fair: Add exception for hints in load balancing path

include/linux/sched.h | 2 +
include/uapi/linux/prctl.h | 9 ++++
init/init_task.c | 1 +
kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 98 ++++++++++++++++++++++++++++++++++++--
kernel/sys.c | 5 ++
6 files changed, 195 insertions(+), 3 deletions(-)

--
2.25.1