Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

From: Chen Yu
Date: Thu May 25 2023 - 03:49:14 EST


On 2023-05-22 at 09:10:33 +0200, Mike Galbraith wrote:
> On Mon, 2023-05-22 at 12:10 +0800, Chen Yu wrote:
> >
> > Meanwhile, I looked back at Yicong's proposal on waking up task
> > on local cluster first. It did show some improvement on Jacobsville,
> > I guess that could also be a chance to reduce C2C latency.
>
> Something else to consider is that communication data comes in many
> size chunks and volumes, and cache footprints can easily be entirely
> local constructs unrelated to any transferred data.
>
> At one extreme of the huge spectrum of possibilities, a couple less
> than brilliant tasks playing high speed ping-pong can bounce all over a
> box with zero consequences, but for a pair more akin to say Einstein
> and Bohr pondering chalkboards full of mind bending math and meeting
> occasionally at the water cooler to exchange snarky remarks, needlessly
> bouncing them about forces them to repopulate chalkboards, and C2C
> traffic you try to avoid via bounce you generate via bounce.
>
> Try as you may, methinks you're unlikely to succeed at avoiding C2C in
> a box where roughly half of all paths are C2C. What tasks have in
> their pockets and what they'll do with a CPU at any point in time is
> unknown and unknowable by the scheduler, dooming pinpoint placement
> accuracy as a goal.
>
I guess what you mean is that, for a wakee with a large local cache
footprint, it is not a good idea to wake it up on a remote core,
because the wakee would then have to repopulate the cache from scratch.
Yes, the problem is that the scheduler currently lacks a metric that
indicates a task's working set, i.e. per-task cache footprint tracking
(although NUMA balancing does maintain per-task, per-node statistics).
Given such a cache-aware metric, the wakee could be placed on a candidate
CPU whose cache locality (either LLC or L2) is friendly to the wakee.
Because there is no such accurate metric, a heuristic seems to be a
compromise for predicting task placement.

The C2C was mainly caused by accessing the global tg->load, so besides
wakeup placement, there should also be other ways to mitigate C2C,
such as reducing the frequency of accesses to tg->load.
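
To illustrate what "reducing the frequency" could look like, here is a
rough userspace sketch, not kernel code. It assumes a per-CPU contribution
that is folded into the shared counter only when it has drifted by at least
some threshold. The names group_model, rq_model, sync_group_load() and
min_delta are all hypothetical and made up for this example:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical model types; these are not kernel structures. */
struct group_model {
	atomic_long load;	/* shared counter, the contended cache line */
};

struct rq_model {
	long load;		/* current per-CPU load, local only */
	long contrib;		/* what this CPU last published to the group */
};

/*
 * Publish the per-CPU load to the shared counter only when it has
 * drifted by at least min_delta since the last publish.
 * Returns 1 if the shared counter was touched, 0 if it was skipped.
 */
static int sync_group_load(struct group_model *tg, struct rq_model *rq,
			   long min_delta)
{
	long delta = rq->load - rq->contrib;

	if (labs(delta) < min_delta)
		return 0;	/* small drift: stay off the shared line */

	atomic_fetch_add(&tg->load, delta);
	rq->contrib = rq->load;
	return 1;
}
```

The trade-off in such a scheme is staleness: a larger min_delta means fewer
writes to the shared line, but the group-wide sum lags further behind the
real per-CPU values.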

Besides that, while studying the history of wake_wide(), I found that
10 years ago Michael proposed exactly the same strategy: check whether
tasks A and B are waking each other up; if they are, put them together,
otherwise spread them across different LLCs:
https://lkml.org/lkml/2013/3/6/73
That version eventually evolved into what wake_wide() looks like today,
in your patch:
https://marc.info/?l=linux-kernel&m=143688840122477
If I understand correctly, wake_wide() decides whether to wake up the
task on an idle CPU in the local LLC or in a remote LLC. If so, does it
also make sense to extend wake_wide() to the SMT and L2 domains?
Say, if wake_wide(p, nr_smt) returns true, then look for an idle CPU in a
remote SMT domain, otherwise scan for an idle CPU in the local SMT domain.
In that case, does the has_idle_core check still matter?
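
To make the question concrete, below is a minimal userspace model of the
flip-ratio check as I understand it from today's wake_wide(), with the
domain size passed in as a parameter instead of being read from the LLC
size. This is only a sketch: task_model, wake_wide_model() and the factor
argument are hypothetical names for illustration, and a two-argument
wake_wide(p, nr_smt) does not exist in the kernel today.

```c
#include <stdbool.h>

/* Hypothetical model type; this is not the kernel's task_struct. */
struct task_model {
	unsigned int wakee_flips;	/* how often the task switches wakee */
};

/*
 * Model of the wake_wide() heuristic with the domain size passed in as
 * "factor" (the LLC size today; the SMT or L2 size in the extension
 * asked about above).
 *
 * Returns true when the flip counts look like a 1:N waker/wakee
 * relationship, i.e. the wakee should be spread out, and false when
 * the pair looks 1:1 and should stay close together.
 */
static bool wake_wide_model(const struct task_model *waker,
			    const struct task_model *wakee,
			    unsigned int factor)
{
	unsigned int master = waker->wakee_flips;
	unsigned int slave = wakee->wakee_flips;

	/* Treat the side with more flips as the "master". */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	if (slave < factor || master < slave * factor)
		return false;

	return true;
}
```

With a small factor such as the SMT sibling count, the check trips much
more easily than with the LLC size, so the same pair of tasks could be
classified as "wide" at the SMT level while still being kept within the
LLC.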

thanks,
Chenyu