Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

From: Andrei Vagin
Date: Mon Apr 17 2023 - 15:25:17 EST


On Mon, Apr 10, 2023 at 11:17 AM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>
> On 2023-04-09 at 21:56:26 -0700, Andrei Vagin wrote:
> > On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> > >
> > > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > > From: Peter Oskolkov <posk@xxxxxxxxxx>
> > > >
> > > > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > > context switching use cases.
> > > >
> > > > In addition, make ttwu external rather than static so that
> > > > the flag could be passed to it from outside of sched/core.c.
> > > >
> > > > Signed-off-by: Peter Oskolkov <posk@xxxxxxxxxx>
> > > > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxxx>
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > > if (wake_flags & WF_TTWU) {
> > > > record_wakee(p);
> > > >
> > > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > + return cpu;
> > > > +
> > > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> > > are regressions when running some workloads, and these workloads want to be
> > > spreaded on idle CPUs whenever possible.
> > > The reason for the regression is that, above change chooses current CPU no matter
> > > what the load/utilization of this CPU is. So task are stacked on 1 CPU and hurts
> > > throughput/latency. And I believe this issue would be more severe on system with
> > > smaller number of CPU within 1 LLC(when compared to Intel platforms), such as AMD,
> > > Arm64.
> >
> > WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> > attempt to change how WF_SYNC works:
> >
> > https://www.spinics.net/lists/kernel/msg4567650.html
> >
> > Then we've found that this idea doesn't work well, and it is a reason
> > why we have the separate WF_CURRENT_CPU flag.
> >
> I see, in seccomp case, even the idle CPU is not a better choice.
> > >
> > > I know WF_CURRENT_CPU benefits seccomp, and can we make this change more genefic
> > > to benefit other workloads, by making the condition to trigger WF_CURRENT_CPU stricter?
> > > Say, only current CPU has 1 runnable task, and treat current CPU as the last resort by
> > > checking if the wakee's previous CPU is not idle. In this way, we can enable WF_CURRENT_CPU flag
> > > dynamically when some condition is met(a short task for example).
> >
> > We discussed all of these here and here:
> >
> > https://www.spinics.net/lists/kernel/msg4657545.html
> >
> > https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@xxxxxxxxxxxxxx/
> >
> > I like your idea about short-duration tasks, but I think it is a
> > separate task and it has to be done in a separate patch set. Here, I
> > solve the problem of optimizing synchronous switches when one task wakes
> > up another one and falls asleep immediately after that. Waking up the
> > target task on the current CPU looks reasonable for a few reasons in
> > this case. First, waking up a task on the current CPU is cheaper than on
> > another one and it is much cheaper than waking on an idle cpu.
> It depends. For waker and wakee that compete for cache resource and do
> not have share data, sometimes an idle target would be better.
> > Second,
> > when tasks want to do synchronous switches, they often exchange some
> > data, so memory caches can play on us.
> I like the name of 'WF_CURRENT_CPU' too : ) and I was thinking that if this could
> become a auto-detect behavior so others can benefit from this.
>
> If I understand correctly, the scenario this patch deals with is:
> task A wakeups task B, task A and taks B have close relationship with each
> other(cache sharing eg), when task A fall asleep, choose A's CPU, rather than an
> idle CPU.
>
> I'm thinking if the following logic would cover your case:
> 1. the waker A is a short duration one (waker will fall asleep soon)
> 2. the waker B is a short duration one (impact of B is minor)

In the seccomp case, A or B can be a non-short-duration but if they do
synchronous
switches they get all the benefits of WF_CURRENT_CPU.

> 3. the A->wakee_flips is 0 and A->last_wakee = B

In our case, a "supervisor" is written in golang and there are
goroutines that can be
executed from different system threads, so this condition will fail often too.

> 4. the A->wakee_flips is 0 and B->last_wakee = A
> 5, cpu(A)->nr_running = 1
>
> (3 and 4 mean that, A and B wake up each other, so it is likely that
> they share cache data, and they are good firends to be put together)
>
> If above conditions are met, choose current CPU. In this way, WF_CURRENT_CPU
> can be set dynamically.
>
> thanks,
> Chenyu