Re: Very high scheduling delay with plenty of idle CPUs

From: Saravana Kannan
Date: Mon Nov 11 2024 - 01:15:55 EST


On Sun, Nov 10, 2024 at 9:17 PM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>
> (+ Tobias)
>
> Hello Saravana,
>
> On 11/10/2024 11:19 AM, Saravana Kannan wrote:
> > On Fri, Nov 8, 2024 at 12:31 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >>
> >> On Thu, Nov 07, 2024 at 11:28:07PM -0800, Saravana Kannan wrote:
> >>> Hi scheduler folks,
> >>>
> >>> I'm running into some weird scheduling issues when testing non-sched
> >>> changes on a Pixel 6 that's running close to 6.12-rc5. I'm not sure if
> >>> this is an issue in earlier kernel versions or not.
> >>>
> >>
> >> It's a bit unfortunate you don't have a known good kernel there. Anyway,
> >> one thing that recently came up is that DELAY_DEQUEUE can cause some
> >> delays, specifically it can inhibit wakeup migration.
> >
> > I disabled DELAY_DEQUEUE and I'm still seeing preemptions or
> > scheduling latency (after wakeup)
>
> On the scheduling latency front, have you tried running with
> RUN_TO_PARITY and/or PLACE_LAG disabled? If the tick granularity on your
> system is less than the "base_slice_ns", disabling RUN_TO_PARITY can
> help switch to a newly woken task slightly faster. Disabling
> PLACE_LAG makes sure the newly woken task is always eligible for
> selection. However, both come with the disadvantage of a sharp
> increase in the number of involuntary context switches in some of the
> scenarios we have tested.

Yeah, I don't think I can just change these, because that'd have a much
wider impact on power and performance. I really need something
isolated to the suspend/resume scenario, or a generic fix where the
scheduler makes a better CPU selection for a thread. I'm saying
"better" because I'd think it would also be better from a power
perspective in the specific example I gave.

> There is a separate thread from Cristian
> making a case for toggling these features via sysfs and keeping them
> disabled by default [0].
>
> [0] https://lore.kernel.org/lkml/20241017052000.99200-1-cpru@xxxxxxxxxx/
>
> > when there are plenty of CPUs even
> > within the same cluster/frequency domain.
>
> I'm not aware of any recent EAS-specific changes that could have led to
> larger scheduling latencies, but Tobias had reported a similar increase
> in kworker scheduling latency in a different context when EEVDF was
> first introduced [1]. I'm not sure if he is still observing the same
> behavior on current upstream, but would it be possible to check whether
> you see the large scheduling latency only starting with v6.6 (when
> EEVDF was introduced) and not on v6.5 (which ran the older CFS logic)?
> I'm also assuming the system / benchmark does not change the default
> scheduler-related debug tunables, some of which went away in v6.6.

Hmmm... I don't know if this is specific to EEVDF. But going back to
v6.5 has a lot of other hurdles that I don't want to get into.

>
> [1] https://lore.kernel.org/lkml/c7b38bc27cc2c480f0c5383366416455@xxxxxxxxxxxxx/
>
> >
> > Can we tell the scheduler to just spread out all the tasks during
> > suspend/resume? It doesn't make a lot of sense to try to save power
> > during a suspend/resume; it's almost always cheaper/better to get
> > through it quickly.
>
> Wouldn't that increase the resume latency, since each runnable task
> would need to go through a full idle-CPU selection cycle? Isn't time a
> consideration / concern in the resume path? And unless we go through
> the slow path, isn't it very likely we'll end up making the same task
> placement decisions again?

As a quick experiment, I hacked up the cpu_overutilized() function to
return true during suspend/resume, and the threads are now nicely
spread out and running in parallel. That reduces the total time of the
dpm_resume*() phases from 90ms to 75ms on my Pixel 6.
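
For reference, the hack is conceptually along the lines of the sketch
below. I'm paraphrasing the fair.c helper from memory, and
"resume_boost_active" is a made-up flag that would still need to be
plumbed in from the PM core -- it doesn't exist today.

/* kernel/sched/fair.c -- rough sketch, not a tested patch */

/*
 * Hypothetical flag, set/cleared by the PM core around the
 * dpm_suspend*()/dpm_resume*() phases.
 */
extern bool resume_boost_active;

static inline bool cpu_overutilized(int cpu)
{
	/*
	 * While a system-wide suspend/resume transition is in flight,
	 * report every CPU as overutilized so EAS placement is bypassed
	 * and wakeups spread across the idle CPUs.
	 */
	if (READ_ONCE(resume_boost_active))
		return true;

	if (!sched_energy_enabled())
		return false;

	/* Existing capacity-fit check stays as-is. */
	return !util_fits_cpu(cpu_util_cfs(cpu),
			      uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN),
			      uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX),
			      cpu);
}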

Also, this whole email thread started because I'm optimizing the
suspend/resume code to cut down on sleeps/wakeups and the number of
kworker threads. With those optimizations plus the overutilization
hack, resume time has dropped to 60ms.

Peter,

Would you be open to the scheduler being aware of the
dpm_suspend*()/dpm_resume*() phases and triggering the overutilized
behavior during them? I know it's a very use-case-specific behavior,
but how often do we NOT want to speed up suspend/resume? We could make
this a CONFIG option or a kernel command line option -- say,
fast_suspend or something like that.
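
To make that concrete, here's a minimal sketch of what I have in mind.
All of the names below are hypothetical -- nothing like this exists
today.

/* kernel/sched/fair.c -- hooks the PM core could call */
static DEFINE_STATIC_KEY_FALSE(sched_pm_fast_transition);

void sched_pm_transition_begin(void)
{
	static_branch_enable(&sched_pm_fast_transition);
}

void sched_pm_transition_end(void)
{
	static_branch_disable(&sched_pm_fast_transition);
}

cpu_overutilized() would then bail out early with:

	if (static_branch_unlikely(&sched_pm_fast_transition))
		return true;

and dpm_suspend_start()/dpm_resume_end() (or their callers) would
bracket the transition with the begin/end calls, gated on the CONFIG or
"fast_suspend" command line switch mentioned above.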

Thanks,
Saravana