Very high scheduling delay with plenty of idle CPUs

From: Saravana Kannan
Date: Fri Nov 08 2024 - 02:28:54 EST


Hi scheduler folks,

I'm running into some weird scheduling issues when testing non-sched
changes on a Pixel 6 running a kernel close to 6.12-rc5. I'm not sure
if this is an issue in earlier kernel versions or not.

The async suspend/resume code calls async_schedule_dev_nocall() to
queue up a bunch of work. These queued-up work items seem to run in
kworker threads.

However, there have been many times where I see scheduling latency
(runnable, but not running) of 4.5 ms or higher for a kworker thread
when there are plenty of idle CPUs.
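
For anyone not looking at the trace: "runnable, but not running" here
means the gap between a thread's wakeup and the point where it is
switched in on a CPU. Below is a minimal sketch of that computation,
assuming the trace has already been reduced to simple records. The
Event type, the "wakeup"/"switch_in" labels, and the sample numbers are
hypothetical and made up for illustration; they are not taken from the
trace linked below. The sketch also ignores threads that become
runnable through preemption rather than a wakeup.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts_us: int   # timestamp in microseconds (hypothetical unit choice)
    kind: str    # "wakeup" or "switch_in" (hypothetical labels)
    tid: int     # thread id of the task being woken or switched in


def runnable_latencies(events):
    """Return (tid, latency_us) for each wakeup -> switch-in pair."""
    pending = {}   # tid -> timestamp of the earliest unmatched wakeup
    result = []
    for ev in sorted(events, key=lambda e: e.ts_us):
        if ev.kind == "wakeup":
            # Keep the first wakeup if several arrive before a switch-in.
            pending.setdefault(ev.tid, ev.ts_us)
        elif ev.kind == "switch_in" and ev.tid in pending:
            result.append((ev.tid, ev.ts_us - pending.pop(ev.tid)))
    return result


# Made-up sample data: tid 42 waits 4.6 ms, tid 43 waits 0.1 ms.
events = [
    Event(1000, "wakeup", 42),
    Event(2000, "wakeup", 43),
    Event(2100, "switch_in", 43),
    Event(5600, "switch_in", 42),
]

# Flag anything at or above the 4.5 ms mark mentioned above.
slow = [(tid, lat) for tid, lat in runnable_latencies(events) if lat >= 4500]
```

With real data, the interesting part is cross-checking each flagged
interval against per-CPU idle state over the same window.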

Does async_schedule_dev_nocall() have some weird limitations on where
the work can run? I know it has some NUMA-related stuff, but the Pixel
6 doesn't have NUMA. This oddity unnecessarily increases
suspend/resume latency as it adds up across kworker threads, so I'd
appreciate any insights on what might be happening.

If you know how to use perfetto (it's really pretty simple, all you
need to know is WASD and clicking), here's an example:
https://ui.perfetto.dev/#!/?s=e20045736e7dfa1e897db6489710061d2495be92

Thanks,
Saravana