Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload

From: Vincent Guittot

Date: Wed Apr 01 2026 - 10:09:59 EST

On Wed, 1 Apr 2026 at 12:49, Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:
>
> On Wed, Apr 1, 2026 at 6:05 PM Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
> >
> > On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:
> > > > >
> > > > > Dear Linux maintainers and reviewers,
> > > > >
> > > > > I am writing to report a severe scheduling latency issue we recently
> > > > > discovered on Linux Kernel 6.12.
> > > > >
> > > > > Issue Description
> > > > >
> > > > > We observed that when running a specific background workload pattern,
> > > > > certain tasks experience excessive scheduling latency. The delay from
> > > > > the runnable state to running on the CPU exceeds 10 seconds, and in
> > > > > extreme cases, it reaches up to 100 seconds.
> > > > >
> > > > > Environment Details
> > > > >
> > > > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k
> > > > > > Architecture: ARM64
> > > > > Hardware: T7300
> > > > > Config: gki_defconfig
> > > > >
> > > > > > RT-app's workload pattern:
> > > > >
> > > > > {
> > > > > "tasks" : {
> > > > > "t0" : {
> > > > > "instance" : 40,
> > > > > "priority" : 0,
> > > > > "cpus" : [ 0, 1, 2, 3 ],
> > > > > "taskgroup" : "/background",
> > > > > "loop" : -1,
> > > > > "run" : 200,
> > > > > "sleep" : 50
> > > > > }
> > > > > }
> > > > > }
> > > > >
> > > > > > We have also applied the following patches:
> > > > >
> > > > > https://lore.kernel.org/all/20251216111321.966709786@xxxxxxxxxxxxxxxxxxx/
> > > > > https://lore.kernel.org/all/20260106170509.413636243@xxxxxxxxxxxxxxxxxxx/
> > > > > https://lore.kernel.org/all/20260323134533.805879358@xxxxxxxxxxxxxxxxxxx/
> > > > >
> > > > >
> > > > > > Could you please advise whether there are known EEVDF changes in
> > > > > > 6.12 that might affect this specific workload pattern?
> > > > >
> > > >
> > > Thanks for the quick response!
> > >
> > > > Could you maybe instead point to some source for the runqslower binary
> > > > you attached? I don't think folks will run random binaries.
> > >
> > > We use the code in kernel "tools/bpf/runqslower".
> > >
> > > >
> > > > Also, it looks like the RT-app description uses the background cgroup,
> > > > can you share the cgroup configuration you have set for that?
> > >
> > > Our "background" cgroup does not have any special configurations applied.
> > >
> > > cpu.shares: Set to 1024, which is consistent with other cgroups on the system.
> > > Bandwidth Control: It is disabled (no cpu.cfs_quota_us limits set).
> > >
> > > >
> > > > Also, did you try to reproduce this against vanilla 6.12-stable ? I'm
> > > > not sure the audience here is going to pay much attention to GKI based
> > > > reports. Were you using any vendorhooks?
> > >
> > > We have verified this on a GKI kernel with all vendor hooks removed.
> > > The issue still reproduces in this environment. This suggests the
> > > problem is not directly caused by our vendor-specific modifications.
> >
> > Did you try on the latest android mainline kernel which is based on
> > v6.19 ? This would help determine if the issue only happens on v6.12
> > or on more recent kernels too
>
> We also tested this case on android kernel 6.18. The issue is still
> reproducible, although the probability of occurrence is significantly
> lower compared to 6.12.
>
>
> >
> > I ran your rt-app json file on the latest tip/sched/core but I don't
> > see any scheduling issue
> >
> > >
> > > We conducted an experiment by disabling the DELAY_DEQUEUE feature.
> > > After turning it off, we observed a significant increase in threads
> > > with extremely long runnable times. Even kworkers started exhibiting
> > > timeout phenomena.
> >
> > Just to make sure, the problem happens even if you don't disable DELAY_DEQUEUE ?
>
> Yes, we see this problem with both DELAY_DEQUEUE on and off.
>
> Additionally, we noticed that the tasks suffering from long scheduling
> latencies frequently belong to different cgroups (e.g., foreground),
> rather than the background cgroup where the rt-app load is running.
> This unexpected cross-group interference is quite puzzling to us...

Do you have more details about what is running at the same time as
rt-app? I thought the problem occurred on the rt-app threads but it
seems to happen on other threads running simultaneously.

I ran your rt-app JSON file in one cgroup, with another rt-app
instance running small tasks in a different cgroup, on both
tip/sched/core and v6.12.79. I can't trigger any scheduling latency
above 35ms, which is not far from the theoretical 30ms = 10 tasks
per CPU * 3ms (2.8ms slices with a 1ms tick).
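
For reference, the 30ms figure can be reproduced with a few lines of Python. This is only a rough sketch of the arithmetic, not of how the scheduler computes anything: it assumes the 40 rt-app instances spread evenly over the 4 allowed CPUs, and that a 2.8ms slice is effectively rounded up to the next 1ms tick boundary (my reading of the "3ms" figure). The helper name `worst_case_wait_ms` is made up for illustration.

```python
import json
import math

# rt-app configuration from the original report.
RT_APP_CONFIG = """
{
    "tasks": {
        "t0": {
            "instance": 40,
            "priority": 0,
            "cpus": [0, 1, 2, 3],
            "taskgroup": "/background",
            "loop": -1,
            "run": 200,
            "sleep": 50
        }
    }
}
"""


def worst_case_wait_ms(config, slice_ms=2.8, tick_ms=1.0):
    """Rough upper bound on the wait of one task behind its CPU peers.

    Assumptions (not taken from the kernel source):
    - tasks are spread evenly over the allowed CPUs;
    - a slice only ends on a tick, so it is rounded up to a tick multiple.
    """
    task = config["tasks"]["t0"]
    tasks_per_cpu = task["instance"] / len(task["cpus"])
    effective_slice_ms = math.ceil(slice_ms / tick_ms) * tick_ms
    return tasks_per_cpu * effective_slice_ms


config = json.loads(RT_APP_CONFIG)
bound = worst_case_wait_ms(config)
print(f"theoretical wait bound: {bound:.0f} ms")
```

With the values above this gives 10 tasks per CPU * 3ms = 30ms, so a measured 35ms is in the expected range, while 10s or more is not.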

Vincent

>
> Thanks!
> ---
> xuewen