Re: [QUESTION] sched/fair: EEVDF min_slice stalls in parent cgroup with a continuously running child task

From: chenjinghuang

Date: Thu Jun 18 2026 - 05:29:10 EST


On 6/17/2026 3:29 PM, Vincent Guittot wrote:
> On Wed, 17 Jun 2026 at 05:56, Chen Jinghuang <chenjinghuang2@xxxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> I observed an unexpected behavior regarding the EEVDF min_slice update
>> mechanism in a hierarchical cgroup v1 setup. It appears that the parent
>> cgroup's min_slice can become stale when a child cgroup contains a
>> continuously running task (100% load, no sleep). The parent's min_slice
>> fails to update when global or entity slices change, until a dequeue/enqueue
>> event occurs in the child cgroup.
>>
>> Here is the topology of the scenario:
>>
>> Root cfs_rq
>> |
>> cgroup A (se A, contains cpu_sim_A)
>> |
>> cgroup A1 (se A1, contains cpu_sim_A1, 100% CPU load)
>>
>> Steps to reproduce:
>
> What is the value of /sys/kernel/debug/sched/base_slice_ns before
> starting your test?
>
Before starting the test, it is at the default value of 2800000 (2.8ms).
>> 1. Create cgroup A, and a sub-cgroup A1 under A.
>> 2. Move a task (cpu_sim_A) into cgroup A, and use the syscall
>> __NR_sched_setattr to explicitly set its slice to 3ms.
>> 3. Set the global base_slice_ns to 3ms:
>> echo 3000000 > /sys/kernel/debug/sched/base_slice_ns
>> 4. Move a 100% load task (cpu_sim_A1, which never voluntarily sleeps) into
>> cgroup A1.
>> At this point, both cgroup A's min_slice and cgroup A1's min_slice are
>> observed as 0.1ms (the initialized or previous low value).
>
> This is a quite short slice value
>
Yes, it is a specific configuration used for this test scenario.
>> 5. Change the global base_slice_ns to 2.8ms:
>> echo 2800000 > /sys/kernel/debug/sched/base_slice_ns
>
> Changing the global /sys/kernel/debug/sched/base_slice_ns at runtime is
> not an expected behavior because this value is assumed to be a default
> and constant value. It could be changed at boot or something like
> that, but not during a use case.
>
>>
>> Observations:
>> - cgroup A1's min_slice correctly updates to 2.8ms.
>> - However, cgroup A's min_slice remains unchanged (stuck at 0.1ms).
>> - Even if we use syscall(__NR_sched_setattr) to explicitly set cpu_sim_A's
>> slice in cgroup A to 2.8ms, cgroup A's min_slice still does not update.
>> - The only way to force cgroup A's min_slice to update to 2.8ms is to
>> trigger a dequeue/enqueue cycle for the task in cgroup A1 (e.g., by
>> renicing it).
>
> When an entity uses the default slice value, this system-wide default
> value is not expected to change. We don't have a way to trigger an
> update of all cgroups on all CPUs, and we don't want one.
>
Oh, so this is probably why I noticed min_slice wasn't updating. Since that's
the rationale behind it, I guess we can consider the observations from the script
below as normal for now?
>>
>> It appears that a parent cgroup's min_slice is only updated when its
>> children are enqueued or dequeued. If a child task runs continuously
>> without sleeping, the parent's min_slice gets stuck and ignores any
>> changes to base_slice_ns or individual entity slices.
>
> It ignores changes to base_slice_ns but catches changes to the
> entity's custom slice.
>
Exactly. I observed that merely changing base_slice_ns does not trigger a task
enqueue/dequeue, so min_slice is left stale. Below is the complete script I
used for reproduction:

#!/bin/bash

echo "Step 1: Creating cgroup hierarchy..."
mkdir -p /sys/fs/cgroup/cpu/A
mkdir -p /sys/fs/cgroup/cpu/A/A1

echo "Step 2: Starting A's task and setting custom slice to 3ms..."
# Start a 100% full-load CPU simulation task in the background with a 2-hour runtime
./cpu_sim -l 319861 -u 100 -s 7200 &
PID_A=$!
echo $PID_A > /sys/fs/cgroup/cpu/A/cgroup.procs
# set_sched_slice is user-space code used to adjust a task's slice.
./set_sched_slice $PID_A 3000000

echo "Step 3: Run a 100% load task in cgroup A1"
./cpu_sim -l 319861 -u 100 -s 7200 &
PID_A1=$!
# Lower the global base slice to 0.1ms before the task is enqueued in A1
echo 100000 > /sys/kernel/debug/sched/base_slice_ns
# Move to cgroup A1
echo $PID_A1 > /sys/fs/cgroup/cpu/A/A1/cgroup.procs

echo "Step 4: Setting global base_slice_ns to 3ms..."
echo 3000000 > /sys/kernel/debug/sched/base_slice_ns

echo "Step 5: Setting A's task custom slice to 2.8ms..."
./set_sched_slice $PID_A 2800000
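
The source of set_sched_slice is not included in this thread. For readers who
want to try the script, the following is only a rough sketch of what such a
helper might do, assuming the slice request is passed through the sched_runtime
field of struct sched_attr for a SCHED_OTHER task, as the thread's use of
__NR_sched_setattr suggests. The names SchedAttr and build_attr, the function
body, and the syscall number table (x86_64 and aarch64 only) are assumptions of
this sketch, not the author's code.

```python
import ctypes
import os
import platform

# Assumption: syscall numbers for sched_setattr on two common architectures.
NR_SCHED_SETATTR = {"x86_64": 314, "aarch64": 274}
SCHED_OTHER = 0


class SchedAttr(ctypes.Structure):
    # Original 48-byte layout of struct sched_attr (no util clamp fields).
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
    ]


def build_attr(slice_ns, nice=0):
    # Fill in a request that keeps SCHED_OTHER and asks for slice_ns.
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_policy = SCHED_OTHER
    attr.sched_nice = nice
    attr.sched_runtime = slice_ns
    return attr


def set_sched_slice(pid, slice_ns):
    # Hypothetical stand-in for the ./set_sched_slice helper in the script.
    nr = NR_SCHED_SETATTR.get(platform.machine())
    if nr is None:
        raise OSError("no sched_setattr syscall number known for this machine")
    # Pass the current nice value back so the call does not reset it.
    nice = os.getpriority(os.PRIO_PROCESS, pid)
    attr = build_attr(slice_ns, nice)
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_int(pid),
                       ctypes.byref(attr), ctypes.c_uint(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Under these assumptions, set_sched_slice(pid, 3000000) would correspond to the
"./set_sched_slice $PID_A 3000000" line in Step 2. The kernel may clamp the
requested value, so the effective slice is not guaranteed to equal the request.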

Group A's min_slice is observed at 0.1ms and group A1's at 2.8ms. Yet, no task in
group A has a 0.1ms slice. Group A is holding onto a stale min_slice value from
group A1 without updating.
(Note: Since there is no direct user-space interface to inspect min_slice, I captured
this by using a custom kernel module to print se->min_slice in real time.)
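
The behavior discussed in this thread can be illustrated with a small toy model.
It is not kernel code. It assumes only what the thread describes: a group
entity's slice is copied from its child runqueue's minimum slice when something
is enqueued or dequeued below it, and tasks without a custom slice pick up the
current base_slice_ns on their own. The class and function names (Entity,
RunQueue, renew_slices, requeue_child, run_scenario) are made up for this sketch.

```python
class Entity:
    def __init__(self, name, slice_ns, custom=False, my_q=None):
        self.name = name
        self.slice_ns = slice_ns
        self.custom = custom   # True if the slice was set explicitly
        self.my_q = my_q       # child runqueue when this is a group entity


class RunQueue:
    def __init__(self, entities):
        self.entities = list(entities)

    def min_slice(self):
        # Smallest slice among the entities queued here.
        return min(e.slice_ns for e in self.entities)


def renew_slices(rq, base_slice_ns):
    # Tasks without a custom slice follow the current global default.
    for e in rq.entities:
        if e.my_q is None and not e.custom:
            e.slice_ns = base_slice_ns


def requeue_child(group_se):
    # Assumed rule: only an enqueue/dequeue below the group entity
    # refreshes its slice from the child runqueue.
    group_se.slice_ns = group_se.my_q.min_slice()


def run_scenario():
    # cpu_sim_A1 is enqueued in A1 while base_slice_ns is 0.1ms.
    task_a1 = Entity("cpu_sim_A1", 100_000)
    rq_a1 = RunQueue([task_a1])
    se_a1 = Entity("se A1", rq_a1.min_slice(), custom=True, my_q=rq_a1)
    task_a = Entity("cpu_sim_A", 3_000_000, custom=True)
    rq_a = RunQueue([task_a, se_a1])

    seen = {}
    renew_slices(rq_a1, 2_800_000)   # base_slice_ns changed to 2.8ms
    seen["after_base_change"] = (rq_a.min_slice(), rq_a1.min_slice())
    task_a.slice_ns = 2_800_000      # cpu_sim_A custom slice set to 2.8ms
    seen["after_setattr"] = (rq_a.min_slice(), rq_a1.min_slice())
    requeue_child(se_a1)             # e.g. cpu_sim_A1 is reniced
    seen["after_requeue"] = (rq_a.min_slice(), rq_a1.min_slice())
    return seen
```

In this model, A's minimum stays at 100000 ns after both the base change and
the setattr step, while A1's minimum is already 2800000 ns. A's minimum becomes
2800000 ns only after requeue_child() runs, which mirrors the observations listed
above.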
>>
>> Is this expected by design in EEVDF, or is there a missing update hook?
>>
>> Any insights would be greatly appreciated.
>>
>> Thanks,
>> Chen Jinghuang
>