sched: issue with scheduling on 6.13–6.19, partially improved in 7.0.0
From: Vladimir Lukianov
Date: Thu Apr 02 2026 - 06:40:48 EST
Hi, we are observing a scheduling issue on kernel 6.14 under our workload.
**Issue**
A process running in a high-weight cgroup (for example kubepod.slice) can cause
the scheduler to effectively starve tasks in lower-weight cgroups (user.slice,
system.slice, and the root cgroup), including kworkers. This makes kworkers and
tasks in user.slice/system.slice completely unresponsive on the affected CPU,
and dependent processes in turn get stuck. The kernel may emit:
BUG: workqueue lockup - pool cpus=15 node=0 flags=0x0 nice=0 stuck for 29772s!
In our case this resulted in a service outage on one occasion and in node
unavailability on another.
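The workqueue watchdog prints these splats to the kernel log, so past
occurrences can be checked after the fact; a minimal check (assuming access to
the kernel ring buffer):
```bash
# Look for past workqueue-watchdog splats in the kernel log;
# reading dmesg may require root (or kernel.dmesg_restrict=0).
dmesg 2>/dev/null | grep -i 'workqueue lockup' || echo "no workqueue lockups logged"
```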
**Environment:**
- Architecture: x86_64
- 2 or more CPUs
- reproduced on kernel version 6.13.0, 6.13.9, 6.14.8, 6.18.7, 6.19-rc7,
7.0.0-rc5/6 (not reproduced on 6.6.110, 6.10.9, 6.12.67)
- kernel configured with: CONFIG_CGROUP_SCHED=y CONFIG_GROUP_SCHED_WEIGHT=y
CONFIG_FAIR_GROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y
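To verify that a test machine is built with the same options, the running
kernel's config can be grepped (the config path below is a common distro
convention; some kernels expose /proc/config.gz instead):
```bash
# Print the scheduler-related config options of the running kernel.
cfg=/boot/config-$(uname -r)
if [ -r "$cfg" ]; then
    grep -E 'CONFIG_(CGROUP_SCHED|GROUP_SCHED_WEIGHT|FAIR_GROUP_SCHED|CFS_BANDWIDTH)=' "$cfg"
elif [ -r /proc/config.gz ]; then
    zcat /proc/config.gz | grep -E 'CONFIG_(CGROUP_SCHED|GROUP_SCHED_WEIGHT|FAIR_GROUP_SCHED|CFS_BANDWIDTH)='
else
    echo "kernel config not found"
fi
```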
**Reproduction script**
Briefly, the script simulates a situation where there is a target CPU with a
CPU-bound task running on it. Then, to cause starvation on this CPU, we take
the following steps:
- create a cgroup with a high weight
- add a CPU-bound process belonging to that cgroup on the target CPU
- add "enough" processes to the cgroup that frequently enqueue/dequeue and are
scheduled on other CPUs
We then measure the task-scheduling delay with PLACE_LAG and with NO_PLACE_LAG
(which basically correspond to two scheduling strategies).
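PLACE_LAG/NO_PLACE_LAG are the sched_feat flags toggled through debugfs later
in the script; the current state can be inspected first (requires root and a
mounted debugfs):
```bash
# Show whether PLACE_LAG is currently enabled; enabled features are listed
# bare, disabled ones carry a NO_ prefix.
f=/sys/kernel/debug/sched/features
if [ -r "$f" ]; then
    tr ' ' '\n' < "$f" | grep 'PLACE_LAG$'
else
    echo "cannot read $f (run as root with debugfs mounted)"
fi
```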
In this scenario we may observe a situation where even a simple echo hi
experiences delays in being enqueued on the target CPU, and another CPU-bound
process that was running on the target CPU but belongs to
user.slice/system.slice/the root cgroup is starved as well.
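The starvation can also be quantified per task: /proc/&lt;pid&gt;/schedstat exposes
three cumulative counters - on-CPU time (ns), runqueue wait time (ns), and
timeslice count (needs CONFIG_SCHED_INFO, on by default in most distros). For
a starved task, the wait counter keeps growing while the timeslice count stays
flat. A minimal reader, using the shell's own PID as a stand-in:
```bash
# Dump the per-task scheduler counters for one PID (here: this shell).
pid=$$
if [ -r /proc/$pid/schedstat ]; then
    read -r oncpu rq_wait slices < /proc/$pid/schedstat
    echo "pid=$pid on_cpu=${oncpu}ns rq_wait=${rq_wait}ns timeslices=$slices"
else
    echo "/proc/$pid/schedstat not available (CONFIG_SCHED_INFO disabled?)"
fi
```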
You may use bpftrace to see that only one task is scheduled continuously
(replace CPU_NUM with the target CPU number):
```bash
bpftrace -e 'kretprobe:pick_next_task_fair / cpu == CPU_NUM / {
$task = (struct task_struct*)retval; @tasks[$task->pid]++; } interval:s:1 {
print(@tasks); clear(@tasks); printf("\n"); }'
```
The script should be launched as root. You may specify a different number of
spinners and the target CPU, and you may need to launch it a few times to
catch the issue. The script itself:
```bash
spinners=${1:-20}
cpu_num=${2:-13}
# Cleanup from a previous test run (ignore the error on the first run)
echo 1 > /sys/fs/cgroup/testgroup/cgroup.kill 2>/dev/null
# Launch a CPU-bound process on the target CPU in the root cgroup
taskset -c "$cpu_num" bash -c 'echo $$ > /sys/fs/cgroup/cgroup.procs; while true; do :; done &'
#
# Create a cgroup with high weight containing a cpu-bound process and a bunch of
# "io-bound" processes/threads (which will often go through place_entity) in the
# _same_ cgroup - this is what affects the scheduler.
# Note: randomness in place_entity timing matters (that is why RANDOM is used)
#
mkdir -p /sys/fs/cgroup/testgroup
echo 2000 > /sys/fs/cgroup/testgroup/cpu.weight
taskset -c "$cpu_num" bash -c 'echo $$ > /sys/fs/cgroup/testgroup/cgroup.procs; while true; do :; done &'
bash -c '
echo $$ > /sys/fs/cgroup/testgroup/cgroup.procs
for i in $(seq 1 "$1"); do
echo Spinner $i
while true; do sleep 1; done &
sleep "0.$((RANDOM % 1000 + 1))"
done
' _ "$spinners"
just_echo_hi() {
echo Trying to echo hi on CPU#"$cpu_num"
for i in {1..4}; do
sleep "0.$((RANDOM % 1000 + 1))"
echo -n "Run #$i : "
/usr/bin/time -f "%E" taskset -c "$cpu_num" echo hi 2>&1 >/dev/null
done
}
echo 'Now: the created cgroup, with a spinner on the target CPU and a bunch of'
echo '"io-bound" processes on other CPUs, will affect the scheduler'
echo WITH NO_PLACE_LAG
echo NO_PLACE_LAG > /sys/kernel/debug/sched/features
just_echo_hi
echo WITH PLACE_LAG\(DEFAULT STRATEGY\)
echo PLACE_LAG > /sys/kernel/debug/sched/features
just_echo_hi
```
**Test on different kernels**
On kernel 6.6
--------------------
WITH NO_PLACE_LAG
Trying to echo hi on CPU#13
Run #1 : 0:00.04
Run #2 : 0:00.06
Run #3 : 0:00.02
Run #4 : 0:00.02
WITH PLACE_LAG(DEFAULT STRATEGY)
Trying to echo hi on CPU#13
Run #1 : 0:00.01
Run #2 : 0:00.03
Run #3 : 0:00.02
Run #4 : 0:00.00
On kernel 6.14
--------------------
WITH NO_PLACE_LAG
Trying to echo hi on CPU#13
Run #1 : 0:00.00
Run #2 : 0:00.01
Run #3 : 0:00.02
Run #4 : 0:00.00
WITH PLACE_LAG(DEFAULT STRATEGY)
Trying to echo hi on CPU#13
Run #1 : 20:25.32
Run #2 : 1:12.65
Run #3 : 0:17.50
Run #4 : 23:35.19
On kernel 7.0.0-rc5
--------------------
WITH NO_PLACE_LAG
Trying to echo hi on CPU#13
Run #1 : 0:00.01
Run #2 : 0:00.01
Run #3 : 0:00.01
Run #4 : 0:00.05
WITH PLACE_LAG(DEFAULT STRATEGY)
Trying to echo hi on CPU#13
Run #1 : 0:09.51
Run #2 : 0:14.57
Run #3 : 1:45.80
Run #4 : 0:00.12
On kernel 7.0.0-rc6
--------------------
WITH NO_PLACE_LAG
Trying to echo hi on CPU#13
Run #1 : 0:00.03
Run #2 : 0:00.03
Run #3 : 0:00.02
Run #4 : 0:00.02
WITH PLACE_LAG(DEFAULT STRATEGY)
Trying to echo hi on CPU#13
Run #1 : 0:00.15
Run #2 : 0:00.67
Run #3 : 2:28.35
Run #4 : 1:02.87
So on kernel 6.14 (and up to 6.19), the delays (in fact, the time during which
one task consumes most of the CPU resources) are quite high - up to tens of
minutes (we observed 1 hour and more). Fixes in 7.0.0 make the main scheduling
strategy more "fair", but the issue is not fixed completely: the starvation may
still last up to a minute or a few minutes.
Regards, Vladimir