Re: [bug report] WARNING: CPU: 0 PID: 49573 at kernel/sched/rt.c:802 rq_offline_rt+0x24d/0x260
From: zhaowenhui (A)
Date: Mon May 13 2024 - 10:10:15 EST
On 2024/4/7 22:06, zhaowenhui (A) wrote:
More information:
1. RT_RUNTIME_SHARE is enabled.
2. We continuously create and remove cpu cgroups. We use cgexec to run
tasks such as "tree" or "ps" in these cgroups, and rt_runtime_us in
these cgroups is set to 2000~6000.
3. CPUs are frequently taken offline and brought back online, which
triggers __disable_runtime.
Whenever we run these operations after a reboot, the warning is
easily reproduced.
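Before starting the reproducer it may help to confirm that condition 1 really holds. The sketch below is not part of the original scripts: the helper name feature_enabled is made up for illustration, and it relies on the scheduler features file listing disabled features with a NO_ prefix. It tries both debugfs paths used in start_repro.sh, since which one exists depends on the kernel version.

```shell
#!/bin/bash
# Hypothetical helper (not from the original report): succeeds if feature $1
# appears as an enabled entry in the space-separated feature list $2.
# Disabled features show up with a NO_ prefix, so an exact match is required.
feature_enabled() {
    local feature=$1 list=$2 f
    for f in $list; do
        [ "$f" = "$feature" ] && return 0
    done
    return 1
}

# Try both debugfs locations; only one is expected to exist on a given kernel.
for path in /sys/kernel/debug/sched/features /sys/kernel/debug/sched_features; do
    if [ -r "$path" ]; then
        if feature_enabled RT_RUNTIME_SHARE "$(cat "$path")"; then
            echo "RT_RUNTIME_SHARE is enabled ($path)"
        else
            echo "RT_RUNTIME_SHARE is not enabled ($path)"
        fi
    fi
done
```

Reading these files normally requires root and a mounted debugfs; if neither path is readable the loop prints nothing.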
--------reproduce shell scripts----------
[1] on_off.sh
#!/bin/bash
let cpu_end=8
while true
do
    for i in `seq 0 $cpu_end`
    do
        echo $i
        let j=i+1
        echo 0 > /sys/devices/system/cpu/cpu${i}/online
        echo 0 > /sys/devices/system/cpu/cpu${j}/online
        sleep 0.5
        echo 1 > /sys/devices/system/cpu/cpu${i}/online
        echo 1 > /sys/devices/system/cpu/cpu${j}/online
    done
done
[2] rt_test.sh
#!/bin/bash
# set -x
name=test$$
cpu_dir=/sys/fs/cgroup/cpu/$name
mkdir $cpu_dir
echo $rt_runtime_us > $cpu_dir/cpu.rt_runtime_us &>/dev/null
function cg_change_rt_runtime_us()
{
    while true
    do
        let tmp=2+RANDOM%4
        let quota=tmp*1000
        echo $quota > $cpu_dir/cpu.rt_runtime_us
    done
}
function rt_test1()
{
    while true
    do
        cgexec -g cpu:/${name} chrt 20 tree /sys/fs/cgroup/cpu &>/dev/null
    done
}
function rt_test2()
{
    while true
    do
        cgexec -g cpu:/${name} chrt 30 ps aux &>/dev/null
    done
}
rt_test1 &
rt_test2 &
cg_change_rt_runtime_us &> /dev/null &
[3] start_repro.sh
function set_env(){
    echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched/features
    echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
}
function run_test(){
    for i in 0 1
    do
        nohup sh rt_test.sh &
        nohup sh rt_test.sh &
        nohup sh rt_test.sh &
        nohup sh rt_test.sh &
    done
    nohup sh on_off.sh &>/dev/null &
}
set_env
run_test
--------reproduce shell scripts----------
Running start_repro.sh easily reproduces this issue.
While debugging in __disable_runtime, I found that the rtrq has
borrowed some runtime, since rtrq->runtime is greater than
rtb->rt_runtime, but its span contains only one CPU, which is itself.
[ 43.686441] [__disable_runtime] begin, rq: ff3113087e5b4ec0 , cpu: 159
[ 43.686449] [__disable_runtime] rtrq: ff3112c9e9259800 , rtrq->runtime :3037500
[ 43.686450] [__disable_runtime] rtb->period: 1000000000 , rtb->rt_runtime: 3000000
[ 43.686451] [__disable_runtime] want original : -37500
[ 43.686451] [__disable_runtime] after get want, rtrq: ff3112c9e9259800 , rtrq->runtime :3037500
[ 43.686453] [ERROR!] span have cpu: 159
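The "want original" value in the trace is consistent with rtb->rt_runtime minus rtrq->runtime, which is how I read the debug output. A minimal sketch of that arithmetic with the values copied from the log follows; the shell variable names are only illustrative and do not come from the kernel source.

```shell
#!/bin/bash
# Values copied from the debug output above (nanoseconds).
rt_b_runtime=3000000      # rtb->rt_runtime
rt_rq_runtime=3037500     # rtrq->runtime

# A negative result means the rt_rq currently holds more runtime than
# its bandwidth allots, i.e. it appears to have borrowed 37500 ns.
want=$(( rt_b_runtime - rt_rq_runtime ))
echo "want=${want}"
```

The surplus is still present in rtrq->runtime after the "get want" step, which fits a span containing no other CPU to account for the borrowed time.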
Do you have any idea about this? 😄
---
Regards
Zhao Wenhui