[PATCH 5.10.y] sched/fair: Fix task starvation caused by incorrect vruntime_normalized check

From: hu.shengming

Date: Wed Mar 11 2026 - 05:52:34 EST


From: luohaiyang10243395 <luo.haiyang@xxxxxxxxxx>

When a previous update_rq_clock() happened inside a {soft,}irq
region, we stop ->clock_task and only update the prev_irq_time
stamp, which means clock_task remains unchanged for a period of
time. As a result, we may observe a task that has actually been
running but whose sum_exec_runtime == 0, as confirmed by ftrace:

<...>-3615452 [040] d... 1081329.791592: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9387a clock=0x3d776b17204b0 prev_irq_time=0x45480b8cc36
<...>-3615452 [040] ..s. 1081329.791596: softirq_entry: vec=3 [action=NET_RX]
<...>-3615452 [040] d.s. 1081329.791619: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b943a2 clock=0x3d776b1720fd8 prev_irq_time=0x45480b8cc36
<...>-3615452 [040] ..s. 1081329.791631: softirq_exit: vec=3 [action=NET_RX]
<...>-3615452 [040] ..s. 1081329.791631: softirq_entry: vec=6 [action=TASKLET]
<...>-3615452 [040] ..s. 1081329.791632: softirq_exit: vec=6 [action=TASKLET]
<...>-3615452 [040] d... 1081329.791633: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b17278ec prev_irq_time=0x45480b8d065
<...>-3615452 [040] d... 1081329.791637: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172b0cf prev_irq_time=0x45480b90848
<...>-3615452 [040] d... 1081329.791639: sched_switch: prev_comm=futex prev_pid=3615452 prev_prio=120 prev_state=S ==> next_comm=futex next_pid=3615454 next_prio=120
<...>-3615454 [040] d... 1081329.791643: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172c1fd prev_irq_time=0x45480b9197
<...>-3615454 [040] d... 1081329.791644: sched_switch: prev_comm=futex prev_pid=3615454 prev_prio=120 prev_state=S ==> next_comm=sched_yield next_pid=2439180 next_prio=12
sched_yield-2439180 [040] d... 1081329.791645: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172d7b3 prev_irq_time=0x45480b92f2c

In our production environment, we have two tasks bound to one CPU:
nginxA:
int main()
{
	pthread_mutex_lock(&mutex);
	do_something();
	pthread_mutex_unlock(&mutex);
}

nginxB:
int main()
{
	while (nginxA has not exited)	/* pseudocode */
		sched_yield();
}

nginxA starved due to the following sequence of events:
1. Another task forks two tasks, nginxA and nginxB. Since the system has been
running for a long time, the tasks' vruntime is a very large value.
2. nginxA immediately goes to sleep due to lock contention, but its
sum_exec_runtime == 0.
3. nginxA and nginxB are attached to a new cgroup, but nginxA's vruntime is
not adjusted to the new cfs_rq because sum_exec_runtime == 0 makes
vruntime_normalized() return true.
4. The new cfs_rq's min_vruntime overflows quickly as nginxB keeps running.
When nginxA is woken up by the scheduler, nginxA's vruntime >> nginxB's
vruntime, so nginxA cannot get the CPU for a long period of time.

Signed-off-by: Luo Haiyang <luo.haiyang@xxxxxxxxxx>
Tested-by: Lu Zhongjun <lu.zhongjun@xxxxxxxxxx>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c11d59bea0ea..e82806a48ef3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11073,7 +11073,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)

static inline bool vruntime_normalized(struct task_struct *p)
{
- struct sched_entity *se = &p->se;
+ unsigned long nr_switches = p->nvcsw + p->nivcsw;

/*
* In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
@@ -11092,7 +11092,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
* - A task which has been woken up by try_to_wake_up() and
* waiting for actually being woken up by sched_ttwu_pending().
*/
- if (!se->sum_exec_runtime ||
+ if (!nr_switches ||
(p->state == TASK_WAKING && p->sched_remote_wakeup))
return true;

--
2.25.1