[RFC PATCH] sched/rt: skip RT bandwidth accounting for unobserved CPU stalls

From: Imran Khan

Date: Thu May 07 2026 - 09:02:06 EST


After a CPU stall that the guest scheduler did not observe (for example, a
KVM live migration where stop_and_copy takes long), the next update_curr_rt()
charges a delta_exec equal to the entire stall to the current RT task and
also to rt_rq::rt_time. With the default sched_rt_runtime_us=950000 and
sched_rt_period_us=1000000, even a few seconds of stall can set rt_throttled,
dequeue the current RT task and keep it off the CPU for multiple seconds.
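
As a rough illustration of the arithmetic (a simplified userspace model, not
kernel code): assume the whole stall lands in rt_rq::rt_time at once, that
each period timer tick subtracts at most one rt_runtime, and that the rt_rq
is unthrottled only once rt_time drops below rt_runtime. Runtime borrowing
between CPUs is ignored, and periods_until_unthrottled() is a made-up helper
name.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_USEC = 1_000


def periods_until_unthrottled(stall_ns, rt_runtime_us=950_000):
    """Count period ticks needed before rt_time falls below rt_runtime.

    Simplified model: the full stall is charged to rt_time in one go,
    and every period tick forgives at most one rt_runtime worth of it.
    """
    runtime_ns = rt_runtime_us * NSEC_PER_USEC
    rt_time_ns = stall_ns
    periods = 0
    while rt_time_ns >= runtime_ns:
        rt_time_ns -= min(rt_time_ns, runtime_ns)
        periods += 1
    return periods


if __name__ == "__main__":
    for stall_s in (1, 5, 10):
        n = periods_until_unthrottled(stall_s * NSEC_PER_SEC)
        print(f"stall of {stall_s}s -> throttled for about {n} period(s)")
```

Under these assumptions the throttled time grows roughly linearly with the
length of the unobserved stall.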

For example, the following snippet shows one such instance, where pid 30274
was the current task on CPU 45 during live migration. After live migration
it got preempted and has been on the runq for the last ~10 secs. The CPU is
idle, but the RT task can't get on it because the rt_runtime overrun has not
been compensated yet:

crash> runq -c 45
CPU 45 RUNQUEUE: ff1c8cb63d972840
CURRENT: PID: 0 TASK: ff1c8c77c6c7a080 COMMAND: "swapper/45"
RT PRIO_ARRAY: ff1c8cb63d972ac0
[ 0] PID: 30274 TASK: ff1c8c7d9aad4100 COMMAND: "NMSending"
[ 0] PID: 30791 TASK: ff1c8c7c2098a080 COMMAND: "cssdagent"

>>> per_cpu(prog["runqueues"], 45).clock_task.value_()
10537385941842
>>> per_cpu(prog["runqueues"], 45).rt.rt_time.value_()
6571872703

>>> per_cpu(prog["runqueues"], 45).clock_task.value_() - \
find_task(30274).se.exec_start.value_()
10537394410
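
To put the raw nanosecond values above into perspective, the short sketch
below converts them to seconds and estimates how many more periods the rt_rq
would stay throttled. It assumes the default sched_rt_runtime_us=950000 and
that each period forgives one rt_runtime; the variable names are mine.

```python
NSEC_PER_SEC = 1_000_000_000

# Values copied from the vmcore output above.
rt_time_ns = 6_571_872_703                   # rq->rt.rt_time
elapsed_since_exec_start_ns = 10_537_394_410  # clock_task - se.exec_start

# Assumed default: sched_rt_runtime_us=950000, expressed in ns.
runtime_ns = 950_000 * 1_000

rt_time_s = rt_time_ns / NSEC_PER_SEC
elapsed_s = elapsed_since_exec_start_ns / NSEC_PER_SEC

# Smallest k such that rt_time - k * runtime < runtime.
periods_left = rt_time_ns // runtime_ns

print(f"rt_time still to be compensated: {rt_time_s:.2f}s")
print(f"time since exec_start was last updated: {elapsed_s:.2f}s")
print(f"periods left before unthrottle (estimate): {periods_left}")
```

So roughly 6.6 seconds of rt_time are still outstanding in this vmcore, on
top of the ~10 seconds the task has already waited.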

This snippet is from a system running a v5.15.y kernel. As of now I don't
have a vmcore from current upstream tip, but I could reproduce a similar time
jump on current tip as well.

This change resets delta_exec to zero upon detecting a guest pause, or when
delta_exec exceeds one sched_rt_period, and hence prevents exorbitant jumps
in rt_rq::rt_time.
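
The decision made by the new check can be modelled in userspace as follows.
This is only a sketch of the condition in the diff below:
charged_delta_exec() is a hypothetical name, and guest_paused stands in for
the return value of kvm_check_and_clear_guest_paused().

```python
NSEC_PER_USEC = 1_000


def charged_delta_exec(delta_exec_ns, guest_paused, rt_period_us=1_000_000):
    """Return the delta_exec that would be charged to rt_rq::rt_time.

    Mirrors the patch's condition: forgive the charge if a guest pause
    was flagged, or if delta_exec is larger than one RT period.
    """
    period_ns = rt_period_us * NSEC_PER_USEC
    if guest_paused or delta_exec_ns > period_ns:
        return 0
    return delta_exec_ns


if __name__ == "__main__":
    # A normal 4 ms slice is charged in full.
    print(charged_delta_exec(4_000_000, guest_paused=False))
    # A 5 s jump is forgiven even if the paused flag was already consumed.
    print(charged_delta_exec(5_000_000_000, guest_paused=False))
    # Any delta is forgiven when the pause flag is seen.
    print(charged_delta_exec(4_000_000, guest_paused=True))
```

Note that in this model the magnitude check alone would also forgive a
genuine delta_exec longer than one period, which is part of the trade-off
this RFC is asking about.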

Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
---
kernel/sched/rt.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

I have kept the patch as an RFC because I am not sure whether this should
instead be fixed on the KVM side.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f69e1f16d9238..e8d83080c3842 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -7,6 +7,8 @@
#include "sched.h"
#include "pelt.h"

+#include <linux/kvm_para.h>
+
int sched_rr_timeslice = RR_TIMESLICE;
/* More than 4 hours if BW_SHIFT equals 20. */
static const u64 max_rt_runtime = MAX_BW;
@@ -989,6 +991,18 @@ static void update_curr_rt(struct rq *rq)
if (!rt_bandwidth_enabled())
return;

+ /*
+ * Forgive RT bandwidth charged across an unobserved CPU stall
+ * like KVM live-migration stop_and_copy.
+ *
+ * The magnitude check covers the race where the local softlockup
+ * hrtimer consumed the PVCLOCK_GUEST_STOPPED bit before this
+ * update_curr_rt() call.
+ */
+ if (kvm_check_and_clear_guest_paused() ||
+ unlikely(delta_exec > (u64)sysctl_sched_rt_period * NSEC_PER_USEC))
+ delta_exec = 0;
+
for_each_sched_rt_entity(rt_se) {
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
int exceeded;

base-commit: 591cd656a1bf5ea94a222af5ef2ee76df029c1d2
--
2.34.1