[PATCH] tools/sched_ext: scx_flatcg: Fix inverted vtime delta accounting

From: luoliang

Date: Wed Jun 17 2026 - 02:51:07 EST


From: Liang Luo <luoliang@xxxxxxxxxx>

In try_pick_next_cgroup(), a cgroup is charged its full slice upfront.
fcg_dispatch() is then supposed to refund the unused time by adding
(expected_end - now) * FCG_HWEIGHT_ONE / hweight to cvtime_delta:

__sync_fetch_and_add(&cgc->cvtime_delta,
(cpuc->cur_at + cgrp_slice_ns - now) *
FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));

However, if a cgroup yields early (expected_end > now), this adds a
positive value to cvtime_delta, which ultimately increases cvtime,
heavily penalizing the cgroup instead of refunding it.

Conversely, if a cgroup overuses its slice (now > expected_end), the
unsigned subtraction underflows to a massive u64. When multiplied by
FCG_HWEIGHT_ONE and divided by hweight via unsigned division, it
yields an enormous positive number. Adding this huge value to
cvtime_delta pushes the cgroup's virtual time roughly 3.25 days into
the future, permanently starving it.

cgrp_cap_budget() later applies cvtime_delta to cvtime reading it as
u64. Once dispatch can produce negative (refund) deltas, reading them
as u64 turns every refund into a huge positive value with the same
starving effect, so read and apply the delta as s64 and clamp the
result at zero.

Order the time delta as (now - expected_end) so overruns are positive
and early yields negative, and split the magnitude scaling by sign
since BPF has no signed division.

Observed on a nested cgroup hierarchy under mixed CPU load: cvtime
grew roughly 300x faster than wall-clock time, cvtime_delta reached
2^47..2^49, and tasks in cgroups stalled. After the change cvtime
advances at the expected weight-scaled rate and no anomalous deltas
appear.

Fixes: a4103eacc2ab ("sched_ext: Add a cgroup scheduler which uses flattened hierarchy")
Signed-off-by: Liang Luo <luoliang@xxxxxxxxxx>
---
tools/sched_ext/scx_flatcg.bpf.c | 37 ++++++++++++++++++++++++++------
1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index fec359581826..a37f5b876da8 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -249,15 +249,26 @@ static void cgrp_refresh_hweight(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc)

static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
{
- u64 delta, cvtime, max_budget;
+ s64 delta;
+ u64 cvtime, max_budget;

/*
* A node which is on the rbtree can't be pointed to from elsewhere yet
* and thus can't be updated and repositioned. Instead, we collect the
* vtime deltas separately and apply it asynchronously here.
*/
- delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
- cvtime = cgv_node->cvtime + delta;
+ delta = (s64)__sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
+ /*
+ * cvtime_delta may be negative (early-yield refund from fcg_dispatch()).
+ * Apply it in the signed domain and clamp the resulting cvtime at zero
+ * so that a refund larger than the accumulated cvtime doesn't roll the
+ * field negative.
+ */
+ s64 vtime = (s64)cgv_node->cvtime + delta;
+
+ if (vtime < 0)
+ vtime = 0;
+ cvtime = (u64)vtime;

/*
* Allow a cgroup to carry the maximum budget proportional to its
@@ -767,11 +778,25 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
* We want to update the vtime delta and then look for the next
* cgroup to execute but the latter needs to be done in a loop
* and we can't keep the lock held. Oh well...
+ *
+ * Charge the overrun or refund the unused time. The original
+ * (cur_at + slice - now) was inverted: early yields were
+ * penalized and overruns underflowed. Use (now - (cur_at +
+ * slice)) so overruns are positive and early yields negative;
+ * split the magnitude scaling by sign since BPF has no signed
+ * division.
*/
bpf_spin_lock(&cgv_tree_lock);
- __sync_fetch_and_add(&cgc->cvtime_delta,
- (cpuc->cur_at + cgrp_slice_ns - now) *
- FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
+ s64 excess = (s64)now - (s64)(cpuc->cur_at + cgrp_slice_ns);
+ s64 delta;
+
+ if (excess > 0)
+ delta = (s64)((u64)excess * FCG_HWEIGHT_ONE /
+ (cgc->hweight ?: 1));
+ else
+ delta = -(s64)((u64)-excess * FCG_HWEIGHT_ONE /
+ (cgc->hweight ?: 1));
+ __sync_fetch_and_add(&cgc->cvtime_delta, delta);
bpf_spin_unlock(&cgv_tree_lock);
} else {
stat_inc(FCG_STAT_CNS_GONE);
--
2.43.0