Re: [PATCH] tools/sched_ext: scx_flatcg: Fix inverted vtime delta accounting
From: Andrea Righi
Date: Wed Jun 17 2026 - 04:00:34 EST
Hi Liang,
On Wed, Jun 17, 2026 at 02:48:05PM +0800, luoliang@xxxxxxxxxx wrote:
> From: Liang Luo <luoliang@xxxxxxxxxx>
>
> In try_pick_next_cgroup(), a cgroup is charged its full slice upfront.
> fcg_dispatch() is then supposed to refund the unused time by adding
> (expected_end - now) * FCG_HWEIGHT_ONE / hweight to cvtime_delta:
>
> __sync_fetch_and_add(&cgc->cvtime_delta,
> (cpuc->cur_at + cgrp_slice_ns - now) *
> FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
>
> However, if a cgroup yields early (expected_end > now), this adds a
> positive value to cvtime_delta, which ultimately increases cvtime,
> heavily penalizing the cgroup instead of refunding it.
>
> Conversely, if a cgroup overuses its slice (now > expected_end), the
> unsigned subtraction underflows to a massive u64. When multiplied by
> FCG_HWEIGHT_ONE and divided by hweight via unsigned division, it
> yields an enormous positive number. Adding this huge value to
> cvtime_delta pushes the cgroup's virtual time roughly 3.25 days into
> the future, permanently starving it.
>
> cgrp_cap_budget() later applies cvtime_delta to cvtime reading it as
> u64. Once dispatch can produce negative (refund) deltas, reading them
> as u64 turns every refund into a huge positive value with the same
> starving effect, so read and apply the delta as s64 and clamp the
> result at zero.
>
> Order the time delta as (now - expected_end) so overruns are positive
> and early yields negative, and split the magnitude scaling by sign
> since BPF has no signed division.
>
> Observed on a nested cgroup hierarchy under mixed CPU load: cvtime
> grew roughly 300x faster than wall-clock time, cvtime_delta reached
> 2^47..2^49, and tasks in cgroups stalled. After the change cvtime
> advances at the expected weight-scaled rate and no anomalous deltas
> appear.
>
> Fixes: a4103eacc2ab ("sched_ext: Add a cgroup scheduler which uses flattened hierarchy")
> Signed-off-by: Liang Luo <luoliang@xxxxxxxxxx>
> ---
> tools/sched_ext/scx_flatcg.bpf.c | 37 ++++++++++++++++++++++++++------
> 1 file changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
> index fec359581826..a37f5b876da8 100644
> --- a/tools/sched_ext/scx_flatcg.bpf.c
> +++ b/tools/sched_ext/scx_flatcg.bpf.c
> @@ -249,15 +249,26 @@ static void cgrp_refresh_hweight(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc)
>
> static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
> {
> - u64 delta, cvtime, max_budget;
> + s64 delta;
> + u64 cvtime, max_budget;
>
> /*
> * A node which is on the rbtree can't be pointed to from elsewhere yet
> * and thus can't be updated and repositioned. Instead, we collect the
> * vtime deltas separately and apply it asynchronously here.
> */
> - delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
> - cvtime = cgv_node->cvtime + delta;
> + delta = (s64)__sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta);
> + /*
> + * cvtime_delta may be negative (early-yield refund from fcg_dispatch()).
> + * Apply it in the signed domain and clamp the resulting cvtime at zero
> + * so that a refund larger than the accumulated cvtime doesn't roll the
> + * field negative.
> + */
> + s64 vtime = (s64)cgv_node->cvtime + delta;
> +
> + if (vtime < 0)
> + vtime = 0;
> + cvtime = (u64)vtime;
>
> /*
> * Allow a cgroup to carry the maximum budget proportional to its
> @@ -767,11 +778,25 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
> * We want to update the vtime delta and then look for the next
> * cgroup to execute but the latter needs to be done in a loop
> * and we can't keep the lock held. Oh well...
> + *
> + * Charge the overrun or refund the unused time. The original
> + * (cur_at + slice - now) was inverted: early yields were
> + * penalized and overruns underflowed. Use (now - (cur_at +
> + * slice)) so overruns are positive and early yields negative;
> + * split the magnitude scaling by sign since BPF has no signed
> + * division.
> */
> bpf_spin_lock(&cgv_tree_lock);
> - __sync_fetch_and_add(&cgc->cvtime_delta,
> - (cpuc->cur_at + cgrp_slice_ns - now) *
> - FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
> + s64 excess = (s64)now - (s64)(cpuc->cur_at + cgrp_slice_ns);
Maybe we can do:
u64 expected_end = cpuc->cur_at + cgrp_slice_ns;
s64 excess = (s64)(now - expected_end);
using the typical kernel time-comparison idiom.
Other than that, looks good to me.
Reviewed-by: Andrea Righi <arighi@xxxxxxxxxx>
Thanks,
-Andrea
> + s64 delta;
> +
> + if (excess > 0)
> + delta = (s64)((u64)excess * FCG_HWEIGHT_ONE /
> + (cgc->hweight ?: 1));
> + else
> + delta = -(s64)((u64)-excess * FCG_HWEIGHT_ONE /
> + (cgc->hweight ?: 1));
> + __sync_fetch_and_add(&cgc->cvtime_delta, delta);
> bpf_spin_unlock(&cgv_tree_lock);
> } else {
> stat_inc(FCG_STAT_CNS_GONE);
> --
> 2.43.0
>