Re: [PATCH 03/15] sched/fair: Add lag based placement

From: Peter Zijlstra
Date: Fri Feb 07 2025 - 06:12:15 EST


On Fri, Feb 07, 2025 at 02:07:18AM -0800, Breno Leitao wrote:
> Hello Peter,
>
> On Wed, May 31, 2023 at 01:58:42PM +0200, Peter Zijlstra wrote:
> >
> > place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > {
> <snip>
> > - vruntime -= thresh;
> > + lag *= load + se->load.weight;
> > + if (WARN_ON_ONCE(!load))
>
> I have 6.13 running on some hosts, and in some cases, where the system
> is getting some OOMs, I see the following stack:
>
> WARNING: CPU: 29 PID: 593474 at kernel/sched/fair.c:5250 place_entity+0x199/0x1b0
>
> Call Trace:
> <TASK>
> ? place_entity+0x199/0x1b0
> reweight_entity+0x188/0x200
> enqueue_task_fair.llvm.15448040313737105663+0x28c/0x560
> enqueue_task+0x30/0x120
> ttwu_do_activate+0x99/0x230
> try_to_wake_up+0x25a/0x4a0
> ? hrtimer_dummy_timeout+0x10/0x10
> hrtimer_wakeup+0x25/0x30
> __hrtimer_run_queues+0xf1/0x250
> hrtimer_interrupt+0xfb/0x220
> __sysvec_apic_timer_interrupt+0x47/0x140
> sysvec_apic_timer_interrupt+0x35/0x80
> asm_sysvec_apic_timer_interrupt+0x16/0x20
>
> I am sorry for not decoding the stack, but I am having a hard time
> decoding the stack properly. The values I got were misleading, and I am
> working to understand what is happening.
>
> Anyway, I don't have a reproducer and this problem doesn't happen
> frequently enough. I have 1K hosts with 6.13 and I saw it 5 times in the
> last week.

Weird. Would you mind trying with the below patch on top?

---
Subject: sched/fair: Adhere to place_entity() constraints
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Tue, 28 Jan 2025 15:39:49 +0100

Mike reports that commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") relies on commit 4423af84b297
("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not
trip a WARN in place_entity().

What happens is that the lag of the very last entity is 0 per
definition -- the average of one element matches the value of that
element. Therefore place_entity() will match the condition skipping
the lag adjustment:

if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {

Without the 'se->vlag' condition -- it will attempt to adjust the zero
lag even though we're inserting into an empty tree.
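
To make the "average of one element" point concrete, here is a minimal
sketch in plain userspace C. It is not the kernel's implementation: the
names toy_entity, toy_avg_vruntime() and toy_lag() are hypothetical, and
the real code tracks vruntime relative to a reference point rather than
summing absolute values. It only shows that the weighted average
V = sum(w_i * v_i) / sum(w_i) collapses to v_0 when there is a single
entity, so its lag V - v_0 is zero.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for a scheduling entity: only the
 * two fields that matter for the weighted average are kept.
 */
struct toy_entity {
	int64_t vruntime;
	int64_t weight;
};

/*
 * Weighted average vruntime: V = sum(w_i * v_i) / sum(w_i).
 * Assumes n >= 1 and all weights > 0.
 */
static int64_t toy_avg_vruntime(const struct toy_entity *e, int n)
{
	int64_t sum = 0, load = 0;

	for (int i = 0; i < n; i++) {
		sum += e[i].weight * e[i].vruntime;
		load += e[i].weight;
	}
	return sum / load;
}

/* Lag of entity i relative to the average: lag_i = V - v_i. */
static int64_t toy_lag(const struct toy_entity *e, int n, int i)
{
	return toy_avg_vruntime(e, n) - e[i].vruntime;
}
```

Calling toy_lag() on a one-element array returns 0 regardless of the
vruntime or weight chosen, which is the case described above.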

Notably, we should have failed the 'cfs_rq->nr_queued' condition, but
don't because it didn't get updated.
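
The stale counter can be illustrated with a toy model, again in
userspace C and under loud assumptions: toy_rq, toy_would_adjust_lag()
and toy_reweight_trips_warn() are hypothetical names, the guard is
reduced to the nr_queued test alone (no PLACE_LAG feature check, no
se->vlag test), "zero load" stands in for the WARN condition, and the
actual weight change done by a reweight is elided. It is a sketch of
the ordering problem, not of reweight_entity() itself.

```c
#include <stdbool.h>

/* Hypothetical, simplified run queue: a counter and the tree's load. */
struct toy_rq {
	unsigned int nr_queued;   /* entities accounted on the queue */
	unsigned long tree_load;  /* summed weight of entities in the tree */
};

/* Reduced guard: the lag adjustment runs whenever nr_queued is non-zero. */
static bool toy_would_adjust_lag(const struct toy_rq *rq)
{
	return rq->nr_queued != 0;
}

/*
 * Model a dequeue / place / enqueue cycle for one entity of the given
 * weight. Returns true if placement would attempt the lag adjustment
 * while the tree load is zero, i.e. the case the WARN guards against.
 * With fix_counter set, nr_queued is updated around the placement.
 */
static bool toy_reweight_trips_warn(struct toy_rq *rq, unsigned long weight,
				    bool fix_counter)
{
	bool warn;

	/* Dequeue side: the entity leaves the tree. */
	if (fix_counter)
		rq->nr_queued--;
	rq->tree_load -= weight;

	/* Placement: consult the guard, then look at the remaining load. */
	warn = toy_would_adjust_lag(rq) && rq->tree_load == 0;

	/* Enqueue side: the entity goes back into the tree. */
	rq->tree_load += weight;
	if (fix_counter)
		rq->nr_queued++;

	return warn;
}
```

Starting from a queue holding a single entity (nr_queued = 1), the model
reports the zero-load adjustment when the counter is left untouched and
does not when the counter is decremented before placement and
incremented afterwards.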

Additionally, move update_load_add() after place_entity(), consistent
with other place_entity() users -- this change is
non-functional, place_entity() does not use cfs_rq->load.

Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Reported-by: Mike Galbraith <efault@xxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Link: https://lkml.kernel.org/r/20250128143949.GD7145@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3781,6 +3781,7 @@ static void reweight_entity(struct cfs_r
 		update_entity_lag(cfs_rq, se);
 		se->deadline -= se->vruntime;
 		se->rel_deadline = 1;
+		cfs_rq->nr_queued--;
 		if (!curr)
 			__dequeue_entity(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
@@ -3807,10 +3808,11 @@ static void reweight_entity(struct cfs_r

 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
-		update_load_add(&cfs_rq->load, se->load.weight);
 		place_entity(cfs_rq, se, 0);
+		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
+		cfs_rq->nr_queued++;

 		/*
 		 * The entity's vruntime has been adjusted, so let's check