Re: [PATCH v2] sched/eevdf: Prevent vlag from going out of bounds when reweight_eevdf

From: Chen Yu
Date: Mon Apr 22 2024 - 09:53:29 EST


On 2024-04-22 at 21:12:12 +0800, Xuewen Yan wrote:
> Hi Peter,
>
> On Mon, Apr 22, 2024 at 7:17 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Apr 22, 2024 at 07:07:25PM +0800, Xuewen Yan wrote:
> > > On Mon, Apr 22, 2024 at 5:42 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Apr 22, 2024 at 04:33:37PM +0800, Xuewen Yan wrote:
> > > >
> > > > > On the Android system, the nice value of a task will change very
> > > > > frequently. The limit can also be exceeded.
> > > > > Maybe the !on_rq case is still necessary.
> > > > > So I'm planning to propose another patch for !on_rq case later after
> > > > > careful testing locally.
> > > >
> > > > So the scaling is: vlag = vlag * old_weight / weight
> > > >
> > > > But given that integer division is truncating, you would expect repeated
> > > > application of such scaling to eventually decrease the vlag instead
> > > > of growing it.
> > > >
> > > > Is there perhaps an invocation of reweight_task() missing? Looking at
> > >
> > > Is it necessary to add reweight_task in the prio_changed_fair()?
> >
> > I think that's the wrong place. Note how __setscheduler_params() already
> > has set_load_weight(). And all other callers of ->prio_changed() already
> > seem to do set_load_weight() as well.
> >
> > But that idle policy thing there still looks wrong, that sets the weight
> > very low but doesn't re-adjust anything.
>
> By adding a log to observe weight changes in reweight_entity, I found
> that calc_group_shares() often causes new_weight to become very small:
>

If I understand correctly, on_rq matters when doing the reweight.
In the following call trace, after the entity (task group) is dequeued from
the tree, on_rq is 0, so the subsequent update_cfs_group()->reweight_entity()
does not clamp the vlag because reweight_eevdf() is not invoked. Scaling
se->vlag by 237238/2 can then make it quite large.
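
To make the arithmetic concrete, here is a minimal userspace sketch, not the
kernel code. scale_vlag() and clamp_vlag() are hypothetical helpers, and the
starting vlag of -1000000 is a made-up value; the weights (237238 -> 2) and
the limit (3071999998) are taken from the log quoted below. It only applies
the scaling formula quoted above (vlag * old_weight / weight) and then shows
what a symmetric clamp to +/-limit would do to the result.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rescale a virtual lag when the weight changes,
 * using the formula quoted in this thread:
 *     vlag = vlag * old_weight / weight
 * Signed integer division in C truncates toward zero.
 */
static inline int64_t scale_vlag(int64_t vlag, int64_t old_weight,
				 int64_t new_weight)
{
	return vlag * old_weight / new_weight;
}

/*
 * Hypothetical helper: clamp vlag into [-limit, limit]. This is the
 * bounding step that is assumed to be skipped on the !on_rq path.
 */
static inline int64_t clamp_vlag(int64_t vlag, int64_t limit)
{
	if (vlag > limit)
		return limit;
	if (vlag < -limit)
		return -limit;
	return vlag;
}
```

With these helpers, scale_vlag(-1000000, 237238, 2) gives -118619000000,
which is far beyond the limit of 3071999998; passing that result through
clamp_vlag() with the same limit brings it back to -3071999998.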

thanks,
Chenyu

> Hardware name: Unisoc UMS-base Board (DT)
> Call trace:
> dump_backtrace+0xec/0x138
> show_stack+0x18/0x24
> dump_stack_lvl+0x60/0x84
> dump_stack+0x18/0x24
> reweight_entity+0x3e8/0x5f4
> dequeue_task_fair+0x448/0x948
> dequeue_task+0xc4/0x398
> deactivate_task+0x1c/0x28
> pull_tasks+0x200/0x334
> newidle_balance+0x3cc/0x438
> pick_next_task_fair+0x58/0x670
> __schedule+0x204/0x9a0
> schedule+0x128/0x1a8
> schedule_timeout+0x44/0x1c8
> __skb_wait_for_more_packets+0xd0/0x17c
> __unix_dgram_recvmsg+0xdc/0x3a8
> unix_seqpacket_recvmsg+0x64/0x74
> __sys_recvfrom+0x14c/0x1e4
> __arm64_sys_recvfrom+0x24/0x38
> invoke_syscall+0x58/0x114
> el0_svc_common+0xac/0xe0
> do_el0_svc+0x1c/0x28
> el0_svc+0x3c/0x70
> el0t_64_sync_handler+0x68/0xbc
> el0t_64_sync+0x1a8/0x1ac
> reweight_entity: the lag=-831088603030 vruntime=2086205903
> limit=3071999998 old_weight=237238 new_weight=2