Re: [RFC PATCH] sched/fair: update the vruntime to be max vruntime when yield

From: Xuewen Yan
Date: Tue Aug 22 2023 - 22:04:08 EST


Hi Vincent

Thank you for taking the time to reply!

On Tue, Aug 22, 2023 at 11:55 PM Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> On Mon, 21 Aug 2023 at 09:51, Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:
> >
> > Hi Vincent
> >
> > I have some questions to ask, and I hope you can help.
> >
> > Regarding this problem: on our platform, we found that the vruntime of
> > some tasks becomes abnormal over time, and tasks with an abnormal
> > vruntime end up never being scheduled (see the comparison sketch after
> > the task dump below).
> > The following are some tasks in the runqueue:
> > [status: curr] pid: 25501 prio: 116 vrun: 16426426403395799812
> > [status: skip] pid: 25496 prio: 116 vrun: 16426426403395800756
> > exec_start: 326203047009312 sum_ex: 29110005599
> > [status: pend] pid: 25497 prio: 116 vrun: 16426426403395800705
> > exec_start: 326203047002235 sum_ex: 29110508751
> > [status: pend] pid: 25321 prio: 130 vrun: 16668783152248554223
> > exec_start: 0 sum_ex: 16198728
> > [status: pend] pid: 25798 prio: 112 vrun: 17467381818375696015
> > exec_start: 0 sum_ex: 9574265
> > [status: pend] pid: 22282 prio: 120 vrun: 18010356387391134435
> > exec_start: 0 sum_ex: 53192
> > [status: pend] pid: 24259 prio: 120 vrun: 359915144918430571
> > exec_start: 0 sum_ex: 20508197
> > [status: pend] pid: 25988 prio: 120 vrun: 558552749871164416
> > exec_start: 0 sum_ex: 2099153
> > [status: pend] pid: 21857 prio: 124 vrun: 596088822758688878
> > exec_start: 0 sum_ex: 246057024
> > [status: pend] pid: 26614 prio: 130 vrun: 688210016831095807
> > exec_start: 0 sum_ex: 968307
> > [status: pend] pid: 14229 prio: 120 vrun: 816756964596474655
> > exec_start: 0 sum_ex: 793001
> > [status: pend] pid: 23866 prio: 120 vrun: 1313723379399791578
> > exec_start: 0 sum_ex: 1507038
> > ...
> > [status: pend] pid: 25970 prio: 120 vrun: 6830180148220001175
> > exec_start: 0 sum_ex: 2531884
> > [status: pend] pid: 25965 prio: 120 vrun: 6830180150700833203
> > exec_start: 0 sum_ex: 8031809
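> >
> > Those huge vrun values sit right below the top of the u64 range, while
> > others have already wrapped around to small values, so a plain unsigned
> > comparison between them gives the wrong ordering. Purely as an
> > illustration (a standalone userspace snippet with made-up values, not
> > the kernel code), the signed-delta idiom that the merged fix relies on
> > stays correct across the wraparound:
> >
> > #include <stdio.h>
> > #include <stdint.h>
> >
> > /* "a runs before b" -- same signed-delta idiom as entity_before() */
> > static int vruntime_before(uint64_t a, uint64_t b)
> > {
> >         return (int64_t)(a - b) < 0;
> > }
> >
> > int main(void)
> > {
> >         uint64_t a = UINT64_MAX - 500;  /* just below the wrap point */
> >         uint64_t b = a + 1000;          /* logically 1000 ahead, wraps to 499 */
> >
> >         /* unsigned compare: 0, i.e. it wrongly claims b is far behind a */
> >         printf("unsigned a < b: %d\n", a < b);
> >         /* signed delta: 1, a is still correctly ordered before b */
> >         printf("signed  before: %d\n", vruntime_before(a, b));
> >         return 0;
> > }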
> >
> > And according to your suggestion, we tested the patch:
> > https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@xxxxxxxxx/T/#u
> > With it applied, the above anomaly is gone.
> >
> > But when we tested with both patches:
> > https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@xxxxxxxxx/T/#u
> > and
> > https://lore.kernel.org/all/20230317160810.107988-1-vincent.guittot@xxxxxxxxxx/
> > Unfortunately, our issue occurred again.
> >
> > So we had to fall back on a workaround for our problem: change the
> > sleep-time check to a 60s threshold, i.e. from
> > +
> > + sleep_time -= se->exec_start;
> > + if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
> > + return true;
> >
> > to
> >
> > + sleep_time -= se->exec_start;
> > + if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> > + return true;
> >
> > With this change, the issue did not occur again.
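> >
> > For reference, here is a trimmed sketch of where that check sits; the
> > helper name and surrounding structure are taken from the
> > vruntime-sanitizing patch we were testing, so treat it as an
> > approximation rather than the exact upstream code:
> >
> > static inline bool entity_is_long_sleeper(struct sched_entity *se)
> > {
> >         u64 sleep_time;
> >
> >         /* never ran here, or the clock diverged while migrating */
> >         if (se->exec_start == 0)
> >                 return false;
> >
> >         sleep_time = rq_clock_task(rq_of(cfs_rq_of(se)));
> >         if (sleep_time <= se->exec_start)
> >                 return false;
> >
> >         sleep_time -= se->exec_start;
> >
> >         /* workaround: treat anything off the rq for over 60s as "long" */
> >         if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> >                 return true;
> >
> >         return false;
> > }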
> >
> > But this modification doesn't actually solve the real problem. And then
>
> yes, it resets the task's vruntime once the delta goes above 60s, but
> your problem is still there
>
> > Qais suggested that we try this patch:
> > https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@xxxxxxx/T/#u
>
> we have the commit below in v6.0 to fix the problem of the stalled
> clock update, instead of the above:
> commit e2f3e35f1f5a ("sched/fair: Decay task PELT values during wakeup
> migration")
>
> Which kernel version are you using?

We tested on kernel 5.4, and kernel 5.15 also seems to have this problem.

I will also test commit e2f3e35f1f5a ("sched/fair: Decay task PELT
values during wakeup migration") later.

>
> >
> > And we tested that patch (Android phone, monkey test with 60 APKs, 7 days).
> > The previous problem did not reproduce.
> >
> > We would really appreciate it if you could take a look at the patch
> > and help us see what is going wrong.
>
> I will look more deeply into how your yielding task and its vruntime
> can stay stalled for so long
>

Thanks Vincent!

BR
> >
> > Thanks!
> > BR
> >
> > ---
> > xuewen
> >
> > On Fri, Jun 30, 2023 at 10:40 PM Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > >
> > > Hi Xuewen
> > >
> > > On 03/01/23 16:20, Xuewen Yan wrote:
> > > > On Wed, Mar 1, 2023 at 4:09 PM Vincent Guittot
> > > > <vincent.guittot@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, 1 Mar 2023 at 08:30, Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Vincent
> > > > > >
> > > > > > I noticed the following patch:
> > > > > > https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@xxxxxxxxx/
> > > > > > And I notice the V2 had merged to mainline:
> > > > > > https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@xxxxxxxxx/T/#u
> > > > > >
> > > > > > The patch fixed the inverted vruntime comparison, and I see that in
> > > > > > my case some vruntimes are also inverted.
> > > > > > Which patch do you think will work for our scenario? I would be very
> > > > > > grateful if you could give us some advice.
> > > > > > I will try this patch in our tree.
> > > > >
> > > > > By default, use the one that is merged; the difference is mainly a
> > > > > matter of time range. Also be aware that the case of a newly migrated
> > > > > task is not fully covered by either patch.
> > > >
> > > > Okay, Thank you very much!
> > > >
> > > > >
> > > > > This patch fixes a problem with a long-sleeping entity in the presence
> > > > > of low-weight, always-running entities. That doesn't seem to match the
> > > > > description of your use case
> > > >
> > > > Thanks for the clarification! We will try it first to see whether it
> > > > resolves our problem.
> > >
> > > Did you get a chance to see if that patch helps? It'd be good to backport
> > > it to LTS if it does.
> > >
> > >
> > > Thanks
> > >
> > > --
> > > Qais Yousef