Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

From: Pavan Kondeti
Date: Wed Oct 24 2018 - 00:53:20 EST

Next message: Khalid Aziz: "Re: [PATCH] hugetlbfs: dirty pages as they are added to pagecache"
Previous message: Mike Kravetz: "[PATCH RFC v2 0/1] hugetlbfs: Use i_mmap_rwsem for pmd share and fault/trunc"
Next in thread: Vincent Guittot: "Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT"

Hi Vincent,

Thanks for the detailed explanation.

On Tue, Oct 23, 2018 at 02:15:08PM +0200, Vincent Guittot wrote:
> Hi Pavan,
>
> On Tue, 23 Oct 2018 at 07:59, Pavan Kondeti <pkondeti@xxxxxxxxxxxxxx> wrote:
> >
> > Hi Vincent,
> >
> > On Fri, Oct 19, 2018 at 06:17:51PM +0200, Vincent Guittot wrote:
> > >
> > > /*
> > > + * The clock_pelt scales the time to reflect the effective amount of
> > > + * computation done during the running delta time but then sync back to
> > > + * clock_task when rq is idle.
> > > + *
> > > + *
> > > + * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
> > > + * @ max capacity  ------******---------------******---------------
> > > + * @ half capacity ------************---------************---------
> > > + * clock pelt      | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
> > > + *
> > > + */
> > > +void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > > +{
> > > +
> > > +	if (is_idle_task(rq->curr)) {
> > > +		u32 divider = (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> > > +		u32 overload = rq->cfs.avg.util_sum + LOAD_AVG_MAX;
> > > +		overload += rq->avg_rt.util_sum;
> > > +		overload += rq->avg_dl.util_sum;
> > > +
> > > +		/*
> > > +		 * Reflecting some stolen time makes sense only if the idle
> > > +		 * phase would be present at max capacity. As soon as the
> > > +		 * utilization of a rq has reached the maximum value, it is
> > > +		 * considered as an always running rq without idle time to
> > > +		 * steal. This potential idle time is considered as lost in
> > > +		 * this case. We keep track of this lost idle time compared to
> > > +		 * rq's clock_task.
> > > +		 */
> > > +		if (overload >= divider)
> > > +			rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> > > +
> > > +
> >
> > I am trying to understand this better. I believe we run into this scenario, when
> > the frequency is limited due to thermal/userspace constraints. Let's say
>
> Yes, these are the most common use cases, but this can also happen after
> task migration or with a cpufreq governor that doesn't increase the OPP
> fast enough for the current utilization.
>
> > frequency is limited to Fmax/2. A 50% task at Fmax becomes 100% running at
> > Fmax/2. The utilization builds up to 100% after several periods.
> > The clock_pelt runs at 1/2 the speed of the clock_task. We are losing the idle time
> > all along. What happens when the CPU enters idle for a short duration and comes
> > back to run this 100% utilization task?
>
> If you are at 100%, we only apply the short idle duration
>
> >
> > If the above block is not present, i.e. lost_idle_time is not tracked, we
> > stretch the idle time (since clock_pelt is synced to clock_task) and the
> > utilization is dropped. Right?
>
> Yes, that's what would happen. I give more details below.
>
> >
> > With the above block, we don't stretch the idle time. In fact we don't
> > consider the idle time at all. Because,
> >
> > idle_time = now - last_time;
> >
> > idle_time = (rq->clock_pelt - rq->lost_idle_time) - last_time
> > idle_time = (rq->clock_task - rq_clock_task + rq->clock_pelt_old) - last_time
> > idle_time = rq->clock_pelt_old - last_time
> >
> > The last time is nothing but the last snapshot of the rq->clock_pelt when the
> > task entered sleep due to which CPU entered idle.
>
> The condition for dropping this idle time is quite important. This
> only happens when the utilization reaches max compute capacity of the
> CPU. Otherwise, the idle time will be fully applied

Right.

rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt

This tracks not only the idle time lost due to running slow but also the
absolute/real sleep time. For example, when the slow running 100% task
sleeps for 100 msec, aren't we ignoring that 100 msec of sleep?

For example, a task runs for 323 msec at full capacity and sleeps for
(1000-323) msec. When it wakes up, the utilization has dropped. If the same
task runs for 626 msec at half capacity and sleeps for (1000-626) msec,
shouldn't we drop the utilization by taking the (1000-626) msec of sleep
time into account? I understand why we don't stretch the idle time to
(1000-323) msec, but it is not clear to me why we drop the idle time
completely.
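
To make the concern concrete, here is a minimal sketch of how I read the
accounting. It is not the patch itself: struct toy_rq, the toy_* helpers and
the millisecond units are made up for illustration, and the `saturated` flag
simply stands in for the `overload >= divider` check.

```c
#include <stdint.h>

/* Toy model of the rq clocks discussed above; hypothetical, not kernel code. */
struct toy_rq {
	uint64_t clock_task;		/* unscaled task clock, in msec */
	uint64_t clock_pelt;		/* scaled PELT clock, in msec */
	uint64_t lost_idle_time;	/* idle time dropped while saturated */
};

/* Run for delta msec at cap/1024 of max capacity: clock_pelt advances slower. */
static void toy_run(struct toy_rq *rq, uint64_t delta, uint64_t cap)
{
	rq->clock_task += delta;
	rq->clock_pelt += (delta * cap) >> 10;
}

/*
 * Stay idle for delta msec, then update the clocks once.
 * Assumption: `saturated` replaces the overload >= divider test.
 */
static void toy_idle(struct toy_rq *rq, uint64_t delta, int saturated)
{
	rq->clock_task += delta;
	if (saturated)
		rq->lost_idle_time += rq->clock_task - rq->clock_pelt;
	/* The rq is idle, sync to clock_task. */
	rq->clock_pelt = rq->clock_task;
}

/* The "now" that PELT would see: clock_pelt minus the lost idle time. */
static uint64_t toy_pelt_now(const struct toy_rq *rq)
{
	return rq->clock_pelt - rq->lost_idle_time;
}
```

With toy_run(&rq, 626, 512) followed by toy_idle(&rq, 374, 1),
toy_pelt_now() is 313 both before and after the sleep, so the 374 msec of
real sleep is dropped together with the stretched part. With saturated == 0
the same sequence gives 1000, i.e. the full stretched idle time.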

>
> >
> > Can you please explain the significance of the above block with an example?
>
> The pelt signal reaches its max value after 323ms at full capacity,
> which means that we can't distinguish between tasks running
> 323ms, 500ms or more at max capacity. As a result, we consider that
> the CPU is fully used and there is no idle time when the utilization
> equals max capacity. If the CPU runs at half the capacity, it will run
> 626ms before reaching max utilization, and at that point we stop
> stretching the idle time because we consider that there is no idle time
> to stretch. By default, we never drop the idle time and we always apply
> it, which is necessary for being fully invariant. But we have to drop it
> when we consider that it would not have been present at max capacity
> either. That is the whole purpose of the block that you mention.

This is very clear.
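
For my own reference, the saturation test from the quoted block can be
restated as a stand-alone function. This is only a sketch under my reading
of the patch: the function name and the flat parameters are hypothetical
(the patch reads these values from rq->cfs.avg, rq->avg_rt and rq->avg_dl),
and the two constants are copied in so that it builds outside the kernel.

```c
#include <stdint.h>

#define LOAD_AVG_MAX		47742	/* maximum possible PELT sum */
#define SCHED_CAPACITY_SHIFT	10

/*
 * Return 1 when the summed cfs/rt/dl utilization has reached the maximum
 * value, i.e. when the rq is treated as always running and the idle time
 * is considered lost rather than stretched.
 */
static int toy_rq_is_saturated(uint32_t cfs_util_sum, uint32_t rt_util_sum,
			       uint32_t dl_util_sum, uint32_t period_contrib)
{
	uint32_t divider = (LOAD_AVG_MAX - 1024 + period_contrib) << SCHED_CAPACITY_SHIFT;
	uint32_t overload = cfs_util_sum + LOAD_AVG_MAX;

	overload += rt_util_sum;
	overload += dl_util_sum;

	return overload >= divider;
}
```

The extra LOAD_AVG_MAX added to overload means the test already fires
slightly below the exact maximum, which I assume is a margin for rounding
in the sums.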

>
> Let's take a task that runs 120ms with a period of 330ms.
> At max capacity, the task utilization will vary in the range [10-949].
> At half capacity, the task will run 240ms and the range will stay the
> same, as the idle time and the running time will be the same once
> stretched and scaled.
> At one third of the capacity, the task should run 360ms in a period of
> 330ms, which means that the task will always run and will probably even
> lose some events, as it will not have finished when the new period
> starts. In this case, the task/CPU utilization will reach the max value
> just like an always running task. As we can't tell the difference
> anymore, we consider that there is no idle time to recover once the
> cpu becomes idle, and the block of code that you mention above will
> cancel the stretch of idle time.
>

Got it.
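
The arithmetic of the 120ms/330ms example works out as in the sketch below.
The helper names are made up for illustration, and one third of the capacity
is approximated as 341 out of 1024.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024

/* Time needed at capacity `cap` for work that takes work_ms at max capacity. */
static uint64_t toy_runtime_at(uint64_t work_ms, uint64_t cap)
{
	return work_ms * SCHED_CAPACITY_SCALE / cap;
}

/* Idle time left in one period, or 0 when the task can never go idle. */
static uint64_t toy_idle_left(uint64_t work_ms, uint64_t period_ms, uint64_t cap)
{
	uint64_t run = toy_runtime_at(work_ms, cap);

	return run >= period_ms ? 0 : period_ms - run;
}
```

toy_runtime_at(120, cap) gives 120, 240 and 360 for cap = 1024, 512 and 341,
so toy_idle_left(120, 330, cap) gives 210, 90 and 0. Only the last case has
no idle time left to stretch, which is the case the block cancels.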

> >
> > > +
> > > +		/* The rq is idle, we can sync to clock_task */
> > > +		rq->clock_pelt = rq_clock_task(rq);
> > > +
> > > +
> > > +	} else {
> > > +		/*
> > > +		 * When a rq runs at a lower compute capacity, it will need
> > > +		 * more time to do the same amount of work than at max
> > > +		 * capacity: either because it takes more time to compute the
> > > +		 * same amount of work or because taking more time means
> > > +		 * sharing more often the CPU between entities.
> > > +		 * In order to be invariant, we scale the delta to reflect how
> > > +		 * much work has been really done.
> > > +		 * Running at lower capacity also means running longer to do
> > > +		 * the same amount of work and this results in stealing some
> > > +		 * idle time that will disturb the load signal compared to
> > > +		 * max capacity. This stolen idle time will be automatically
> > > +		 * reflected when the rq will be idle and the clock will be
> > > +		 * synced with rq_clock_task.
> > > +		 */
> > > +
> > > +		/*
> > > +		 * scale the elapsed time to reflect the real amount of
> > > +		 * computation
> > > +		 */
> > > +		delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));
> > > +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
> > > +
> > > +		rq->clock_pelt += delta;
> >
> > AFAICT, the rq->clock_pelt is used for both utilization and load. So the load
> > also becomes a function of CPU uarch now. Is this intentional?
>
> Yes, it is. Load is not scaled with uarch in the current implementation
> because the load would be capped by the max capacity of the local CPU
> and this would mess up the load balance.
>
> Let's take the example of CPU0 with a max capacity of 1024 and CPU1 with
> a max capacity of 512.
> We have 6 always running tasks with the same nice priority.
> Then, put 3 tasks on each CPU.
> If the load is scaled/capped with uarch, LB will consider the system
> balanced: 3*max_load / 1024 for CPU0 and 3*(max_load / 2) / 512 for
> CPU1. But tasks on CPU0 have twice the compute capacity of tasks on
> CPU1.
>
> With the new scaling, we don't have this problem anymore so we can
> take uarch into account and have a more accurate load.
>
Got it.
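
To double check the CPU0/CPU1 numbers, here is a small sketch of the
comparison. The helper name is hypothetical and max_load is taken as 1024
purely as a placeholder value; this is not how the load balancer is
implemented.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024

/* Load of a CPU relative to its capacity: nr_tasks * task_load / capacity. */
static uint64_t toy_relative_load(unsigned int nr_tasks, uint64_t task_load,
				  uint64_t capacity)
{
	return nr_tasks * task_load * SCHED_CAPACITY_SCALE / capacity;
}
```

If the per-task load on CPU1 is capped at max_load / 2, both
toy_relative_load(3, 1024, 1024) and toy_relative_load(3, 512, 512) give
3072 and the system looks balanced. If always running tasks reach max_load
on both CPUs, toy_relative_load(3, 1024, 512) gives 6144, so CPU1 shows up
as twice as loaded as CPU0.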

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
