Re: [PATCH] sched/fair: Make PELT signal more accurate

From: Vincent Guittot
Date: Wed Aug 09 2017 - 06:24:04 EST


On 9 August 2017 at 01:11, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
> Hi Vincent,
>
> On Mon, Aug 7, 2017 at 6:24 AM, Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
>> Hi Joel,
>>
>> On 4 August 2017 at 17:40, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>>> The PELT signal (sa->load_avg and sa->util_avg) are not updated if the amount
>>> accumulated during a single update doesn't cross a period boundary. This is
>>> fine in cases where the amount accrued is much smaller than the size of a
>>> single PELT window (1ms) however if the amount accrued is high then the
>>> relative error (calculated against what the actual signal would be had we
>>> updated the averages) can be quite high - as much 3-6% in my testing. On
>>> plotting signals, I found that there are errors especially high when we update
>>> just before the period boundary is hit. These errors can be significantly
>>> reduced if we update the averages more often.
>>>
>>> Inorder to fix this, this patch does the average update by also checking how
>>> much time has elapsed since the last update and update the averages if it has
>>> been long enough (as a threshold I chose 128us).
>>
>> Why 128us and not 512us as an example ?
>
> I picked it because I see it shows a good reduction in the error to
> fewer occurrences.
>
>> 128us threshold means that util/load_avg can be computed 8 times more
>> often and this means up to 16 times more call to div_u64
>
> Yes this is true, however since I'm using the 'delta' instead of
> period_contrib, its only does the update every 128us, however if
> several updates fall within a 128us boundary then those will be rate
> limited. So say we have a flood of updates, then the updates have to
> be spaced every 128us to reach the maximum number of division, I don't
> know whether this is a likely situation or would happen very often? I
> am planning to run some benchmarks and check that there is no
> regression as well as Peter mentioned about the performance aspect.
>
>>> In order to compare the signals with/without the patch I created a synthetic
>>> test (20ms runtime, 100ms period) and analyzed the signals and created a report
>>> on the analysis data/plots both with and without the fix:
>>> http://www.linuxinternals.org/misc/pelt-error.pdf
>>
>> The glitch described in page 2 shows a decrease of the util_avg which
>> is not linked to accuracy of the calculation but due to the use of the
>> wrong range when computing util_avg.
>
> Yes, and I corrected the graphs this time to show what its like after
> your patch and confirm that there is STILL a glitch. You are right
> that there isn't a reduction after your patch, however in my updated
> graphs there is a glitch and its not a downward peak but a stall in
> the update, the error is still quite high and can be as high as the
> absolute 2% error, in my update graphs I show an example where its ~
> 1.8% (18 / 1024).
>
> Could you please take a look at my updated document? I have included
> new graph and traces there and color coded them so its easy to
> correlate the trace lines to the error in the graph: Here's the
> updated new link:
> https://github.com/joelagnel/joelagnel.github.io/blob/master/misc/pelt-error-rev2.pdf

I see strange behavior in your rev2 document:
At timestamp 9.235635, we have util_avg=199 acc_util_avg=199
util_err=0. Everything looks fine but I don't this point on the graph
Then, at 9.235636 (which is the red colored faulty trace), we have
util_avg=182 acc_util_avg=200 util_err=18.
Firstly, this means that util_avg has been updated (199 -> 182) so the
error is not a problem of util_avg not been updated often enough :-)
Then, util_avg decreases (199 -> 182) whereas it must increase because
the task is still running. This should not happen and this is exactly
what commit 625ed2bf049d should fix. So either the patch is not
applied or it doesn't fix completely the problem.

That would be interesting to also display the last_update_time of sched_avg

>
>> commit 625ed2bf049d "sched/cfs: Make util/load_avg more stable" fixes
>> this glitch.
>> And the lower peak value in page 3 is probably linked to the inaccuracy
>
> This is not true. The reduction in peak in my tests which happen even
> after your patch is because of the dequeue that happens just before
> the period boundary is hit. Could you please take a look at the
> updated document in the link above? In there I show in the second
> example with a trace that corresponds the reduction in peak during the
> dequeue and is because of the delay in update. These errors go away
> with my patch.

There is the same strange behavior there:
When the reduction in peak happens, the util_avg is updated whereas
your concerns is that util_avg is not update often enough.
At timestamp 10.656683, we have util_avg=389 acc_util_avg=389 util_err=0
At timestamp 10.657420, we have util_avg=396 acc_util_avg=396
util_err=0. I don't see this point on the graph
At timestamp 10.657422, we have util_avg=389 acc_util_avg=399
util_err=10. This is the colored faulty trace but util_avg has been
updated from 369 to 389

Regards,
Vincent

>
>> I agree that there is an inaccuracy (the max absolute value of 22) but
>> that's in favor of less overhead. Have you seen wrong behavior because
>> of this inaccuracy ?
>
> I haven't tried to nail this to a wrong behavior however since other
> patches have been posted to fix inaccuracy and I do see we reach the
> theoretical maximum error on quite a few occassions, I think its
> justifiable. Also the overhead is minimal if updates aren't happening
> several times in a window, and at 128us interval, and the few times
> that the update does happen, the division is performed only during
> those times. So incases where it does fix the error, it does so with
> minimal overhead. I do agree with the overhead point and I'm planning
> to do more tests with hackbench to confirm overhead is minimal. I'll
> post some updates about it soon.
>
> Thanks!
>
> -Joel
>
>
>>
>>>
>>> With the patch, the error in the signal is significantly reduced, and is
>>> non-existent beyond a small negligible amount.
>>>
>>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>> Cc: Juri Lelli <juri.lelli@xxxxxxx>
>>> Cc: Brendan Jackman <brendan.jackman@xxxxxxx>
>>> Cc: Dietmar Eggeman <dietmar.eggemann@xxxxxxx>
>>> Signed-off-by: Joel Fernandes <joelaf@xxxxxxxxxx>
>>> ---
>>> kernel/sched/fair.c | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 4f1825d60937..1347643737f3 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2882,6 +2882,7 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>> unsigned long weight, int running, struct cfs_rq *cfs_rq)
>>> {
>>> u64 delta;
>>> + int periods;
>>>
>>> delta = now - sa->last_update_time;
>>> /*
>>> @@ -2908,9 +2909,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>> * accrues by two steps:
>>> *
>>> * Step 1: accumulate *_sum since last_update_time. If we haven't
>>> - * crossed period boundaries, finish.
>>> + * crossed period boundaries and the time since last update is small
>>> + * enough, we're done.
>>> */
>>> - if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
>>> + periods = accumulate_sum(delta, cpu, sa, weight, running, cfs_rq);
>>> +
>>> + if (!periods && delta < 128)
>>> return 0;
>>>
>>> /*
>>> --
>>> 2.14.0.rc1.383.gd1ce394fe2-goog
>>>