Re: [PATCH] sched/fair: Make PELT signal more accurate

From: Joel Fernandes
Date: Thu Aug 10 2017 - 11:22:46 EST

Next message: Anton Vasilyev: "[PATCH] hwmon:(stts751) buffer overread on wrong chip configuration"
Previous message: Suravee Suthikulpanit: "[RFC PATCH] sched/topology: Introduce NUMA identity node sched domain"
In reply to: Peter Zijlstra: "Re: [PATCH] sched/fair: Make PELT signal more accurate"
Next in thread: Joel Fernandes: "Re: [PATCH] sched/fair: Make PELT signal more accurate"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Aug 10, 2017 at 12:17 AM, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
> On 9 August 2017 at 19:51, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>> Hi Vincent,
>>
>> On Wed, Aug 9, 2017 at 3:23 AM, Vincent Guittot
>> <vincent.guittot@xxxxxxxxxx> wrote:
>> <snip>
>>>>
>>>> Yes this is true, however since I'm using the 'delta' instead of
>>>> period_contrib, its only does the update every 128us, however if
>>>> several updates fall within a 128us boundary then those will be rate
>>>> limited. So say we have a flood of updates, then the updates have to
>>>> be spaced every 128us to reach the maximum number of division, I don't
>>>> know whether this is a likely situation or would happen very often? I
>>>> am planning to run some benchmarks and check that there is no
>>>> regression as well as Peter mentioned about the performance aspect.
>>>>
>>>>>> In order to compare the signals with/without the patch I created a synthetic
>>>>>> test (20ms runtime, 100ms period) and analyzed the signals and created a report
>>>>>> on the analysis data/plots both with and without the fix:
>>>>>> http://www.linuxinternals.org/misc/pelt-error.pdf
>>>>>
>>>>> The glitch described in page 2 shows a decrease of the util_avg which
>>>>> is not linked to accuracy of the calculation but due to the use of the
>>>>> wrong range when computing util_avg.
>>>>
>>>> Yes, and I corrected the graphs this time to show what its like after
>>>> your patch and confirm that there is STILL a glitch. You are right
>>>> that there isn't a reduction after your patch, however in my updated
>>>> graphs there is a glitch and its not a downward peak but a stall in
>>>> the update, the error is still quite high and can be as high as the
>>>> absolute 2% error, in my update graphs I show an example where its ~
>>>> 1.8% (18 / 1024).
>>>>
>>>> Could you please take a look at my updated document? I have included
>>>> new graph and traces there and color coded them so its easy to
>>>> correlate the trace lines to the error in the graph: Here's the
>>>> updated new link:
>>>> https://github.com/joelagnel/joelagnel.github.io/blob/master/misc/pelt-error-rev2.pdf
>>>
>>> I see strange behavior in your rev2 document:
>>> At timestamp 9.235635, we have util_avg=199 acc_util_avg=199
>>> util_err=0. Everything looks fine but I don't this point on the graph
>>> Then, at 9.235636 (which is the red colored faulty trace), we have
>>> util_avg=182 acc_util_avg=200 util_err=18.
>>> Firstly, this means that util_avg has been updated (199 -> 182) so the
>>> error is not a problem of util_avg not been updated often enough :-)
>>> Then, util_avg decreases (199 -> 182) whereas it must increase because
>>> the task is still running. This should not happen and this is exactly
>>> what commit 625ed2bf049d should fix. So either the patch is not
>>> applied or it doesn't fix completely the problem.
>>
>> I think you are looking at wrong trace lines. The graph is generated
>> with for rq util only (cfs_rq == 1), so the lines in the traces you
>> should look at are the ones with cfs_rq= 1. Only cfs_rq==1 lines were
>> used to generate the graphs.
>
> Ah! this is quite confusing and not obvious that the trace is not for
> 1 signal but in fact 2 signals are interleaved and only 1 is displayed
> and that we have to filter them

Sorry, its my fault that I didn't make this fact clear enough in the
doc :-/. Thanks for your patience.

> So ok i can see that the trace with cfs_rq=1 is not updated. At the
> opposite, we can see that the other trace (for the se i assume) is
> updated normally whereas they are normally synced on the same clock

Ah, ok.
I also checked that the error can for the se as well, in following
example for cfs_rq=0 :

Lines filtered:

task_tick_fair: pelt_update: util_avg=359 load_avg=364
acc_load_avg=364 acc_util_avg=359 util_err=0 load_err=0
load_sum=17041923 sum_err=0 delta_us=977 cfs_rq=0 ret=1

task_tick_fair: pelt_update: util_avg=373 load_avg=377
acc_load_avg=377 acc_util_avg=373 util_err=0 load_err=0
load_sum=17656717 sum_err=0 delta_us=978 cfs_rq=0 ret=1

task_tick_fair: pelt_update: util_avg=373 load_avg=377
acc_load_avg=390 acc_util_avg=386 util_err=13 load_err=13
load_sum=18651021 sum_err=-994304 delta_us=971 cfs_rq=0

dequeue_task_fair: pelt_update: util_avg=396 load_avg=400
acc_load_avg=400 acc_util_avg=396 util_err=0 load_err=0
load_sum=18987624 sum_err=0 delta_us=720 cfs_rq=0 ret=1

So here we have 359, 373, 373, 396.

Lines unfiltered:

task_tick_fair: pelt_update: util_avg=349 load_avg=413
acc_load_avg=413 acc_util_avg=349 util_err=0 load_err=0
load_sum=19460993 sum_err=0 delta_us=980 cfs_rq=1 ret=1

task_tick_fair: pelt_update: util_avg=359 load_avg=364
acc_load_avg=364 acc_util_avg=359 util_err=0 load_err=0
load_sum=17041923 sum_err=0 delta_us=977 cfs_rq=0 ret=1

task_tick_fair: pelt_update: util_avg=363 load_avg=426
acc_load_avg=426 acc_util_avg=363 util_err=0 load_err=0
load_sum=20029072 sum_err=0 delta_us=977 cfs_rq=1 ret=1

task_tick_fair: pelt_update: util_avg=373 load_avg=377
acc_load_avg=377 acc_util_avg=373 util_err=0 load_err=0
load_sum=17656717 sum_err=0 delta_us=978 cfs_rq=0 ret=1

task_tick_fair: pelt_update: util_avg=376 load_avg=438
acc_load_avg=438 acc_util_avg=376 util_err=0 load_err=0
load_sum=20584978 sum_err=0 delta_us=978 cfs_rq=1 ret=1

task_tick_fair: pelt_update: util_avg=373 load_avg=377
acc_load_avg=390 acc_util_avg=386 util_err=13 load_err=13
load_sum=18651021 sum_err=-994304 delta_us=971 cfs_rq=0 ret=0

task_tick_fair: pelt_update: util_avg=389 load_avg=450
acc_load_avg=450 acc_util_avg=389 util_err=0 load_err=0
load_sum=21120780 sum_err=0 delta_us=971 cfs_rq=1 ret=1

dequeue_task_fair: pelt_update: util_avg=396 load_avg=400
acc_load_avg=400 acc_util_avg=396 util_err=0 load_err=0
load_sum=18987624 sum_err=0 delta_us=720 cfs_rq=0 ret=1

I hope to finish the perf tests today that Peter discussed and provide
an update,

thanks!

-Joel

>> In this you will see rq util_avg change as follows: 165 -> 182 -> 182
>> (missed an update causing error). This is also reflected in the graph
>> in the graph where you see the flat green line.
>>
>>>
>>> That would be interesting to also display the last_update_time of sched_avg
>>>
>>>>
>>>>> commit 625ed2bf049d "sched/cfs: Make util/load_avg more stable" fixes
>>>>> this glitch.
>>>>> And the lower peak value in page 3 is probably linked to the inaccuracy
>>>>
>>>> This is not true. The reduction in peak in my tests which happen even
>>>> after your patch is because of the dequeue that happens just before
>>>> the period boundary is hit. Could you please take a look at the
>>>> updated document in the link above? In there I show in the second
>>>> example with a trace that corresponds the reduction in peak during the
>>>> dequeue and is because of the delay in update. These errors go away
>>>> with my patch.
>>>
>>> There is the same strange behavior there:
>>> When the reduction in peak happens, the util_avg is updated whereas
>>> your concerns is that util_avg is not update often enough.
>>> At timestamp 10.656683, we have util_avg=389 acc_util_avg=389 util_err=0
>>> At timestamp 10.657420, we have util_avg=396 acc_util_avg=396
>>> util_err=0. I don't see this point on the graph
>>> At timestamp 10.657422, we have util_avg=389 acc_util_avg=399
>>> util_err=10. This is the colored faulty trace but util_avg has been
>>> updated from 369 to 389
>>
>> Yeah, same thing here, you should look at the lines with cfs_rq == 1.
>> The util changes as: 363 -> 376 -> 389 -> 389 (missed update).
>>
>>
>> thanks,
>>
>> -Joel
>>
>>
>>>
>>> Regards,
>>> Vincent
>>>
>>>>
>>>>> I agree that there is an inaccuracy (the max absolute value of 22) but
>>>>> that's in favor of less overhead. Have you seen wrong behavior because
>>>>> of this inaccuracy ?
>>>>
>>>> I haven't tried to nail this to a wrong behavior however since other
>>>> patches have been posted to fix inaccuracy and I do see we reach the
>>>> theoretical maximum error on quite a few occassions, I think its
>>>> justifiable. Also the overhead is minimal if updates aren't happening
>>>> several times in a window, and at 128us interval, and the few times
>>>> that the update does happen, the division is performed only during
>>>> those times. So incases where it does fix the error, it does so with
>>>> minimal overhead. I do agree with the overhead point and I'm planning
>>>> to do more tests with hackbench to confirm overhead is minimal. I'll
>>>> post some updates about it soon.
>>>>
>>>> Thanks!
>>>>
>>>> -Joel
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> With the patch, the error in the signal is significantly reduced, and is
>>>>>> non-existent beyond a small negligible amount.
>>>>>>
>>>>>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>>>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>>>>> Cc: Juri Lelli <juri.lelli@xxxxxxx>
>>>>>> Cc: Brendan Jackman <brendan.jackman@xxxxxxx>
>>>>>> Cc: Dietmar Eggeman <dietmar.eggemann@xxxxxxx>
>>>>>> Signed-off-by: Joel Fernandes <joelaf@xxxxxxxxxx>
>>>>>> ---
>>>>>> kernel/sched/fair.c | 8 ++++++--
>>>>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 4f1825d60937..1347643737f3 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -2882,6 +2882,7 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>>>>> unsigned long weight, int running, struct cfs_rq *cfs_rq)
>>>>>> {
>>>>>> u64 delta;
>>>>>> + int periods;
>>>>>>
>>>>>> delta = now - sa->last_update_time;
>>>>>> /*
>>>>>> @@ -2908,9 +2909,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>>>>> * accrues by two steps:
>>>>>> *
>>>>>> * Step 1: accumulate *_sum since last_update_time. If we haven't
>>>>>> - * crossed period boundaries, finish.
>>>>>> + * crossed period boundaries and the time since last update is small
>>>>>> + * enough, we're done.
>>>>>> */
>>>>>> - if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
>>>>>> + periods = accumulate_sum(delta, cpu, sa, weight, running, cfs_rq);
>>>>>> +
>>>>>> + if (!periods && delta < 128)
>>>>>> return 0;
>>>>>>
>>>>>> /*
>>>>>> --
>>>>>> 2.14.0.rc1.383.gd1ce394fe2-goog
>>>>>>
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@xxxxxxxxxxxx
>

Next message: Anton Vasilyev: "[PATCH] hwmon:(stts751) buffer overread on wrong chip configuration"
Previous message: Suravee Suthikulpanit: "[RFC PATCH] sched/topology: Introduce NUMA identity node sched domain"
In reply to: Peter Zijlstra: "Re: [PATCH] sched/fair: Make PELT signal more accurate"
Next in thread: Joel Fernandes: "Re: [PATCH] sched/fair: Make PELT signal more accurate"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]