Re: [RFC][PATCH] sched: attach extra runtime to the right avg

From: Peter Zijlstra
Date: Tue Jul 04 2017 - 08:40:15 EST


On Tue, Jul 04, 2017 at 02:21:50PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 04, 2017 at 12:13:09PM +0200, Ingo Molnar wrote:
> >
> > This code on the other hand:
> >
> > sa->last_update_time += delta << 10;
> >
> > ... in essence creates a whole new absolute clock value that slowly but surely is
> > drifting away from the real rq->clock, because 'delta' is always rounded down to
> > the nearest 1024 ns boundary, so we accumulate the 'remainder' losses.
> >
> > That is because:
> >
> > delta >>= 10;
> > ...
> > sa->last_update_time += delta << 10;
> >
> > Given enough time, ->last_update_time can drift a long way, and this delta:
> >
> > delta = now - sa->last_update_time;
> >
> > ... becomes meaningless AFAICS, because it's essentially two different clocks that
> > get compared.
>
> Thing is, once you drift over 1023 (ns) your delta increases and you
> catch up again.
>
>
>
> A B C D E F
> | | | | | |
> +----+----+----+----+----+----+----+----+----+----+----+
>
>
> A: now = 0
> sa->last_update_time = 0
> delta := (now - sa->last_update_time) >> 10 = 0
>
> B: now = 614 (+614)
> delta = (614 - 0) >> 10 = 0
> sa->last_update_time += 0 (0)
> sa->last_update_time = now & ~1023 (0)
>
> C: now = 1843 (+1229)
> delta = (1843 - 0) >> 10 = 1
> sa->last_update_time += 1024 (1024)
> sa->last_update_time = now & ~1023 (1024)
>
>
> D: now = 3481 (+1638)
> delta = (3481 - 1024) >> 10 = 2
> sa->last_update_time += 2048 (3072)
> sa->last_update_time = now & ~1023 (3072)
>
> E: now = 5734 (+2253)
> delta = (5734 - 3072) = 2
> sa->last_update_time += 2048 (5120)
> sa->last_update_time = now & ~1023 (5120)
>
> F: now = 6348 (+614)
> delta = (6348 - 5120) >> 10 = 1
> sa->last_update_time += 1024 (6144)
> sa->last_update_time = now & ~1023 (6144)
>
>
>
> And you'll see that both are identical, and that both D and F have
> gotten a spill from sub-chunk accounting.


Where the two approaches differ is when we have different modifications
to sa->last_update_time (and we do).

The differential (+=) one does not mandate initial value of
->last_update_time has the bottom 9 bits cleared. It will simply
continue from wherever.

The absolute (&) one however mandates that ->last_update_time always has
the bottom few bits 0, otherwise we can 'gain' time. The first iteration
will clear those bits and we'll then double account them.

It so happens that we have an explicit assign in migrate
(attach_entity_load_avg / set_task_rq_fair). And on negative delta. In
all those cases we use the immediate 'now' value, no clearing of bottom
bits.

The differential should work fine with that, the absolute one has double
accounting issues in that case.

So it would be very good to find what exactly causes Josef's workload to
get 'fixed'.