Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()
From: Paul Turner
Date: Fri Mar 31 2017 - 05:59:43 EST
On Fri, Mar 31, 2017 at 12:01 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Thu, Mar 30, 2017 at 03:02:47PM -0700, Paul Turner wrote:
>> On Thu, Mar 30, 2017 at 7:14 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> > On Thu, Mar 30, 2017 at 02:16:58PM +0200, Peter Zijlstra wrote:
>> >> On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote:
>> >
>> >> > > +
>> >> > > + if (unlikely(periods >= LOAD_AVG_MAX_N))
>> >> > > return LOAD_AVG_MAX;
>> >
>> >> >
>> >> > Is this correct in the iterated periods > LOAD_AVG_MAX_N case?
>> >> > I don't think the decay above is guaranteed to return these to zero.
>> >>
>> >> Ah!
>> >>
>> >> Indeed, so decay_load() needs LOAD_AVG_PERIOD * 63 before it truncates
>> >> to 0, because every LOAD_AVG_PERIOD we halve the value; lose 1 bit; so
>> >> 63 of those and we're 0.
>> >>
>> >> But __accumulate_sum() OTOH returns LOAD_AVG_MAX after only
>> >> LOAD_AVG_MAX_N, which is < LOAD_AVG_PERIOD * 63.
>> >>
>> >> So yes, combined we exceed LOAD_AVG_MAX, which is bad. Let me think what
>> >> to do about that.
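[ For reference, a minimal model of the decay being talked about here --
  constants as in the PELT code of this era; this is not the real
  decay_load(), which also applies the fractional y^(n % LOAD_AVG_PERIOD)
  factor via a lookup table: ]

    #define LOAD_AVG_PERIOD 32     /* half-life of the average, in periods */
    #define LOAD_AVG_MAX    47742  /* ~ 1024 * \Sum_{n>=0} y^n             */
    #define LOAD_AVG_MAX_N  345    /* periods for the sum to saturate      */

    /*
     * Every LOAD_AVG_PERIOD periods the value is halved, i.e. it loses one
     * bit; a 64-bit sum is therefore only guaranteed to truncate to 0 after
     * 63 half-lives: 63 * 32 = 2016 periods, far more than LOAD_AVG_MAX_N.
     */
    static unsigned long long decay_model(unsigned long long val,
                                          unsigned long long periods)
    {
            if (periods > LOAD_AVG_PERIOD * 63)
                    return 0;
            /* coarse: ignores the fractional y^(periods % 32) correction */
            return val >> (periods / LOAD_AVG_PERIOD);
    }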
>> >
>> >
>> > So at the very least it should be decay_load(LOAD_AVG_MAX, 1) (aka
>> > LOAD_AVG_MAX - 1024), but that still doesn't account for the !0
>> > decay_load() of the first segment.
>> >
>> > I'm thinking that we can compute the middle segment, by taking the max
>> > value and chopping off the ends, like:
>> >
>> >
>> >     c2 = 1024 \Sum_{n=1}^{p} y^n
>> >
>> >        = 1024 ( \Sum_{n=0}^{inf} y^n - \Sum_{n=p}^{inf} y^n - y^0 )
>> >
>> >
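[ Reading that identity through, the middle segment would reduce to a single
  decay of the series limit -- a sketch only, using the existing decay_load()
  helper and LOAD_AVG_MAX == 1024 * \Sum_{n>=0} y^n, with c2/periods named as
  in the discussion above: ]

    /*
     * Decaying the full series by 'periods' leaves 1024 * \Sum_{n>=p} y^n,
     * and subtracting a final 1024 drops the y^0 term, leaving only the
     * middle segment of the sum.
     */
    c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;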
>>
>> So this is endemic to what I think is a deeper problem:
>
> I think not.
>
> 47742*e((345/32)*l(.5))
> 27.12819932019487579284
>
> So even without weight, 345 periods isn't enough to flatten
> LOAD_AVG_MAX.
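[ That is, decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) = 47742 * y^345
  = 47742 * (1/2)^(345/32) ~= 27: after the 345 periods at which
  __accumulate_sum() already returns the saturated LOAD_AVG_MAX, the decayed
  old sum still contributes ~27, so the two combined can exceed
  LOAD_AVG_MAX. ]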
>
>> The previous rounds of optimization folded weight into load_sum. I
>> think this introduced a number of correctness problems:
>
> I'll argue you on correctness; but these might well be problems indeed.
Yeah, this side of it will never be numerically perfect. I was just
suggesting that we can more easily clamp to LOAD_AVG_MAX directly if
weight is not accumulated into the sum.
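[ A sketch of what "clamp directly" could look like -- assuming load_sum no
  longer carries the weight, so it is bounded by the geometric series limit;
  'sa' is the usual struct sched_avg pointer; not the actual patch: ]

    /* with weight kept out of load_sum, overflow clips cleanly at the limit */
    sa->load_sum = min_t(u64, sa->load_sum, LOAD_AVG_MAX);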
>
>> a) the load_avg is no longer independent of weight; a high weight
>> entity can linger for eons. [63 LOAD_AVG_PERIODS]
>
> Certainly longer than before, agreed.
>
>> b) it breaks the dynamic response of load_avg on a weight change.
>> While nice is not common, there's a case for which this is really
>> important: cgroups with a low number of threads running. E.g. when we
>> transition from 1->2 threads we immediately halve the weight, but
>> because of the folding it takes a very long time to become numerically
>> correct again.
>
> I think this is a matter of semantics, not correctness. We did have that
> weight in the past, so our past average including that isn't incorrect
> per se.
Challenge accepted ;)
I agree, but it's not correct relative to the numerical
interpretations we actually want to use. We examine these values for
forward-looking decisions, e.g. "if we move this thread, how much load
are we moving?", and for vruntime calculations.
E.g. consider the case above: suppose we have 2 nice-0 threads, one
has been running on 0, one wakes up on 1. The load-balancer will see
a false imbalance between 0 and 1 [as well as potentially other cores],
when they should be ~equal (with some obvious caveats on what the
threads are actually doing).
Further, this weight is specifically bound to cpu 0, meaning that we
will also see inconsistent future behavior, dependent on wake-up
placement.
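[ To put rough, purely illustrative numbers on that, assuming a group with
  1024 total shares and the ~32ms PELT half-life: before the wakeup cpu 0's
  group entity has built up a load_avg of ~1024 with the old weight folded
  in; after the wakeup each cpu's share should be ~512, but the folded-in
  history only decays geometrically toward it: ~768 after one half-life,
  ~640 after two, and so on.  If the weight were applied only when computing
  load_avg, cpu 0 would read ~512 as soon as the weight changed. ]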
>
> Similarly, our vruntime includes increments based on prior weight.
>
> Now; you're arguing that this is undesired, and this might well be.
Yes, this leads to cross- and intra-group fairness issues. Taking the
same example as above:
- The thread running on 0 will receive additional time versus the
thread on 1 (see the vruntime note below).
- However, the thread on 1 will be correctly weighted at 50%, so
between 0 and 1 we'll see more time in competition with an
equivalently weighted external group than we should.
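[ The vruntime side of this, for reference: vruntime advances as
  delta_exec * NICE_0_LOAD / se->load.weight (calc_delta_fair()), so a group
  entity whose per-cpu share is still inflated by the stale, weight-folded
  load accrues vruntime more slowly, and its thread gets the extra time
  described in the first bullet. ]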
>
>> c) It doesn't work with scale_load_down and fractional shares below
>> SCHED_LOAD_SCALE [we multiply in a zero -> zero rq load]
>
> Yes, I noticed this, and this is indeed undesired.
>
>> d) It makes doing stability/clipping above a nightmare.
>
> Unsure what exactly you're referring to; the scale_load_down() boundary
> at 0? How is this a separate point from c then?
This is:
a) the clamp to LOAD_AVG_MAX above on overflow [clip]
b) the waggle in weights as the number of runnable group entities
increases/decreases, which means that the sum does not converge nearly
as tightly, even with consistent behavior. [stability]
>
>> I think it's actually *costing* us cycles, since we end up multiplying
>> in the weight on every partial [load_sum] update, but we only actually
>> need to compute it when updating load_avg [which is per period
>> overflow].
>
> So let's pull it out again -- but I don't think we need to undo all of
> yuyang's patches for that. So please, look at the patch I proposed for
> the problem you spotted. Let's fix the current state and take it from
> there, ok?
Oh, absolutely. I'm really not proposing re-vulcanizing any rubber
here. I'm just saying that this particular problem is one facet of the
weight mixing. 100% agree on fixing this as-is and iterating. I was
actually thinking that these changes (which really nicely simplify the
calculation) greatly simplify restoring the separation there.
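[ For concreteness, one way that separation could look -- a sketch only, with
  'contrib' standing in for the time/geometry accumulation, 'sa'/'se' the
  usual sched_avg / sched_entity pointers, and scale_load_down() handling
  elided: ]

    /* accumulate only time and geometry; no weight folded into load_sum */
    sa->load_sum += contrib;

    /* apply the weight once, when the average is materialized per period */
    sa->load_avg = div_u64(se->load.weight * sa->load_sum, LOAD_AVG_MAX);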