RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

From: Doug Smythies
Date: Tue Oct 29 2019 - 10:55:41 EST

On 2019.10.28 01:22 Vincent Guittot wrote:
> On Sat, 26 Oct 2019 at 08:59, Doug Smythies <dsmythies@xxxxxxxxx> wrote:
>> On 2010.10.25 09:51 Vincent Guittot wrote:
>>> On Fri, 25 Oct 2019 at 17:55, Doug Smythies <doug.smythies@xxxxxxxxx> wrote:
>>>> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
>>>> which, in turn, was a re-apply of
>>>> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
>>>> after it was reverted via
>>>> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
>>>> For an idle system, the cfs_rq_is_decayed function components can underflow to 0 and
>>>> incorrectly return TRUE, when the item should not be deleted from the list.
>>> The patch from Rik solves the problem of cfs_rq_is_decayed wrongly returns true
>> Not for my use case.
>> I applied Rik's patch to kernel 5.4-rc2 (since all my other reference
>> test data had been acquired against a base of 5.4-rc2). Tests results
>> are similar to the non-reverted kernel (results added in-line
>> below).
> Ok.
> I have studied a bit more your results below and IIUC your problem,
> some periodic wakes up (every 4sec) are missing with kernel 5.4-rc2

Actually, I don't know that the periodic wake ups are missing, I only
know that the intel_pstate CPU scaling driver is not being called.
This has been since kernel 5.1-rc1.
I bisected the kernel and found the patch that is this subject.
Then I used kernel 5.4-rc2 as my baseline for the data submitted.

Looking at the number of clocks cycles that have being used since the last
call to the driver suggests that the CPU has been doing something between
the long time between calls:
For the example given, there were 20,716,861 active clocks in 327.175
seconds. At 4 seconds per doing something that's 253,286 clocks each,
which is consistent (o.k. the variability is high) with actual data
(and, for example, see the aperf number of 231,813 clocks for the
4 second example given below).

> that helps cpuidle to enters deeper idle state after each new wake up
> until reaching the deepest state, isn't it ?

Well, it is the delay in the intel_pstate driver calls that is the
root issue. Dragging idle into it was just the use case example that
started this investigation.

> My 1st point is that this code doesn't use timer or hrtimer to wake up
> the system but only take advantage of the wake up of something else to
> update the blocked load. So I don't see how this patch could remove
> the 4sec periodic wakeup of the watchdog timer that you are
> mentioning.

I don't know that it is, as mentioned above.

> Then, when a system is idle and not used, the load should obviously be
> null most of the time and the update of decayed load should not happen
> anyway. It looks like you take advantage of some spurious and
> un-necessary wake up to help cpuidle to reach deeper idle state. Is
> this understanding correct ?

I don't know.

I only know that the call to the intel_pstate driver doesn't
happen, and that it is because cfs_rq_is_decayed returns TRUE.
So, I am asserting that the request is not actually decayed, and
should not have been deleted. Furthermore, I am wondering if other
tasks that should be run are suffering the same fate.

Now, if we also look back at the comments for the original commit:

"In an edge case where temporary cgroups were leaking, this
caused the kernel to consume good several tens of percents of
CPU cycles running update_blocked_averages(), each run taking
multiple millisecs."

To my way of thinking: Fix the leak, don't program around it; The
commit breaks something else, so revert it.

> Then, without or without removing the cfs_rq from the list, we should
> end up with rq->has_blocked_load == 0 and nohz.has_blocked == 0 too.
> The only main impact will be the duration of the loop that can be
> significantly shorter when you have a lot of rqs and cgroups.

I'm not following.

>>>> Signed-off-by: Doug Smythies <dsmythies@xxxxxxxxx>
>>>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>>> Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>>> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>> Cc: sargun@xxxxxxxxx
>>>> Cc: tj@xxxxxxxxxx
>>>> Cc: xiexiuqi@xxxxxxxxxx
>>>> Cc: xiezhipeng1@xxxxxxxxxx
>>>> Cc: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
>>>> ---
>>>> Note 1: Both this reversion and just deleting the cfs_rq_is_decayed function
>>>> and it's call and leaving the other changes have been tested. I do not know
>>>> which solution is better. (ie for the "list_for_each_entry_rcu" part of it.)
>>>> Note 2: Previous controversy over this patch was based on heavy workloads,
>>>> but this is based on minimal or no workload, or "idle".
>>>> Where "idle" on my test server, with no gui and many services disabled,
>>>> tends to mean more "idle" than most systems.
>>>> Note 3: While this supporting data only involves the intel_pstate CPU
>>>> frequency scaling driver as a casualty, it is beyond my capabilities
>>>> to determine what other tasks that should be running might be omitted.
>>>> Use case example 1:
>>>> System Idle: The intel pstate CPU frequency scaling driver:
>>>> Mode: Active, non-hwp, powersave governor.
>>>> Expected behaviour: There is never ever a duration (time between calls to
>>>> the driver / per CPU) longer than 4 seconds (the watchdog time, I think).
>>>> Actual behaviour: There are long long gaps between calls to the driver:
>>>> Kernel: 5.4-rc2 CPU:7
>>>> duration: 327.17 Seconds. (this is one of many hundreds of examples.)
>>>> mpref: 44023326
>>>> apref: 20716861
>>>> tsc: 1.11604E+12
>>>> load: 0
>>>> CPU frequency: 1.6053 GHz (average over this 327 second sample period).
>>>> old pstate: 16 (the lowest for my processor)
>>>> new pstate: 16
>>>> Kernel: 5.4-rc2 + reversion (either method)
>>>> After several hours of testing, maximum durations were never more
>>>> than 4 seconds (well plus some jitter).
>>>> reversion method: max=4.07908 seconds
>>>> CPU:7
>>>> mperf: 492578
>>>> apref: 231813 (56,829 per second average is consistent with other tests)
>>>> tsc: 13914264074
>>>> load: 0
>>>> CPU frequency: 1.6052 GHz
>>>> old pstate: 16 (the lowest for my precessor)
>>>> new pstate: 16
>>>> On average, the non-reverted kernel executes the driver 25% less
>>>> than the reverted kernel during idle.
>> On (shorter)average, the Rik patched kernel executes the driver
>> 14% less than the reverted kernel during idle.
>> Longer and repeated testing would be required to determine if
>> this is a trend or simply non-repeatable noise.

The difference in probabilities of the issue occurring does appear to be
somewhat consistent. Not sure what it means, if anything.

>>>> O.K. so who cares, the requested pstate doesn't change?
>>>> First, one wonders if the math could overflow.
>>>> (although 7180ddd suggests maybe it won't)
>>>> Second, the sample is largely dominated by obsolete information.
>>>> Third, this can be problematic, and potentially wastes energy,
>>>> for the busy to idle transition.
>>>> Use case example 2:
>>>> The busy to idle transition:
>>>> Typically, the pstate request response to a busy to idle transition
>>>> is very slow because the duration suddenly goes from, typically,
>>>> 10 milliseconds to much much longer, up to 4 seconds. Transition
>>>> times to the system being fully idle, with all requested pstates
>>>> being at minimum, takes around 8 seconds with this reversion,
>>>> and, potentially, a very very long time (over 100 seconds has been
>>>> measured) without.
>>>> Again, so who cares, if the processor is in a deep idle state anyway,
>>>> not consuming much energy? O.K. but what if it is in an idle state
>>>> where energy consumption is a function of the requested pstate?
>>>> For example, for my processor (i7-2600K), idle state 1, then processor
>>>> package energy can be over double what it should be for many 10s of
>>>> seconds.
>>>> Experiment method:
>>>> enable only idle state 1
>>>> Dountil stopped
>>>> apply a 100% load (all CPUs)
>>>> after awhile (about 50 seconds) remove the load.
>>>> allow a short transient delay (1 second).
>>>> measure the processor package joules used over the next 149 seconds.
>>>> Enduntil
>>>> Kernel k5.4-rc2 + reversion (this method)
>>>> Average processor package power: 9.148 watts (128 samples, > 7 hours)
>>>> Minimum: 9.02 watts
>>>> Maximum: 9.29 watts
>>>> Note: outlyer data point group removed, as it was assumed the computer
>>>> had something to do and wasn't actually "idle".
>>>> Kernel 5.4-rc2:
>>>> Average processor package power: 9.969 watts (150 samples, > 8 hours)
>>>> Or 9% more energy for the idle phases of the work load.
>>>> Minimum: 9.15 watts
>>>> Maximum: 13.79 watts (51% more power)
>> Kernel 5.4-rc2 + Rik-patch:
>> Average processor package power: 9.85 watts (53 samples, < 3 hours)
>> Or 7.7% more energy for the idle phases of the work load.
>> Minimum: 9.23 watts
>> Maximum: 12.79 watts (40% more power)