Re: [RFC/RFT][PATCH] cpuidle: New timer events oriented governor for tickless systems
From: Rafael J. Wysocki
Date: Mon Oct 15 2018 - 03:52:22 EST
Hi Doug,
On Sun, Oct 14, 2018 at 8:53 AM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
>
> Hi Rafael,
>
> I tried your TEO idle governor.
Thanks!
> On 2018.10.11 14:02 Rafael J. Wysocki wrote:
>
> ...[cut]...
>
> > It has been tested on a few different systems with a number of
> > different workloads and compared with the menu governor. In the
> > majority of cases the workloads performed similarly regardless of
> > the cpuidle governor in use, but in one case the TEO governor
> > allowed the workload to perform 75% better, which is a clear
> > indication that some workloads may benefit from using it quite
> > a bit depending on the hardware they run on.
>
> Could you supply more detail for the 75% better case, so that
> I can try to repeat the results on my system?
This was encryption on Skylake X, but I'll get more details on that later.
> ...[cut]...
>
> > It is likely to select the "polling" state less often than menu
> > due to the lack of the extra latency limit derived from the
> > predicted idle duration, so the workloads depending on that
> > behavior may be worse off (but less energy should be used
> > while running them at the same time).
>
> Yes, and I see exactly that with the 1-core pipe test: less
> performance (~10%), but also less processor package power
> (~3%), compared to the 8-patch set results from the other day.
>
> The iperf test (running 3 clients at once) results were similar
> for both power and throughput.
>
> > Overall, it selects deeper idle states than menu more often, but
> > that doesn't seem to make a significant difference in the majority
> > of cases.
>
> Not always: that vicious powernightmare sweep test that I run used
> way, way more processor package power and spent a staggering amount
> of time in idle state 0. [1].
Can you please remind me what exactly the workload is in that test?
>
> ... [cut]...
>
> > + * The sleep length is the maximum duration of the upcoming idle time of the
> > + * CPU and it is always known to the kernel. Using it alone for selecting an
> > + * idle state for the CPU every time is a viable option in principle, but that
> > + * might lead to suboptimal results if the other wakeup sources are more active
> > + * for some reason. Thus this governor estimates whether or not the CPU idle
> > + * time is likely to be significantly shorter than the sleep length and selects
> > + * an idle state for it in accordance with that, as follows:
>
> There is something wrong here, in my opinion, but I have not been able to
> isolate exactly where just by staring at the code.
> Read on.
>
> ... [cut]...
>
> > + * Assuming an idle interval every second tick, take the maximum number of CPU
> > + * wakeups regarded as recent to roughly correspond to 10 minutes.
> > + *
> > + * When the total number of CPU wakeups goes above this value, all of the
> > + * counters corresponding to the given CPU undergo a "decay" and the counting
> > + * of recent events starts over.
> > + */
> > +#define TEO_MAX_RECENT_WAKEUPS (300 * HZ)
>
> In my opinion, there are problems with this approach:
>
> First, the huge range of possible times between decay events,
> anywhere from ~ a second to approximately a week.
> On an idle 1000 HZ system, at 2 idle entries per 4-second watchdog event:
> time = 300,000 wakes * 2 seconds/wake = 6.9 days
> Note: The longest single idle time I measured was 3.5 seconds, but that is
> always combined with a shorter one. Even with a more realistic, and
> just now measured, average of 1.4 idles/second it would be 2.4 days.
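> 
> Just to make that scaling concrete, here is a rough back-of-the-envelope
> sketch (not governor code; HZ and the wakeup rates are only the example
> figures used above):
> 
>     /*
>      * Time between decay events as a function of the per-CPU wakeup
>      * rate, given the TEO_MAX_RECENT_WAKEUPS cap from the patch.
>      */
>     #include <stdio.h>
> 
>     #define HZ 1000                           /* example CONFIG_HZ */
>     #define TEO_MAX_RECENT_WAKEUPS (300 * HZ) /* as in the patch */
> 
>     static double decay_interval_sec(double wakeups_per_sec)
>     {
>             /* counters only decay once this many wakeups accumulate */
>             return TEO_MAX_RECENT_WAKEUPS / wakeups_per_sec;
>     }
> 
>     int main(void)
>     {
>             /* design assumption: an idle interval every second tick */
>             printf("design case (%d/s): %.0f s (~10 min)\n",
>                    HZ / 2, decay_interval_sec(HZ / 2.0));
>             /* mostly idle: 2 idle entries per 4 second watchdog event */
>             printf("idle system (0.5/s): %.2f days\n",
>                    decay_interval_sec(0.5) / 86400);
>             /* the lighter measured rate mentioned above */
>             printf("light load (1.4/s): %.2f days\n",
>                    decay_interval_sec(1.4) / 86400);
>             return 0;
>     }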
>
> Second, it leads to unpredictable behaviour, sometimes for a long time, until
> the effects of some previous work are completely flushed. And from the first
> point above, that previous work might have been days ago. In my case, while
> doing this work, it resulted in non-repeatable tests and confusion
> for a while. Decay events are basically asynchronous to the actual tasks being
> executed. To support what I am saying with data, I did the following:
> Do a bunch of times {
>     Start the powernightmare sweep test.
>     Abort after several seconds (enough time to flush the filters
>         and prefer idle state 0).
>     Wait a random amount of time.
>     Start a very light workload, but such that the sleep time
>         per work cycle is less than one tick (roughly the kind of
>         load sketched below).
>     Observe how long it takes until idle state 0 is no longer
>         excessively selected; anywhere from 0 to 17 minutes (the
>         maximum length of the test) was observed.
> }
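> 
> Not my actual load generator, but a rough sketch of the shape of that
> "very light workload" (the 100/700 microsecond split is made up; the
> point is only that the sleep per cycle stays under one tick):
> 
>     /*
>      * A little work, then a sleep shorter than one tick (1 ms at
>      * 1000 HZ), repeated forever.  Illustrative values only.
>      */
>     #include <time.h>
>     #include <unistd.h>
> 
>     static void burn_usecs(long us)
>     {
>             struct timespec start, now;
> 
>             clock_gettime(CLOCK_MONOTONIC, &start);
>             do {
>                     clock_gettime(CLOCK_MONOTONIC, &now);
>             } while ((now.tv_sec - start.tv_sec) * 1000000L +
>                      (now.tv_nsec - start.tv_nsec) / 1000L < us);
>     }
> 
>     int main(void)
>     {
>             for (;;) {
>                     burn_usecs(100);        /* a bit of work ... */
>                     usleep(700);            /* ... sub-tick sleep */
>             }
>     }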
>
> Additional information:
>
> Periodic workflow: I am having difficulty understanding an unexpectedly high
> number of idle entries/exits in steady state (i.e. once things have settled
> down and the filters have finally flushed). For example, a 60% work / 40% sleep
> workflow at 500 hertz seems to have an extra idle entry/exit per period. Trace
> excerpt (edited; the first column is uSeconds since the previous event):
>
> 140 cpu_idle: state=4294967295 cpu_id=7
> 1152 cpu_idle: state=4 cpu_id=7 <<<< The expected ~1200 uSecs of work
> 690 cpu_idle: state=4294967295 cpu_id=7 <<<< Unexpected, Expected ~800 uSecs
> 18 cpu_idle: state=2 cpu_id=7 <<<< So this extra idle makes up the difference
> 138 cpu_idle: state=4294967295 cpu_id=7 <<<< But why is it there?
> 1152 cpu_idle: state=4 cpu_id=7 <<<< Repeat
> 690 cpu_idle: state=4294967295 cpu_id=7
> 13 cpu_idle: state=2 cpu_id=7
> 143 cpu_idle: state=4294967295 cpu_id=7
> 1152 cpu_idle: state=4 cpu_id=7 <<<< Repeat
> 689 cpu_idle: state=4294967295 cpu_id=7
> 19 cpu_idle: state=2 cpu_id=7
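> 
> (For reference, those four deltas sum to the full 2000 uSec period at
> 500 Hz: 1152 + 690 + 18 + 140 = 2000 uSecs, i.e. the expected ~800 uSecs
> of idle gets split into ~690 uSecs in state 4 plus ~140 uSecs in state 2,
> with the ~18 uSec blip of work in between.)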
>
> Now compare with trace data for kernel 4.19-rc6 with the 9 patches
> from the other day (which is what I expect to see):
>
> 846 cpu_idle: state=4294967295 cpu_id=7
> 1150 cpu_idle: state=4 cpu_id=7 <<<< The expected ~1200 uSecs of work
> 848 cpu_idle: state=4294967295 cpu_id=7 <<<< The expected ~800 uSecs of idle
> 1152 cpu_idle: state=4 cpu_id=7 <<<< Repeat
> 848 cpu_idle: state=4294967295 cpu_id=7
> 1151 cpu_idle: state=4 cpu_id=7 <<<< Repeat
> 848 cpu_idle: state=4294967295 cpu_id=7
> 1152 cpu_idle: state=4 cpu_id=7 <<<< Repeat
> 848 cpu_idle: state=4294967295 cpu_id=7
> 1152 cpu_idle: state=4 cpu_id=7 <<<< Repeat
>
> Anyway, in the end we really only care about power. So for this test:
> Kernel 4.19-rc6 + 9 patches: 9.133 watts
> TEO (on top of 4.19-rc7):
> At the start, with a high number of idle state 0 entries: 11.33 watts (+24%)
> After a while, it shifted to idle state 1: 10.00 watts (+9.5%)
> After a while, it shifted to idle state 2: 9.67 watts (+5.9%)
> That finally seemed to be a steady-state scenario (at least for over 2 hours).
> Note: it was always using idle state 4 also.
>
> ...[snip]...
>
> > + /* Decay past events information. */
> > + for (i = 0; i < drv->state_count; i++) {
> > + cpu_data->states[i].early_wakeups_old += cpu_data->states[i].early_wakeups;
> > + cpu_data->states[i].early_wakeups_old /= 2;
> > + cpu_data->states[i].early_wakeups = 0;
> > +
> > + cpu_data->states[i].hits_old += cpu_data->states[i].hits;
> > + cpu_data->states[i].hits_old /= 2;
> > + cpu_data->states[i].hits = 0;
>
> I wonder if this decay rate is strong enough.
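> 
> For what it is worth, a toy model of that decay (not the governor code):
> each decay event does old = (old + recent) / 2, so a one-off burst only
> halves per event, and with decay events potentially days apart it can
> linger for a long time:
> 
>     #include <stdio.h>
> 
>     int main(void)
>     {
>             unsigned int old = 0, recent = 10000;   /* made-up burst */
>             int event;
> 
>             for (event = 1; event <= 8; event++) {
>                     old = (old + recent) / 2;       /* as in the patch */
>                     recent = 0;                     /* nothing new after */
>                     printf("after decay %d: old = %u\n", event, old);
>             }
>             return 0;
>     }
> 
> Eight halvings still leave ~39 of the original 10000, and at one decay
> event every few days that is weeks of lingering influence.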
>
> Hope this helps.
Yes, it does, thank you!
Cheers,
Rafael