Re: [RFT][PATCH v7 6/8] sched: idle: Select idle state before stopping the tick

From: Thomas Ilsche
Date: Wed Mar 28 2018 - 04:39:26 EST


On 2018-03-28 10:13, Rafael J. Wysocki wrote:
On Wed, Mar 28, 2018 at 12:10 AM, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
On Tuesday, March 27, 2018 11:50:02 PM CEST Thomas Ilsche wrote:
On 2018-03-20 16:45, Rafael J. Wysocki wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, reorder the
code in cpuidle_idle_call() so that the governor idle state selection
runs before tick_nohz_idle_go_idle() and use the "nohz" hint returned
by cpuidle_select() to decide whether or not to stop the tick.

This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_enter(). Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.
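
For illustration, here is a rough sketch of the reordered flow described
above. This is not the actual patch: the names cpuidle_select(),
cpuidle_enter(), tick_nohz_idle_stop_tick() and
tick_nohz_idle_retain_tick() are taken from what this series eventually
became in mainline, and error handling, s2idle and the polling fallback
are left out.

static void cpuidle_idle_call_sketch(struct cpuidle_driver *drv,
                                     struct cpuidle_device *dev)
{
        bool stop_tick = true;
        int next_state;

        /*
         * The governor now runs while the tick is still active and
         * returns, besides a state index, a hint saying whether it is
         * worth stopping the tick for the predicted idle duration.
         */
        next_state = cpuidle_select(drv, dev, &stop_tick);

        if (stop_tick)
                tick_nohz_idle_stop_tick();
        else
                tick_nohz_idle_retain_tick();

        cpuidle_enter(drv, dev, next_state);
}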

I think something is wrong with the new tick_nohz_get_sleep_length.
It seems to return a value that is too large, ignoring imminent
non-sched timers.

That's a very useful hint, let me have a look.

I tested idle-loop-v7.3. It looks very similar to my previous results
on the first idle-loop-git-version [1]. Idle and traditional synthetic
powernightmares are mostly good.

OK

But it selects C-states that are too deep for short idle periods, which
is bad for power consumption [2].

That still needs to be improved, then.

I tracked this down with additional tests using
__attribute__((optimize("O0"))) on menu_select
and perf probe. With this, the behavior seems slightly different, but it
shows that data->next_timer_us is:
- v4.16-rc6: the expected ~500 us [3]
- idle-loop-v7.3: many milliseconds to minutes [4]
This leads the governor to wrongly select C6.
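
To make the annotation concrete, its effect can be illustrated with a
small userspace example (illustrative only; in the actual test the
attribute goes on menu_select() in the kernel so that values such as
data->next_timer_us stay visible to perf probe):

#include <stdio.h>

/*
 * Forcing -O0 for a single function keeps its local variables (and
 * their debug info) around even in an otherwise optimized build,
 * which is what makes per-call values observable with perf probe
 * when the same attribute is put on menu_select().
 */
static int __attribute__((optimize("O0"))) compute(int x)
{
        int tmp = x * 2;

        return tmp + 1;
}

int main(void)
{
        printf("%d\n", compute(20));
        return 0;
}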

Checking with 372be9e and 6ea0577, I can confirm that the change is
introduced by this patch.

Yes, that's where the most intrusive reordering happens.

Overall, this is an interesting conundrum, because the case in
question is one in which the tick should never be stopped at all during
the workload, and the code's behavior in that case should not change,
so the observed change was not intentional.

Now, from walking through the code, as long as can_stop_idle_tick()
returns 'true', all should be fine, or at least I don't see why there
would be any difference in behavior in that case.

However, if can_stop_idle_tick() returns 'false' (for example, because
need_resched() returns 'true' when it is evaluated), the behavior *is*
different in a couple of ways. I sort of know how that can be
addressed, but I'd like to reproduce your results here.

Are you still using the same workload as before to trigger this behavior?


Yes, the exact code I use is as follows:

$ gcc poller.c -O3 -fopenmp -o poller_omp
$ GOMP_CPU_AFFINITY=0-35 ./poller_omp 500

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        /* sleep interval per wakeup in microseconds, 10 ms by default */
        int sleep_us = 10000;
        if (argc == 2) {
                sleep_us = atoi(argv[1]);
        }

        /* one thread per CPU (via GOMP_CPU_AFFINITY), each waking up
         * every sleep_us and going idle again right away */
        #pragma omp parallel
        {
                while (1) {
                        usleep(sleep_us);
                }
        }
}