RE: schedutil issue with serial workloads
From: Doug Smythies
Date: Sun Jun 07 2020 - 13:24:36 EST
On 2020.06.05 Rafael J. Wysocki wrote:
> On 6/4/2020 11:29 PM, Alexander Monakov wrote:
> > Hello,
>
> Hi,
>
> Let's make more people see your report.
>
> +Peter, Giovanni, Quentin, Juri, Valentin, Vincent, Doug, and linux-pm.
>
>> this is a question/bugreport about behavior of schedutil on serial workloads
>> such as rsync, or './configure', or 'make install'. These workloads are
>> such that there's no single task that takes a substantial portion of CPU
>> time, but at any moment there's at least one runnable task, and overall
>> the workload is compute-bound. To run the workload efficiently, cpufreq
>> governor should select a high frequency.
>>
>> Assume the system is idle except for the workload in question.
>>
>> Sadly, schedutil will select the lowest frequency, unless the workload is
>> confined to one core with taskset (in which case it will select the
>> highest frequency, correctly though somewhat paradoxically).
>
> That's because the CPU utilization generated by the workload on all CPUs
> is small.
>
> Confining it to one CPU causes the utilization of this one to grow and
> so schedutil selects a higher frequency for it.
>
>> This sounds like it should be a known problem, but I couldn't find any
>> mention of it in the documentation.
Yes, this issue is very well known, and has been discussed on this list
several times, going back many years (and I likely missed some of the
discussions). In recent years Giovanni's git "make test" has
been the "go-to" example for this. From that test, which has run-to-run
variability due to disk I/O, I made a test that varies PIDs per second
versus time. Giovanni's recent work on frequency invariance made a huge
difference for the schedutil response to this type of serialized workflow.
For my part of it:
I only ever focused on a new PID per work packet serialized workflow.
Since my last testing on this subject in January, I fell behind with
system issues and infrastructure updates.
Your workflow example is fascinating and rather revealing.
I will make use of it moving forward. Thank you.
Yes, schedutil basically responds poorly, as it did for PIDs/second
based workflows before frequency invariance, but...(digression follows)...
Typically, I merely set the performance governor whenever I know
I will be doing serialized workflow, or whenever I just want the
job done the fastest (e.g. a kernel compile).
If I use performance mode (HWP disabled, intel_pstate either active or
passive, it doesn't matter), then I cannot get the CPU frequency to max,
even if I set:
$ grep . /sys/devices/system/cpu/intel_pstate/m??_perf_pct
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:100
I have to increase EPB all the way to 1 to get to max CPU frequency.
There also is extreme hysteresis, as I have to go back to 9 for
the frequency to drop again.
The above was an i5-9600K. My much older i7-2600K works fine
with the default EPB of 6. I had not previously realized there was
so much difference between processors and EPB.
I don't have time to dig deeper right now, but will in future.
>> I was able to replicate the effect with a pair of 'ping-pong' programs
>> that get a token, burn some cycles to simulate work, and pass the token.
>> Thus, each program has 50% CPU utilization. To repeat my test:
>>
>> gcc -O2 pingpong.c -o pingpong
>> mkfifo ping
>> mkfifo pong
>> taskset -c 0 ./pingpong 1000000 < ping > pong &
>> taskset -c 1 ./pingpong 1000000 < pong > ping &
>> echo > ping
>>
>> #include <stdio.h>
>> #include <unistd.h>
>> int main(int argc, char *argv[])
>> {
>> unsigned i, n;
>> sscanf(argv[1], "%u", &n);
>> for (;;) {
>> char c;
>> read(0, &c, 1);
>> for (i = n; i; i--)
>> asm("" :: "r"(i));
>> write(1, &c, 1);
>> }
>> }
>>
>> Alexander
It was not obvious to me what the approximate work/sleep frequency would be for
your workflow. For my version of it I made the loop slower on purpose,
because I could merely adjust "N" to compensate. I measured 100 hertz work/sleep
frequency per CPU, but my pipeline is 6 stages instead of 2.
Just for the record, this is what I did:
doug@s18:~/c$ cat pingpong.c
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    unsigned i, n, k;

    sscanf(argv[1], "%u", &n);
    while (1) {
        char c;

        /* wait for the token */
        read(0, &c, 1);
        /* burn some cycles to simulate work */
        for (i = n; i; i--) {
            k = i;
            k++;
        }
        /* pass the token on */
        write(1, &c, 1);
    }
}
Compiled with:
cc pingpong.c -o pingpong
and run with (on purpose, I did not force CPU affinity,
as I wanted schedutil to decide (when it was the
governor, at least)):
#! /bin/dash
#
# ping-pong-test Smythies 2020.06.06
# serialized workflow, but same PID.
# from Alexander, but modified.
#
# because I always forget from last time
killall pingpong
rm --force pong1
rm --force pong2
rm --force pong3
rm --force pong4
rm --force pong5
rm --force pong6
mkfifo pong1
mkfifo pong2
mkfifo pong3
mkfifo pong4
mkfifo pong5
mkfifo pong6
~/c/pingpong 1000000 < pong1 > pong2 &
~/c/pingpong 1000000 < pong2 > pong3 &
~/c/pingpong 1000000 < pong3 > pong4 &
~/c/pingpong 1000000 < pong4 > pong5 &
~/c/pingpong 1000000 < pong5 > pong6 &
~/c/pingpong 1000000 < pong6 > pong1 &
echo > pong1
To measure work/sleep frequency, I made a
version that would only run, say, 10,000 times
and timed it.
... Doug