Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

From: Vincent Guittot
Date: Mon Mar 20 2017 - 09:30:07 EST


On 20 March 2017 at 13:59, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote:
>> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@xxxxxxxxxx> wrote:
>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> >> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>> >>
>> >> The PELT metric used by the schedutil governor underestimates the
>> >> CPU utilization in some cases. The reason for that may be time spent
>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>
>> Are you sure of the root cause described above (time stolen by irq
>> handler) or is it just a hypotheses ? That would be good to be sure of
>> the root cause
>
> No, I'm not sure. That's why I said "may be". :-)
>
>> Furthermore, IIRC the time spent in irq context is also accounted as
>> run time for the running cfs task but not RT and deadline task running
>> time
>
> OK
>
> Anyway, the problem is that we have a 100% busy CPU which quite evidently
> is not reported as 100% busy by the metric we use.


>
> Now, if I was sure about the root cause, I would fix it rather than suggest
> workarounds.
>
>> So I'm not really aligned with the description of your problem: PELT
>> metric underestimates the load of the CPU. The PELT is just about
>> tracking CFS task utilization but not whole CPU utilization and
>> according to your description of the problem (time stolen by irq),
>> your problem doesn't come from an underestimation of CFS task but from
>> time spent in something else but not accounted in the value used by
>> schedutil
>
> That's fair enough, but the assumption was that PELT would be sufficient for
> that. To me, it clearly isn't, so the assumption was incorrect.

So PELT is just about CFS utilization and we need metrics for other
sched class as well
I know that Juri and other people are working on the dealine part

In current kernel, the only available metric is rt_avg (and
scale_rt_capacity) which gives you the available compute capacity for
CFS and in some way the amount of capacity used by RT and deadline.
That would be interesting to see if the amount of capacity used by RT
tasks (i assume that you don't have deadline task in your use case)
is similar to the amount of underestimation of CFS

>
>> That would be good to be sure of what is running during this not
>> accounted time and find a way to account them
>
> Yes, I agree.
>
> I'm not sure if I can carry out full investigation of that any time soon, however.

If you can share traces as Patrick proposed, we could estimate how
much time/capacity is used by rt tasks and if this can explain the
underestimation.

>
> I sent this mostly to let everybody know that there was a problem and how it
> could be worked around. That's why it is an RFC.

Ok

>
>>
>> >>
>> >> That can be easily demonstrated by running kernel compilation on
>> >> a Sandy Bridge Intel processor, running turbostat in parallel with
>> >> it and looking at the values written to the MSR_IA32_PERF_CTL
>> >> register. Namely, the expected result would be that when all CPUs
>> >> were 100% busy, all of them would be requested to run in the maximum
>> >> P-state, but observation shows that this clearly isn't the case.
>> >> The CPUs run in the maximum P-state for a while and then are
>> >> requested to run slower and go back to the maximum P-state after
>> >> a while again. That causes the actual frequency of the processor to
>> >> visibly oscillate below the sustainable maximum in a jittery fashion
>> >> which clearly is not desirable.
>> >>
>> >> To work around this issue use the observation that, from the
>> >> schedutil governor's perspective, CPUs that are never idle should
>> >> always run at the maximum frequency and make that happen.
>> >>
>> >> To that end, add a counter of idle calls to struct sugov_cpu and
>> >> modify cpuidle_idle_call() to increment that counter every time it
>> >> is about to put the given CPU into an idle state. Next, make the
>> >> schedutil governor look at that counter for the current CPU every
>> >> time before it is about to start heavy computations. If the counter
>> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
>> >> the CPU has not been idle for at least that long and the governor
>> >> will choose the maximum frequency for it without looking at the PELT
>> >> metric at all.
>> >
>> > Looks like we are fixing a PELT problem with a schedutil Hack :)
>>
>> I would not say that it's a PELT problem (at least based on current
>> description) but more that we don't track all
>> activities of CPU
>
> I generally agree. PELT does what it does.
>
> However, using PELT the way we do that in schedutil turns out to be problematic.

yes. PELT is not enough but just part of the equation

>
> Thanks,
> Rafael
>