Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

From: Vincent Guittot
Date: Thu Mar 23 2017 - 18:08:47 EST


On 23 March 2017 at 00:56, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
> On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi
> <patrick.bellasi@xxxxxxx> wrote:
>> On 20-Mar 09:26, Vincent Guittot wrote:
>>> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@xxxxxxxxxx> wrote:
>>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>>> >> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>>> >>
>>> >> The PELT metric used by the schedutil governor underestimates the
>>> >> CPU utilization in some cases. The reason for that may be time spent
>>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>>
>>> Are you sure of the root cause described above (time stolen by irq
>>> handler) or is it just a hypotheses ? That would be good to be sure of
>>> the root cause
>>> Furthermore, IIRC the time spent in irq context is also accounted as
>>> run time for the running cfs task but not RT and deadline task running
>>> time
>>
>> As long as the IRQ processing does not generate a context switch,
>> which is happening (eventually) if the top half schedule some deferred
>> work to be executed by a bottom half.
>>
>> Thus, me too I would say that all the top half time is accounted in
>> PELT, since the current task is still RUNNABLE/RUNNING.
>
> Sorry if I'm missing something but doesn't this depend on whether you
> have CONFIG_IRQ_TIME_ACCOUNTING enabled?
>
> __update_load_avg uses rq->clock_task for deltas which I think
> shouldn't account IRQ time with that config option. So it should be
> quite possible for IRQ time spent to reduce the PELT signal right?
>
>>
>>> So I'm not really aligned with the description of your problem: PELT
>>> metric underestimates the load of the CPU. The PELT is just about
>>> tracking CFS task utilization but not whole CPU utilization and
>>> according to your description of the problem (time stolen by irq),
>>> your problem doesn't come from an underestimation of CFS task but from
>>> time spent in something else but not accounted in the value used by
>>> schedutil
>>
>> Quite likely. Indeed, it can really be that the CFS task is preempted
>> because of some RT activity generated by the IRQ handler.
>>
>> More in general, I've also noticed many suboptimal freq switches when
>> RT tasks interleave with CFS ones, because of:
>> - relatively long down _and up_ throttling times
>> - the way schedutil's flags are tracked and updated
>> - the callsites from where we call schedutil updates
>>
>> For example it can really happen that we are running at the highest
>> OPP because of some RT activity. Then we switch back to a relatively
>> low utilization CFS workload and then:
>> 1. a tick happens which produces a frequency drop
>
> Any idea why this frequency drop would happen? Say a running CFS task
> gets preempted by RT task, the PELT signal shouldn't drop for the
> duration the CFS task is preempted because the task is runnable, so

utilization only tracks the running state but not runnable state.
Runnable state is tracked in load_avg

> once the CFS task gets CPU back, schedutil should still maintain the
> capacity right?
>
> Regards,
> Joel