Re: [RFC 08/14] sched/tune: add detailed documentation

From: Steve Muckle
Date: Mon Sep 14 2015 - 16:01:21 EST


Hi Patrick,

On 09/11/2015 04:09 AM, Patrick Bellasi wrote:
>> It's also worth noting that mobile vendors typically add all sorts of
>> hacks on top of the existing cpufreq governors which further complicate
>> policy.
>
> Could it be that many of the hacks introduced by vendors are just
> there to implement a kind of "scenario based" tuning of governors?
> I mean, depending on the specific use-case they try to refine the
> value of exposed tunables to improve either performance,
> responsiveness or power consumption?

From what I've seen, I think it's both scenario-based tuning (adding
functionality to detect and improve power/perf for, say, web browsing or
mp3 playback use cases specifically) and tailoring of general-case
behavior. Some of these are actually new features in the governor,
though, as opposed to just tweaks of existing tunables.

> If this is the case, it means that the currently available governors
> are missing an important bit of information: what are the best
> tunables values for a specific (set of) tasks?

Agreed, though I think those tunable values might also change for a
given set of tasks in different circumstances.

>
>> The current proposal:
>>
>> * sched-dvfs/schedtune: Event driven, CPU usage calculated using
>> exponential moving average. AFAICS tries to maintain some % of idle
>> headroom, but if that headroom doesn't exist at task_tick_fair(), goes
>> to max frequency. Schedtune provides a way to boost/inflate the demand
>> of individual tasks or overall system demand.
>
> That's quite a good description. One small correction is that, at
> least in the implementation presented by this RFC, SchedTune is not
> boosting individual tasks but just the CPU usage.
> The link with tasks is just that SchedTune knows how much to boost a
> CPU usage by keeping track of which tasks are runnable on that CPU.
> However, the utilization signal of each task is not actually modified
> from the scheduler standpoint.

Ah yes, I see what you mean. I was thinking of the cgroup stuff, but I
see that the maximum per-task boost is tracked per CPU and that CPU's
aggregate usage is boosted accordingly.
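
To make sure we mean the same thing, here is a rough sketch of that
aggregation. It is not code from the RFC: the struct, the function
names, the 1024 capacity scale and the "share of the remaining
headroom" margin formula are all assumptions made for illustration.

```c
#include <stddef.h>

#define CAPACITY_SCALE 1024UL   /* assumed full-scale CPU capacity */

/* Hypothetical per-task view: only the boost percentage matters here. */
struct task_boost {
    int boost_pct;   /* 0..100, e.g. inherited from the task's cgroup */
    int runnable;    /* non-zero if the task is runnable on this CPU */
};

/* Largest boost among the tasks runnable on one CPU (0 if none). */
static int cpu_max_boost(const struct task_boost *tasks, size_t n)
{
    int max = 0;

    for (size_t i = 0; i < n; i++)
        if (tasks[i].runnable && tasks[i].boost_pct > max)
            max = tasks[i].boost_pct;
    return max;
}

/*
 * Boost the CPU's aggregate usage, leaving per-task signals untouched.
 * Assumed margin: boost_pct percent of the headroom left above usage.
 */
static unsigned long boosted_cpu_usage(unsigned long usage, int boost_pct)
{
    unsigned long headroom;

    if (usage >= CAPACITY_SCALE)
        return CAPACITY_SCALE;
    headroom = CAPACITY_SCALE - usage;
    return usage + headroom * (unsigned long)boost_pct / 100;
}
```

With these assumptions, a CPU at half capacity running a task boosted by
50% would be treated as three quarters busy, while the task's own
utilization signal stays as the scheduler computed it.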

>> This looks a bit like ondemand to me but without the
>> sampling_down_factor functionality and using per-entity load tracking
>> instead of a simple window-based aggregate CPU usage.
>
> I agree in principle.
> An important difference worth noting is that we use an "event
> based" approach. This means that an enqueue/dequeue can trigger
> an immediate OPP change.
> If you consider that ondemand commonly uses a 20ms sample rate while
> an OPP switch most likely never requires more than 1 or 2 ms, this
> means that sched-DVFS can be much more reactive in adapting to
> variable loads.

"Can be" are the important words to me there... it'd be nice to be able
to control that. Aggressive frequency changes may not be desirable for
power or performance, even if the transition can be quickly completed.
The configuration values of min_sample_time and above_hispeed_delay in
the interactive governor on some recent devices may give clues as to
whether latency is being intentionally increased on various platforms.
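
As a sketch of the kind of control I mean, the filter below holds
frequency requests back in the spirit of those two interactive
tunables. The struct, the function and the exact comparison rules are
hypothetical; this is neither the interactive governor's code nor
anything in the RFC.

```c
#include <stdint.h>

/* Hypothetical per-CPU state for filtering event-driven requests. */
struct freq_filter {
    unsigned int cur_khz;             /* frequency currently applied */
    unsigned int hispeed_khz;         /* threshold for the ramp-up delay */
    uint64_t last_change_us;          /* time of the last applied change */
    uint64_t min_sample_time_us;      /* hold time before ramping down */
    uint64_t above_hispeed_delay_us;  /* hold time before rising above hispeed */
};

/* Return the frequency to apply for a request made at now_us. */
static unsigned int filter_request(struct freq_filter *f,
                                   unsigned int want_khz, uint64_t now_us)
{
    uint64_t held = now_us - f->last_change_us;

    /* Do not ramp down until the current frequency was held long enough. */
    if (want_khz < f->cur_khz && held < f->min_sample_time_us)
        return f->cur_khz;

    /* At or above hispeed, delay further increases. */
    if (want_khz > f->cur_khz && f->cur_khz >= f->hispeed_khz &&
        held < f->above_hispeed_delay_us)
        return f->cur_khz;

    if (want_khz != f->cur_khz) {
        f->cur_khz = want_khz;
        f->last_change_us = now_us;
    }
    return f->cur_khz;
}
```

Setting both hold times to zero gives the fully event-driven behavior,
so a filter like this would leave the reactiveness up to the platform.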

The latency/reactiveness of CPU frequency changes is also IMO a product
of two things - the CPUfreq/sched-dvfs policy, and the task load
tracking algorithm. I don't have enough experience with the mainline
task load tracking algorithm yet to know how it will compare with the
window-based aggregate CPU usage metric used by mainline cpufreq
governors. But I would imagine it will smooth out some of the aggressive
nature of sched-dvfs' event-driven approach. The hardcoded values in the
task load tracking algorithm seem concerning though from a tuning
standpoint.
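
To illustrate the smoothing I have in mind, here is a simplified
comparison of the two metrics. The per-period floating point form and
both function names are my own simplification (the kernel uses fixed
point arithmetic); the one property taken from the load tracking code
is the hardcoded decay factor y with y^32 = 0.5 over roughly 1ms
periods.

```c
/* Decay factor y chosen so that y^32 is about 0.5 (hardcoded half-life). */
#define PELT_Y 0.97857206

/* One period of a PELT-like geometric average, normalized to 0..1. */
static double pelt_step(double avg, int running)
{
    return avg * PELT_Y + (running ? (1.0 - PELT_Y) : 0.0);
}

/* Window-based usage: busy fraction over the last 'window' periods. */
static double window_usage(const int *running, int n, int window)
{
    int start = n > window ? n - window : 0;
    int len = n - start;
    int busy = 0;

    for (int i = start; i < n; i++)
        busy += running[i] ? 1 : 0;
    return len > 0 ? (double)busy / len : 0.0;
}
```

Starting from idle, a task that runs flat out for 32 periods reaches
only about half of full scale in the geometric average, while a
20-period window already reports it as fully busy. Changing that ramp
rate means changing the hardcoded factor.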

>> The interactive functionality would require additional knobs. I
...
> However, regarding specifically the latency on OPP changes, there are
> a couple of extensions we were thinking about:
> 1. link the SchedTune boost value with the % of idle headroom which
> triggers an OPP increase
> 2. use the SchedTune boost value to define the high frequency to jump
> to when a CPU crosses the % of idle headroom

Hmmm... This may be useful (only testing/profiling would tell) though it
may be nice to be able to tune these values.

> These are tunables which allow parameterizing the way the PELT
> signal for CPU usage is interpreted by the sched-DVFS governor.
>
> How such tunables should be exposed and tuned is to be discussed.
> Indeed, one of the main goals of sched-DVFS, and of SchedTune
> specifically, is to simplify the tuning of a platform by exposing to
> userspace a reduced number of tunables, preferably just one.

This last point (the desire for a single tunable) is perhaps at the root
of my main concern. There are users/vendors for whom the current
tunables are insufficient, resulting in their hacking the governors to
add more tunables or features in the policy.

Consolidating CPU frequency and idle management in the scheduler will
clean things up and probably make things more effective, but I don't
think it will remove the need for a highly configurable policy.

I'm curious about the drive for one tunable. Is that something there's
specifically been a broad call for? Don't get me wrong, I'm all for
simplification and cleanup, if the flexibility and used features can be
retained.

>> A separate but related concern - in the (IMO likely, given the above)
>> case that folks want to tinker with that policy, it now means they're
>> hacking the scheduler as opposed to a self-contained frequency policy
>> plugin.
>
> I do not agree on that point. SchedTune, as well as sched-DVFS, are
> frameworks quite well separated from the scheduler.
> They are "consumers" of signals usually used by the scheduler, but
> they are not directly affecting scheduler decisions (at least in the
> implementation proposed by this RFC).

Agreed it's not affecting scheduler decision making (not directly). It's
more just the mixing of the policy into the same code, as margin is
added in enqueue_task_fair()/task_tick_fair() etc. That one in
particular would probably be easy to solve. A more difficult one is if
someone wants to make adjustments to the load tracking algorithm because
it is driving CPU frequency.

> Side effects are possible, of course. For example the selection of an
...
> However, one of the main goals of this proposal is to respond to a
> couple of long-standing demands (e.g. [1,2]) for:
> 1. a better integration of CPUFreq with the scheduler, which has all
> the required knowledge about workload demands to target both
> performance and energy efficiency
> 2. a simple approach to configure a system to care more about
> performance or energy-efficiency
>
> SchedTune addresses mainly the second point. Once SchedTune is
> integrated with EAS it will provide support for deciding, in an
> energy-efficient way, how much we want to reduce power or boost
> performance.

The provided links definitely establish the need for (1) but I am still
wondering about the motivation for (2), because I don't think it's going
to be possible to boil everything down to a single slider tunable
without losing flexibility/functionality.

cheers,
Steve
