Re: [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler

From: Nicolas Pitre
Date: Mon Aug 18 2014 - 14:25:16 EST


On Mon, 18 Aug 2014, Preeti U Murthy wrote:

> On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
> > On Mon, 11 Aug 2014, Preeti U Murthy wrote:
> >
> >> The goal of the power aware scheduling design is to integrate all
> >> policy, metrics and averaging into the scheduler. Today the
> >> cpu power management is fragmented and hence inconsistent.
> >>
> >> As a first step towards this integration, rid the cpuidle state management
> >> of the governors. Retain only the cpuidle driver in the cpu idle
> >> subsystem which acts as an interface between the scheduler and low
> >> level platform specific cpuidle drivers. For all decision making around
> >> selection of idle states, the cpuidle driver falls back to the scheduler.
> >>
> >> The current algorithm for idle state selection is the same as the logic used
> >> by the menu governor. However going ahead the heuristics will be tuned and
> >> improved upon with metrics better known to the scheduler.
> >
> > I'd strongly suggest a different approach here. Instead of copying the
> > menu governor code and tweaking it afterwards, it would be cleaner to
> > literally start from scratch with a new governor. Said new governor
> > would grow inside the scheduler with more design freedom instead of
> > being strapped on the side.
> >
> > By copying existing code, the chance for cruft to remain for a long time
> > is close to 100%. We already have one copy of it, let's keep it working
> > and start afresh instead.
> >
> > By starting clean it is way easier to explain and justify additions to a
> > new design than convincing ourselves about the removal of no longer
> > needed pieces from a legacy design.
>
> Ok. The reason I did it this way was that I did not find anything
> grossly wrong in the current cpuidle governor algorithm. Of course this
> can be improved but I did not see strong reasons to completely wipe it
> away. I see good scope to improve upon the existing algorithm with
> additional knowledge of *the idle states being mapped to scheduling
> domains*. This will in itself give us a better algorithm and does not
> mandate significant changes from the current algorithm. So I really
> don't see why we need to start from scratch.

Sure, the current algorithm can be improved. But it has its limitations
by design. And simply making it more topology aware wouldn't justify
moving it into the scheduler.

What we're contemplating is something completely integrated with the
scheduler where cpuidle and cpufreq (and eventually thermal management)
together are part of the same "governor" to provide global decisions on
all fronts.

Not only should the next wake-up event be predicted, but also the
anticipated system load, etc. The scheduler may know that a given CPU
is unlikely to be used for a while and could call for the deepest
C-state right away without waiting for the current menu heuristic to
converge.
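
A minimal userspace sketch of that idea, assuming a simplified state
table. This is not kernel code: struct idle_state, pick_idle_state(),
sched_pick_idle_state() and the long_idle_hint flag are all made up for
illustration. The selection loop only mirrors the general shape of a
residency/latency check, and the hint stands in for whatever knowledge
the scheduler would actually have about the CPU staying unused.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified idle state (not the kernel's struct cpuidle_state). */
struct idle_state {
	unsigned int exit_latency_us;     /* worst-case wake-up latency */
	unsigned int target_residency_us; /* minimum stay for the state to pay off */
};

/*
 * Return the index of the deepest state whose target residency fits in
 * the predicted idle period and whose exit latency stays within the
 * latency limit. states[] is ordered from shallowest to deepest, and
 * state 0 is always an acceptable fallback.
 */
size_t pick_idle_state(const struct idle_state *states, size_t nr_states,
		       unsigned int predicted_idle_us,
		       unsigned int latency_limit_us)
{
	size_t i, best = 0;

	assert(nr_states > 0);
	for (i = 1; i < nr_states; i++) {
		if (states[i].target_residency_us > predicted_idle_us)
			continue;
		if (states[i].exit_latency_us > latency_limit_us)
			continue;
		best = i;
	}
	return best;
}

/*
 * Hypothetical scheduler-side entry point. When the scheduler expects
 * the CPU to stay unused for a while (long_idle_hint != 0), the
 * prediction is bypassed and only the latency limit still applies.
 */
size_t sched_pick_idle_state(const struct idle_state *states, size_t nr_states,
			     unsigned int predicted_idle_us,
			     unsigned int latency_limit_us,
			     int long_idle_hint)
{
	if (long_idle_hint)
		predicted_idle_us = UINT_MAX;
	return pick_idle_state(states, nr_states, predicted_idle_us,
			       latency_limit_us);
}
```

With a three-state table, a short predicted idle period normally keeps
the CPU in a shallow state, whereas the same call with the hint set
returns the deepest state the latency limit allows, without waiting
for a prediction to catch up.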

There is also Daniel's I/O latency tracking that could replace the menu
governor latency guessing, the latter being based on heuristics that
could be described as black magic.

And all this has to eventually be policed by a global performance/power
concern that should weigh C-states, P-states and task placement
together and select the best combination (Morten's work).

Therefore the current menu algorithm won't do it. It simply wasn't
designed for that.

We'll have the opportunity to discuss this further tomorrow anyway.

> The primary issue that I found was that with the goal being power aware
> scheduler we must ensure that the possibility of a governor getting
> registered with cpuidle to choose idle states no longer will exist. The
> reason being there is just *one entity who will take this decision and
> there is no option about it*. This patch intends to bring the focus to
> this specific detail.

I think there is nothing wrong with having multiple governors being
registered. We simply decide at runtime via sysfs which one has control
over the low-level cpuidle drivers.
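
A rough userspace model of that arrangement, for illustration only. In
the kernel, governors register through cpuidle_register_governor();
everything below (struct idle_governor, register_governor(),
switch_governor(), select_idle_state() and the two example governors)
is hypothetical, with switch_governor() standing in for the sysfs
write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical governor descriptor: a name and a state selection hook. */
struct idle_governor {
	const char *name;
	int (*select)(unsigned int predicted_idle_us);
};

#define MAX_GOVERNORS 4

static struct idle_governor *governors[MAX_GOVERNORS];
static size_t nr_governors;
static struct idle_governor *current_governor;

/* Register a governor; the first one registered becomes the current one. */
int register_governor(struct idle_governor *gov)
{
	size_t i;

	assert(gov && gov->name && gov->select);
	if (nr_governors == MAX_GOVERNORS)
		return -1;
	for (i = 0; i < nr_governors; i++)
		if (strcmp(governors[i]->name, gov->name) == 0)
			return -1; /* name already taken */
	governors[nr_governors++] = gov;
	if (!current_governor)
		current_governor = gov;
	return 0;
}

/* Runtime switch by name, modelling a write to a sysfs attribute. */
int switch_governor(const char *name)
{
	size_t i;

	for (i = 0; i < nr_governors; i++) {
		if (strcmp(governors[i]->name, name) == 0) {
			current_governor = governors[i];
			return 0;
		}
	}
	return -1; /* unknown name: keep the current governor */
}

/* The low-level driver side only ever asks the current governor. */
int select_idle_state(unsigned int predicted_idle_us)
{
	return current_governor ? current_governor->select(predicted_idle_us) : -1;
}

/* Two toy governors with deliberately different answers. */
static int legacy_select(unsigned int us) { return us < 1000 ? 0 : 1; }
static int sched_select(unsigned int us) { (void)us; return 2; }

struct idle_governor legacy_governor = { "legacy", legacy_select };
struct idle_governor sched_governor = { "sched", sched_select };
```

Both governors stay registered throughout; only the pointer to the
current one changes, so the existing governor keeps working while a
scheduler-integrated one is developed alongside it.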


Nicolas