Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal
From: Morten Rasmussen
Date: Wed Jul 10 2013 - 07:16:44 EST
On Tue, Jul 09, 2013 at 05:58:55PM +0100, Arjan van de Ven wrote:
> On 7/9/2013 8:55 AM, Morten Rasmussen wrote:
> > Hi,
> >
> > This patch set is an initial prototype aiming at the overall power-aware
> > scheduler design proposal that I previously described
> > <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
> >
> > The patch set introduces a cpu capacity managing 'power scheduler' which lives
> > by the side of the existing (process) scheduler. Its role is to monitor the
> > system load and decide which cpus should be available to the process
> > scheduler. Long term the power scheduler is intended to replace the currently
> > distributed, uncoordinated power management policies and will interface with a
> > unified platform-specific power driver to obtain power topology information and
> > handle idle and P-states. The power driver interface should be made flexible
> > enough to support multiple platforms, including Intel and ARM.
> >
> I quickly browsed through it but have a hard time seeing what the
> real interface is between the scheduler and the hardware driver.
> What information does the scheduler give the hardware driver exactly?
> e.g. what does it mean?
>
> If the interface is "go faster please" or "we need you to be at fastest now",
> that doesn't sound too bad.
> But if the interface is "you should be at THIS number" that is pretty bad and
> not going to work for us.
It is the former.
The current power driver interface (which is far from complete)
basically allows the power scheduler to get the current P-state and the
maximum available P-state, and to provide P-state change hints. The
current P-state is not the instantaneous P-state, but an average over
some period of time; averaging since the last query would work. (I
should have called it avg instead of curr.) Knowing that, and also the
maximum available P-state at that point in time (which may change over
time due to thermal or power budget constraints), allows the power
scheduler to reason about the spare capacity of the cpus and decide
whether a P-state change is enough or if the load must be spread across
more cpus.
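To give a rough idea of the shape of that interface, a minimal sketch
could look like the one below. The names are purely illustrative and do
not match the actual patches:

/*
 * Illustrative sketch only -- these names do not match the actual
 * patches. The power scheduler sees an abstract, cpu_power-like
 * P-state index per cpu plus a hint call that reports back what the
 * driver really selected.
 */
struct power_driver_ops {
	/* Average P-state since the last query, not the instantaneous one. */
	int (*get_avg_pstate)(int cpu);

	/*
	 * Highest P-state currently available; may change over time due
	 * to thermal or power budget constraints.
	 */
	int (*get_max_pstate)(int cpu);

	/*
	 * P-state change hint. The driver is free to pick something
	 * else and returns the P-state it actually selected.
	 */
	int (*request_pstate)(int cpu, int pstate);
};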
The P-state change request allows the power scheduler to ask the power
driver to go faster or slower. I was initially thinking about having a
simple up/down interface, but realized that it would not be sufficient
as the power driver wouldn't necessarily know how much it should go up or
down. When the cpu load is decreasing, the power scheduler should be able
to determine fairly accurately how much compute capacity is needed.
So I think it makes sense to pass this information to the power driver.
For some platforms the power driver may use the P-state hint directly to
choose the next P-state. The schedpower cpufreq wrapper governor is an
example of this. Others may have much more sophisticated power drivers
that take platform specific constraints into account and select whatever
P-state they like. The intention is that the P-state request will return
the actual P-state selected by the power driver so the power scheduler
can act accordingly.
The power driver interface uses a cpu_power-like P-state abstraction to
avoid dealing with frequencies in the power scheduler.
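Assuming the illustrative ops sketched above and cpu_power-like units,
the request/feedback loop could look roughly like this (again only a
sketch, the function name is hypothetical):

/*
 * Ask for the capacity we think we need, then act on what we were
 * actually given. Returns the capacity deficit that would have to be
 * covered by spreading load to more cpus (0 if the request was
 * satisfied).
 */
static int power_sched_set_capacity(struct power_driver_ops *ops, int cpu,
				    int wanted_capacity)
{
	int granted = ops->request_pstate(cpu, wanted_capacity);

	return granted < wanted_capacity ? wanted_capacity - granted : 0;
}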
>
> also, it almost looks like there is a fundamental assumption in the code
> that you can get the current effective P state to make scheduler decisions on;
> on Intel at least that is basically impossible... and getting more so with every generation
> (likewise for AMD afaics)
>
> (you can get what you ran at on average over some time in the past, but not
> what you're at now or going forward)
>
As described above, it is not a strict assumption. From a scheduler
point of view we somehow need to know whether the cpus are truly fully
utilized (at their highest P-state), in which case we need to throw
more cpus at the problem (assuming that we have more than one task per
cpu), or whether we can just go to a higher P-state. We don't need a
strict guarantee that we get exactly the P-state that we request for
each cpu. The power scheduler generates hints and the power driver
gives us feedback on what we can roughly expect to get.
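As a sketch of that reasoning, again assuming the illustrative ops and
cpu_power-like units from above (the helper name is hypothetical):

/*
 * A P-state hint can only help as long as the required capacity fits
 * under the highest P-state currently available.
 */
static int need_more_cpus(struct power_driver_ops *ops, int cpu,
			  int needed_capacity)
{
	/* May be lower than the hardware maximum due to thermal limits. */
	int max = ops->get_max_pstate(cpu);

	/*
	 * If the load does not fit even at the highest available
	 * P-state, the only option left is to wake up more cpus.
	 */
	return needed_capacity > max;
}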
> I'm rather nervous about calculating how many cores you want active as a core scheduler feature.
> I understand that for your big.LITTLE architecture you need this due to the asymmetry,
> but as a general rule for more symmetric systems it's known to be suboptimal by quite a
> real percentage. For a normal Intel single CPU system it's sort of the worst case you can do
> in that it leads to serializing tasks that could have run in parallel over multiple cores/threads.
> So at minimum this kind of logic must be enabled/disabled based on architecture decisions.
Packing clearly has to take power topology into account and do the right
thing for the particular platform. It is not in place yet, but will be
addressed. I believe it would make sense for dual cpu Intel systems to
pack at socket level? I fully understand that it won't make sense for
single cpu Intel systems or inside each cpu in a dual cpu Intel system.
For ARM it depends on the particular implementation. For big.LITTLE you
have two cpu clusters (big and little), which may have different
C-states. It may make sense to pack between clusters and inside one
cluster, but not the other. The power scheduler must be able to handle
this. The power driver should provide the necessary platform information
as part of the power topology.
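Purely as an illustration (none of these names exist in the patches),
the topology information could carry a per-level packing flag along
these lines:

/*
 * The power driver could tag each topology level with whether packing
 * makes sense there, e.g. pack at socket level on a dual socket Intel
 * system, or only within one cluster on a big.LITTLE system.
 */
struct power_topology_level {
	int level;		/* SMT, core, cluster/socket, ... */
	int allow_packing;	/* non-zero if packing helps at this level */
};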