On 06/07/2013 08:21 PM, Catalin Marinas wrote:
> Take the cpuidle example, it uses the load average of the CPUs,
> however this load average is currently controlled by the scheduler
> (load balance). Rather than using a load average that degrades over
> time and gradually putting the CPU into deeper sleep states, the
> scheduler could predict more accurately that a run-queue won't have
> any work over the next x ms and ask for a deeper sleep state from the

How will the scheduler know that there will not be work in the near
future? How will the scheduler ask for a deeper sleep state?
My answer to both questions is that the scheduler cannot know how
much work will come up. All it knows is the current load of the
runqueues and the nature of the tasks (thanks to PJT's metric). It
can then match the task load to the cpu capacity and schedule the tasks
on the appropriate cpus.
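
As a rough illustration of "match the task load to the cpu capacity", here
is a minimal sketch. It is not how the kernel load balancer is implemented;
place_tasks(), the task names, and the load/capacity numbers are all
hypothetical, and the greedy policy is only one possible choice.

```python
# Hypothetical sketch: names and numbers are illustrative, not kernel APIs.
def place_tasks(task_loads, cpu_capacity):
    """Greedily place each task on the CPU with the most spare capacity.

    task_loads:   dict mapping task name -> tracked load (assumed units)
    cpu_capacity: dict mapping cpu id -> capacity in the same units
    Returns (placement, cpu_load).
    """
    cpu_load = {cpu: 0 for cpu in cpu_capacity}
    placement = {}
    # Place heavier tasks first so they get the CPUs with the most room.
    for task, load in sorted(task_loads.items(), key=lambda kv: -kv[1]):
        cpu = max(cpu_capacity, key=lambda c: cpu_capacity[c] - cpu_load[c])
        cpu_load[cpu] += load
        placement[task] = cpu
    return placement, cpu_load


placement, cpu_load = place_tasks(
    {"a": 600, "b": 300, "c": 200},  # current task loads (made up)
    {0: 1024, 1: 512},               # per-cpu capacity (made up)
)
```

Note that the sketch only uses the current load; it makes no attempt to
predict future work, which is the point above.
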
I don't see what the problem is with the cpuidle governor waiting for
the load to degrade before putting that cpu to sleep. In my opinion,
putting a cpu into deeper sleep states should happen gradually.

> Of course, you could export more scheduler information to cpuidle,
> various hooks (task wakeup etc.) but then we have another framework,
> cpufreq. It also decides the CPU parameters (frequency) based on the
> load controlled by the scheduler. Can cpufreq decide whether it's
> better to keep the CPU at higher frequency so that it gets to idle
> quicker and therefore deeper sleep states? I don't think it has enough
> information because there are at least three deciding factors
> (cpufreq, cpuidle and scheduler's load balancing) which are not

Why not? When the cpu load is high, the cpu frequency governor knows it
has to boost the frequency of that CPU. The task finishes quickly and the
CPU goes idle. Then the cpuidle governor kicks in to put the CPU into
deeper sleep states gradually.
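
To make "gradually" concrete, here is a minimal sketch of a decaying load
average driving the idle state choice. It assumes a geometric decay where
the load halves every 32 periods; the thresholds, the state names
(C1/C2/C3), and the functions pick_idle_state() and idle_trace() are
hypothetical and do not correspond to any real cpuidle governor.

```python
# Hypothetical sketch: thresholds and state names are made up.
DECAY = 0.5 ** (1 / 32)  # per-period factor: load halves every 32 periods

# (minimum load, idle state) pairs, shallowest state first.
IDLE_STATES = [(512, "C1"), (128, "C2"), (0, "C3")]


def pick_idle_state(load):
    """Return the shallowest state whose load threshold is still met."""
    for threshold, state in IDLE_STATES:
        if load >= threshold:
            return state
    return IDLE_STATES[-1][1]


def idle_trace(initial_load, periods):
    """Idle state chosen in each period while the CPU stays idle."""
    load = float(initial_load)
    trace = []
    for _ in range(periods):
        trace.append(pick_idle_state(load))
        load *= DECAY  # no new work arrives, so the load only decays
    return trace


trace = idle_trace(1024, 128)
```

With these made-up numbers the CPU starts in the shallowest state and only
reaches the deepest one after the load has decayed for a while, which is
the gradual behaviour described above.
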
Meanwhile the scheduler should ensure that the tasks are retained on
the CPU whose frequency is boosted, and should not migrate them away
through load balancing, so that they can finish quickly. This, I think,
is what is missing. Again, this comes down to the scheduler taking
feedback from the CPU frequency governors, which is not currently
happening.