Re: [RFC PATCH 3/3] idle: store the idle state index in the struct rq

From: Nicolas Pitre
Date: Mon Feb 03 2014 - 09:58:26 EST


On Mon, 3 Feb 2014, Morten Rasmussen wrote:

> On Fri, Jan 31, 2014 at 06:19:26PM +0000, Nicolas Pitre wrote:
> > A cluster should map naturally to a scheduling domain. If we need to
> > wake up a CPU, it is quite obvious that we should prefer an idle CPU
> > from a scheduling domain whose load is not zero. If the load is not
> > zero then this means that any idle CPU in that domain, even if it
> > indicated it was ready for a cluster power down, will not require the
> > cluster power-up latency as some other CPUs must still be running. But
> > we already know that of course even if the recorded latency might not
> > say so.
> >
> > In other words, the hardware latency information is dynamic of course.
> > But we might not _need_ to have it reflected at the scheduler domain all
> > the time as in this case it can be inferred by the scheduling domain
> > load.
>
> I agree that the existing sched domain hierarchy should be used to
> represent the power topology. But, it is not clear to me how much we can say
> about the C-state of a cpu without checking the load of the entire cluster
> every time?
>
> We would need to know which C-states (indexes) are per cpu and which are
> per cluster, and ignore the cluster states when the cluster load is
> non-zero.

In any case, i.e. whether the cluster load is zero or not, we want to
select the CPU to wake up with the shallowest C-state. That should
correspond to the actual cluster C-state already without having to track
it explicitly.
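
To illustrate the selection rule, here is a minimal userspace sketch, not
kernel code. The struct cpu_info type and pick_shallowest_idle_cpu() are
hypothetical names made up for this example, and the convention that a
negative idle_state means "running" is an assumption of the sketch.

```c
#include <stddef.h>

/* Hypothetical per-CPU view for illustration; this is not struct rq. */
struct cpu_info {
	int idle_state;	/* < 0: running, 0: shallowest idle state, larger: deeper */
};

/*
 * Return the position of the idle CPU in the shallowest idle state,
 * or -1 when no CPU is idle. Ties go to the lowest-numbered CPU.
 */
int pick_shallowest_idle_cpu(const struct cpu_info *cpus, size_t ncpus)
{
	int best = -1;

	for (size_t i = 0; i < ncpus; i++) {
		if (cpus[i].idle_state < 0)
			continue;	/* CPU is running, not a wake-up candidate */
		if (best < 0 || cpus[i].idle_state < cpus[best].idle_state)
			best = (int)i;
	}
	return best;
}
```

A CPU that recorded a cluster-level state would simply carry a deeper
index here, so it is picked last without any explicit cluster tracking.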

> Current sched domain load is not maintained in the scheduler, it is only
> produced when needed. But I guess you could derive the necessary
> information from the idle cpu masks.

Even better.
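
A rough sketch of that derivation, assuming plain 64-bit bitmaps in place
of the kernel's cpumask type. The function name domain_has_running_cpu()
is hypothetical and only serves this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * span:      bitmap of the CPUs belonging to the scheduling domain
 * idle_mask: bitmap of the CPUs that are currently idle
 *
 * The domain has a non-zero load as soon as one CPU in its span is
 * not idle, so no per-domain load figure needs to be maintained.
 */
bool domain_has_running_cpu(uint64_t span, uint64_t idle_mask)
{
	return (span & ~idle_mask) != 0;
}
```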

> > Within a scheduling domain it is OK to pick up the best idle CPU by
> > looking at the index as it is best to leave those CPUs ready for a
> > cluster power down set to that state and prefer one which is not. And a
> > scheduling domain with a load of zero should be left alone if idle CPUs
> > are found in another domain whose load is not zero, irrespective of
> > absolute latency information. So all the existing heuristics already in
> > place to optimize cache utilization and so on will make things just work
> > for idle as well.
>
> IIUC, you propose to only use the index when picking an idle cpu inside
> an already busy sched domain and leave idle sched domains alone if
> possible. It may work for homogeneous SMP systems, but I don't think it
> will work for heterogeneous systems like big.LITTLE.

Hence the caveat "everything else being equal" I mentioned previously.
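
For the homogeneous case, the ordering I have in mind could be sketched
as follows. This is a userspace illustration only: struct idle_cpu, the
domain_busy[] array and pick_idle_cpu() are hypothetical, and every entry
passed in is assumed to be an idle CPU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one idle CPU. */
struct idle_cpu {
	int domain;	/* index into domain_busy[] */
	int idle_state;	/* 0: shallowest idle state, larger: deeper */
};

/*
 * Prefer an idle CPU whose domain already has something running, so
 * that fully idle domains are left alone. Only between candidates of
 * the same kind is the idle state index compared.
 * Returns -1 when the list is empty.
 */
int pick_idle_cpu(const struct idle_cpu *c, size_t n, const bool *domain_busy)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (best < 0) {
			best = (int)i;
			continue;
		}

		bool cand_busy = domain_busy[c[i].domain];
		bool best_busy = domain_busy[c[best].domain];

		if (cand_busy != best_busy) {
			if (cand_busy)
				best = (int)i;	/* busy domain wins */
			continue;
		}
		if (c[i].idle_state < c[best].idle_state)
			best = (int)i;		/* shallower state wins */
	}
	return best;
}
```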

> If the little cluster has zero load and the big has stuff running, it
> doesn't mean that it is a good idea to wake up another big cpu. It may
> be more power efficient to wake up the little cluster. Comparing idle
> state index of a big and little cpu won't help us in making that choice
> as the clusters may have different idle states and the costs associated
> with each state are different.

Agreed. But let's evolve this in manageable steps.

> I'm therefore not convinced that idle state index is the right thing to
> give the scheduler. Using a cost metric would be better in my
> opinion.

It won't be difficult to move from the idle state index to some other
cost metric once we've proven the simple index on homogeneous systems
has benefits.
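
As a sketch of what that move might look like, the comparison key changes
from the index to a cost value and nothing else. The struct idle_cpu_cost
type, the wake_cost field and its arbitrary units are assumptions made for
this example; nothing like them exists in the patch.

```c
#include <stddef.h>

/* Hypothetical record for one idle CPU. */
struct idle_cpu_cost {
	int idle_state;		/* per-cluster index, not comparable across clusters */
	unsigned int wake_cost;	/* assumed unified cost, arbitrary units */
};

/*
 * Same selection loop as with the index, keyed on wake_cost instead.
 * Returns -1 when the list is empty.
 */
int pick_cheapest_idle_cpu(const struct idle_cpu_cost *c, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (best < 0 || c[i].wake_cost < c[best].wake_cost)
			best = (int)i;
	}
	return best;
}
```

With made-up numbers, a big CPU at index 1 with a cost of 500 would lose
to a little CPU at index 2 with a cost of 300, which the index alone
cannot express.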


Nicolas