Re: [PATCH RFC 0/4] Scheduler idle notifiers and users

From: Amit Kucheria
Date: Tue Feb 21 2012 - 09:52:10 EST

Next message: Fabio Estevam: "[PATCH] dma: dmaengine: Distinguish between 'dmaengine: failed to get' messages"
Previous message: Steven Rostedt: "Re: RAS trace event proto"
In reply to: Pantelis Antoniou: "Re: [PATCH RFC 0/4] Scheduler idle notifiers and users"
Next in thread: Pantelis Antoniou: "Re: [PATCH RFC 0/4] Scheduler idle notifiers and users"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Feb 21, 2012 at 3:31 PM, Pantelis Antoniou
<panto@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Feb 21, 2012, at 2:56 PM, Peter Zijlstra wrote:
>
>> On Tue, 2012-02-21 at 14:38 +0200, Pantelis Antoniou wrote:
>>>
>>> If we go to all the trouble of integrating cpufreq/cpuidle/sched into scheduler
>>> callbacks, we should place hooks into the thermal framework/PM as well.
>>>
>>> It will be pretty common to have per-core temperature readings on most
>>> modern SoCs.
>>>
>>> It is quite conceivable to have a case with a multi-core CPU where due
>>> to load imbalance, one (or more) of the cores is running at full speed
>>> while the rest are mostly idle. What you want to do, for best performance
>>> and conceivably better power consumption, is not to throttle the
>>> frequency or lower the voltage of the overloaded CPU, but to migrate the
>>> load to one of the cooler CPUs.
>>>
>>> This affects CPU capacity immediately, i.e. you shouldn't schedule more
>>> load on a CPU that is too hot, since you'll only end up triggering thermal
>>> shutdown. The ideal solution would be to round-robin
>>> the load from the hot CPU to the cooler ones, but not so fast that we lose
>>> out due to the migration of state from one CPU to the other.
>>>
>>> In a nutshell, the processing capacity of a core is not static, i.e. it
>>> might degrade over time due to the increase of temperature caused by the
>>> previous load.
>>>
>>> What do you think?
>>
>> This is called core-hopping, and yes that's a nice goal, although I
>> would like to do that after we get the 'simple' bits up and running. I
>> suspect it'll end up being slightly more complex than we'd like due
>> to the fact that the goal conflicts with wanting to aggregate things on
>> cpu0 due to cpu0 being special for a host of reasons.
>>
>>
>
> Hi Peter,
>
> Agreed. We need to get there step by step, and I think that per-task load tracking
> is the first one. We do have other metrics besides load that can influence the
> scheduler decisions, with the most obvious being power consumption.
>
> BTW, since we're going to the trouble of calculating per-task load with
> increased accuracy, how about giving some thought to translating the load
> numbers into an absolute format? I.e. with the CPUs now having fluctuating
> performance (due to cpufreq etc.) one would say that each CPU would have an
> X bogomips (or some other absolute) capacity per OPP. Perhaps having such a
> bogomips number calculated per-task would make things easier.
>
> Perhaps the same can be done with power/energy, i.e. have a per-task power
> consumption figure that we can use for scheduling, according to the available
> power budget per CPU.
>
> Dunno, it might not be feasible ATM, but having a power-aware scheduler would
> assume some kind of power measurement, no?

No please. We don't want to document ADC requirements, current probe
specs and sampling rates to successfully run the Linux kernel. :)

But from the scheduler mini-summit, there is acceptance that we need
to pass *some* knowledge of CPU characteristics to Linux. These need
to be distilled down to a few that guide scheduler policy, e.g. the power
cost of using a core. This in turn would influence the scheduler's
spread-or-gather decision (whether it is better to consolidate tasks onto a
few cores or spread them out at low frequencies). Manufacturing processes and
CPU architecture obviously play a role in the differences here.
However, I don't expect the units for these parameters to be in mW. :)
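
To make the spread-or-gather trade-off concrete, here is a rough sketch in
plain C. It is only an illustration under stated assumptions: the OPP table,
the unitless cost numbers and every function name below are hypothetical and
do not come from any kernel code or from this patch series. It assumes
identical cores, zero cost for idle cores, and an even split of the load.

```c
#include <stddef.h>

/* Hypothetical per-OPP figures for one core type. Both fields are
 * relative, unitless numbers (deliberately not mW). */
struct opp_cost {
	unsigned int capacity;	/* relative throughput at this OPP */
	unsigned int cost;	/* relative power cost at this OPP */
};

/* Made-up table, sorted by increasing capacity. */
static const struct opp_cost opps[] = {
	{  250, 10 },
	{  500, 25 },
	{ 1000, 80 },
};
#define NR_OPPS (sizeof(opps) / sizeof(opps[0]))

/* Cost of the lowest OPP that can carry 'load', or -1 if none can. */
static int cost_for_load(unsigned int load)
{
	size_t i;

	for (i = 0; i < NR_OPPS; i++)
		if (opps[i].capacity >= load)
			return (int)opps[i].cost;
	return -1;
}

/* Total cost of splitting 'total_load' evenly over 'ncores' cores.
 * Assumption: the remaining cores are idle and cost nothing. */
static int placement_cost(unsigned int total_load, unsigned int ncores)
{
	unsigned int per_core = (total_load + ncores - 1) / ncores;
	int c = cost_for_load(per_core);

	return c < 0 ? -1 : c * (int)ncores;
}

/* Number of cores (1..max_cores) giving the lowest total cost;
 * ties go to fewer cores. Returns 0 if the load does not fit. */
static unsigned int best_core_count(unsigned int total_load,
				    unsigned int max_cores)
{
	unsigned int n, best = 0;
	int best_cost = -1;

	for (n = 1; n <= max_cores; n++) {
		int c = placement_cost(total_load, n);

		if (c >= 0 && (best_cost < 0 || c < best_cost)) {
			best_cost = c;
			best = n;
		}
	}
	return best;
}
```

With this made-up table, a light load is cheapest on a single core (gather),
while a heavy load is cheapest spread over all cores at the lowest OPP. A
different cost table, e.g. for another manufacturing process, would move the
crossover point, which is why the figures have to come from the platform.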

/Amit
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
