Re: Plumbers: Tweaking scheduler policy micro-conf RFP

From: Morten Rasmussen
Date: Fri May 18 2012 - 12:24:48 EST


On Fri, May 18, 2012 at 05:18:17PM +0100, Morten Rasmussen wrote:
> On Tue, May 15, 2012 at 04:35:41PM +0100, Peter Zijlstra wrote:
> > On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
> > > On 15 May 2012 15:00, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
> > > >>
> > >> I'm not sure that nobody cares; it's more that the scheduler,
> > >> load_balance and sched_mc are sensitive enough that it's difficult to
> > >> ensure that a modification will not break everything for someone
> > >> else.
> > > >
> > > > Thing is, it's already broken, there's nothing else to break :-)
> > > >
> > >
> > > sched_mc is the only power-aware knob in the current scheduler. It's
> > > far from perfect, but it seems to work on some ARM platforms at
> > > least. You mentioned at the scheduler mini-summit that we need a
> > > cleaner replacement, and everybody agreed on that point. Is anybody
> > > working on it yet?
> >
> > Apparently not..
> >
> > > And can we discuss at Plumbers what this replacement would look like?
> >
> > one knob: sched_balance_policy with tri-state {performance, power, auto}
>
> Interesting. What would the power policy look like? Would performance
> and power be the two extremes of the power/performance trade-off? In
> that case I would assume that most embedded systems would be using auto.
>
> >
> > Where auto should likely look at things like whether we are on battery
> > and co-ordinate with the cpufreq muck or whatever.
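
To make that single knob concrete, here is a stand-alone sketch of how I
read the proposal (the enum names and the on_battery() stand-in are my own
illustration based on this thread, not an existing interface):

/*
 * Illustrative sketch of the proposed single knob: three states, with
 * "auto" resolved from platform state such as being on battery.
 */
#include <stdbool.h>
#include <stdio.h>

enum sched_balance_policy {
        SCHED_POLICY_PERFORMANCE,       /* spread over shared resources */
        SCHED_POLICY_POWER,             /* pack to keep domains idle    */
        SCHED_POLICY_AUTO,              /* derive from platform state   */
};

/* Stand-in for the battery/cpufreq co-ordination mentioned above. */
static bool on_battery(void)
{
        return true;
}

static enum sched_balance_policy effective_policy(enum sched_balance_policy p)
{
        if (p != SCHED_POLICY_AUTO)
                return p;
        return on_battery() ? SCHED_POLICY_POWER : SCHED_POLICY_PERFORMANCE;
}

int main(void)
{
        printf("auto resolves to %s\n",
               effective_policy(SCHED_POLICY_AUTO) == SCHED_POLICY_POWER ?
               "power" : "performance");
        return 0;
}
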
> >
> > Per-domain knobs are insane, large multi-state knobs are insane; the
> > existing scheme is therefore insane^2. Can you find a sysadmin who'd like
> > to explore 3^3=27 states for optimal power/perf for his workload on a
> > simple 2-socket hyper-threaded machine, and a 3^4=81 state space for 8
> > sockets, etc.?
> >
> > As to the exact policy, I think the current two (load-balance + wakeup)
> > are the sensible ones..
> >
> > Also, I still have this pending email from you asking about the topology
> > setup stuff that I really need to reply to.. but people keep sending me
> > bug reports :/
> >
> > But really short, look at kernel/sched/core.c:default_topology[]
> >
> > I'd like to collapse sd_init_* into a single function like
> > sd_numa_init(); this would mean all that archs would need to do is
> > provide a simple list of ever-increasing masks that match their topology.
> >
> > To aid this we can add some SDTL_* flags; initially I was thinking of:
> >
> > SDTL_SHARE_CORE -- aka SMT
> > SDTL_SHARE_CACHE -- LLC cache domain (typically multi-core)
> > SDTL_SHARE_MEMORY -- NUMA-node (typically socket)
> >
> > The 'performance' policy is typically to spread over shared resources so
> > as to minimize contention on these.
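
To make sure I read this right, here is a stand-alone illustration of such
a list (the struct layout, the flag values and the toy 8-CPU machine are
made up; this is not kernel code):

/*
 * One entry per level, smallest mask first, with SDTL_* flags saying
 * which resource is shared at that level.
 */
#include <stdio.h>

#define SDTL_SHARE_CORE    0x1  /* SMT siblings               */
#define SDTL_SHARE_CACHE   0x2  /* cores sharing the LLC      */
#define SDTL_SHARE_MEMORY  0x4  /* CPUs on the same NUMA node */

struct topo_level {
        unsigned long (*mask)(int cpu); /* CPUs grouped with @cpu */
        int flags;
        const char *name;
};

/* Toy machine: one node, 2 clusters of 2 cores, 2 SMT threads = 8 CPUs. */
static unsigned long smt_mask(int cpu)  { return 0x3UL << (cpu & ~1); }
static unsigned long llc_mask(int cpu)  { return 0xfUL << (cpu & ~3); }
static unsigned long node_mask(int cpu) { (void)cpu; return 0xffUL; }

static struct topo_level topology[] = {
        { smt_mask,  SDTL_SHARE_CORE,   "SMT"  },
        { llc_mask,  SDTL_SHARE_CACHE,  "MC"   },
        { node_mask, SDTL_SHARE_MEMORY, "NODE" },
        { NULL, 0, NULL },                      /* terminator */
};

int main(void)
{
        for (struct topo_level *tl = topology; tl->mask; tl++)
                printf("%-4s flags=%#x mask(cpu5)=0x%02lx\n",
                       tl->name, tl->flags, tl->mask(5));
        return 0;
}
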
> >
>
> Would it be worth extending this architecture specification to contain
> more information, like a CPU_POWER value for each core? After having
> experimented a bit with scheduling on big.LITTLE, my experience is that
> more information about the platform is needed to make proper scheduling
> decisions. So if the topology definition is going to be more generic and
> be set up by the architecture, it could be worth adding all the bits of
> information that the scheduler would need to that data structure.
>
> With such a data structure, the scheduler would only need one knob to
> adjust the power/performance trade-off. Any thoughts?
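
As a concrete example of the kind of per-core data I mean, a cpu_power-style
capacity table would already let the balancer compare load across big and
LITTLE cores. The numbers below are invented for a hypothetical 2 big +
2 LITTLE system; this is not a proposed interface:

/*
 * Per-CPU compute capacity and capacity-scaled load, as a toy model.
 */
#include <stdio.h>

#define SCHED_POWER_SCALE 1024U  /* capacity of a full-speed big core */

/* CPUs 0-1: big cores, CPUs 2-3: LITTLE cores at roughly 40% of big. */
static const unsigned int cpu_power[4] = { 1024, 1024, 410, 410 };

/* Load scaled by capacity: what a capacity-aware balancer would compare. */
static unsigned int scaled_load(unsigned int raw_load, int cpu)
{
        return raw_load * SCHED_POWER_SCALE / cpu_power[cpu];
}

int main(void)
{
        /* The same raw load weighs heavier on a LITTLE core. */
        printf("load 512 on big cpu0:    %u\n", scaled_load(512, 0));
        printf("load 512 on LITTLE cpu2: %u\n", scaled_load(512, 2));
        return 0;
}
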
>

One more thing. I have experimented with PJT's load-tracking patchset
and found it very useful for big.LITTLE scheduling. Are there any plans
for including it?
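
For reference, my rough model of the idea behind those patches (my reading,
not the patches themselves) is that runnable time accumulates in ~1ms
periods and past periods decay geometrically with a half-life of 32
periods, so recent behaviour dominates the per-entity load:

#include <math.h>
#include <stdio.h>

#define HALFLIFE_PERIODS 32

int main(void)
{
        double y = pow(0.5, 1.0 / HALFLIFE_PERIODS);    /* y^32 == 0.5 */
        double load = 0.0;
        int p;

        for (p = 0; p < 100; p++)       /* runnable for 100 periods */
                load = load * y + 1.0;
        printf("after 100 busy periods: %.2f\n", load);

        for (p = 0; p < 32; p++)        /* then idle: decay only    */
                load = load * y;
        printf("after 32 idle periods:  %.2f\n", load);
        return 0;
}
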

Morten

> > If you want to add some power awareness we need some extra flags, maybe
> > something like:
> >
> > SDTL_SHARE_POWERLINE -- power domain (typically socket)
> >
> > so you know where the boundaries are across which you can turn stuff off,
> > and hence what/where to pack bits.
> >
> > Possibly we also add something like:
> >
> > SDTL_PERF_SPREAD -- spread on performance mode
> > SDTL_POWER_PACK -- pack on power mode
> >
> > To override the defaults. But ideally I'd leave those until after we've
> > got the basics working and there is a clear need for them (with a
> > spread/pack default for the perf/power-aware policies).
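
To check my understanding of the defaults and overrides, here is a sketch of
how those flags could steer the pack-vs-spread decision (the flag values and
the helper are made up for illustration):

#include <stdbool.h>
#include <stdio.h>

#define SDTL_SHARE_POWERLINE 0x08  /* CPUs fed by the same power domain */
#define SDTL_PERF_SPREAD     0x10  /* spread at this level in perf mode */
#define SDTL_POWER_PACK      0x20  /* pack at this level in power mode  */

enum balance_policy { POLICY_PERFORMANCE, POLICY_POWER };

/* true: pack tasks onto few groups; false: spread across groups. */
static bool should_pack(enum balance_policy policy, int level_flags)
{
        if (policy == POLICY_PERFORMANCE) {
                /* Performance: spread over shared resources by default. */
                return false;
        }
        /* Power: pack at explicit pack levels or where a whole power
         * domain could be turned off. */
        return (level_flags & (SDTL_POWER_PACK | SDTL_SHARE_POWERLINE)) != 0;
}

int main(void)
{
        printf("power mode, powerline level: pack=%d\n",
               should_pack(POLICY_POWER, SDTL_SHARE_POWERLINE));
        printf("perf  mode, powerline level: pack=%d\n",
               should_pack(POLICY_PERFORMANCE, SDTL_SHARE_POWERLINE));
        return 0;
}
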
>
> In my experience, power-optimized scheduling is quite tricky, especially
> if you still want some level of performance. For heterogeneous
> architectures, packing might not be the best solution. Some indication of
> the power/performance profile of each core could be useful.
>
> Best regards,
> Morten
>
>
