Re: [PATCH 03/11] sched: Extend scheduler's asym packing

From: Morten Rasmussen
Date: Thu Aug 25 2016 - 07:26:19 EST


On Thu, Aug 18, 2016 at 03:36:44PM -0700, Srinivas Pandruvada wrote:
> From: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>
> We generalize the scheduler's asym packing to provide an
> ordering of the cpu beyond just the cpu number. This allows
> the use of the ASYM_PACKING scheduler machinery to move
> loads to prefered CPU in a sched domain based on a preference
> defined by sched_asym_prefer function.
>
> We also record the most preferred cpu in a sched group when
> we build the cpu's capacity for fast lookup of preferred cpu
> during load balancing.
>
> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>

[...]

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c64fc51..75e1002 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -532,6 +532,22 @@ struct dl_rq {
>
> #ifdef CONFIG_SMP
>
> +#ifndef sched_asym_prefer
> +
> +/* For default ASYM_PACKING, lower numbered cpu is prefered */
> +static inline bool sched_asym_prefer(int a, int b)
> +{
> + return a < b;
> +}
> +
> +#endif /* sched_asym_prefer */

Isn't this a very significant change in the interface between
architecture and the scheduler?

If I'm not mistaken, our current interface is quite strict when it comes
to information passed from the architecture into the scheduler. We allow
'topology' flags, but not behavioural flags, to be set by the
architecture, and the architecture can expose current and max cpu
capacities through the arch_scale_*() functions. For NUMA, we can expose
'distance' between nodes (and more?).

These are meant to describe the system topology to the scheduler, so it
can make better decisions on its own. sched_asym_prefer() is not only
affecting scheduler behaviour, it is handing off scheduling decisions to
architecture code. In essence, it allows logic to be plugged into the
scheduler, although with a somewhat limited scope of impact.

Should this be seen as the architecture/scheduler interface being up for
revision, so that we will start allowing architecture code to plug in
functions to affect scheduling behaviour?

I haven't reviewed the entire patch set in detail, but why can't the cpu
priority list be handed to the scheduler instead of moving scheduling
decisions out of the scheduler?
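
To make the question concrete, here is a rough user-space sketch of what
handing over a priority list could look like. It is only an illustration
under assumptions: sched_cpu_prio[], sched_set_cpu_priority() and
NR_CPUS_SKETCH are made-up names, not existing kernel interfaces. The
architecture only supplies data, and the comparison stays in scheduler
code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 8	/* hypothetical, stands in for NR_CPUS */

/* Hypothetical per-cpu priority table owned by the scheduler. */
static int sched_cpu_prio[NR_CPUS_SKETCH];

/* Hypothetical hook: architecture code passes in data, not logic. */
static inline void sched_set_cpu_priority(int cpu, int prio)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	sched_cpu_prio[cpu] = prio;
}

/*
 * The decision is made in scheduler code: higher priority is preferred,
 * and ties fall back to the lower numbered cpu, as in the default above.
 */
static inline bool sched_asym_prefer(int a, int b)
{
	assert(a >= 0 && a < NR_CPUS_SKETCH);
	assert(b >= 0 && b < NR_CPUS_SKETCH);

	if (sched_cpu_prio[a] != sched_cpu_prio[b])
		return sched_cpu_prio[a] > sched_cpu_prio[b];
	return a < b;
}
```

With all priorities left at zero this behaves like the existing
lower-numbered-cpu-first ordering.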

Isn't it possible to express the cpu 'priority' as different cpu
capacities instead? Without understanding the details of ITMT, it seems
to me that what you really have is different cpu compute capacities, and
that is what we have cpu capacity for.
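
As a sketch of the capacity idea, and again only under assumptions since
I don't know the ITMT details: if the hardware gives a per-cpu priority
value and a maximum, it could be normalised onto the capacity scale.
sketch_prio_to_capacity() and SKETCH_CAPACITY_SCALE are hypothetical
names; the latter just mirrors the 1024 used by SCHED_CAPACITY_SCALE.

```c
#include <assert.h>

/* Hypothetical constant mirroring the kernel's capacity scale of 1024. */
#define SKETCH_CAPACITY_SCALE 1024UL

/*
 * Hypothetical helper: map a priority in [0, max_prio] linearly onto
 * [0, SKETCH_CAPACITY_SCALE], so the most preferred cpu gets full
 * capacity. Whether a linear mapping is meaningful for ITMT is an
 * assumption.
 */
static inline unsigned long sketch_prio_to_capacity(unsigned int prio,
						    unsigned int max_prio)
{
	assert(max_prio > 0 && prio <= max_prio);
	return SKETCH_CAPACITY_SCALE * prio / max_prio;
}
```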

Is the long-term intention to change the cpu priority order on the fly?
Otherwise I don't see why you would put the logic in architecture code.

Finally, the existing callback functions from the scheduler to
architecture code are prefixed with arch_. I think this one should do
the same to make it clear that this function may be implemented by
architecture code.
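
For illustration, the default from the patch with such a prefix might
look like the following. The name arch_asym_prefer is only a suggestion,
not an existing function; the override mechanism is the same #ifndef
pattern the patch already uses.

```c
#include <stdbool.h>

/*
 * Hypothetical name. An architecture would override this by providing
 * its own version and a matching "#define arch_asym_prefer ...".
 */
#ifndef arch_asym_prefer

/* Default for ASYM_PACKING: the lower numbered cpu is preferred. */
static inline bool arch_asym_prefer(int a, int b)
{
	return a < b;
}

#endif /* arch_asym_prefer */
```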

Thanks,
Morten