Re: [PATCH v2] sched: rt: Make RT capacity aware

From: Vincent Guittot
Date: Tue Oct 29 2019 - 08:20:59 EST


On Tue, 29 Oct 2019 at 12:48, Qais Yousef <qais.yousef@xxxxxxx> wrote:
>
> On 10/29/19 12:17, Vincent Guittot wrote:
> > On Tue, 29 Oct 2019 at 12:02, Qais Yousef <qais.yousef@xxxxxxx> wrote:
> > >
> > > On 10/29/19 09:13, Vincent Guittot wrote:
> > > > On Wed, 9 Oct 2019 at 12:46, Qais Yousef <qais.yousef@xxxxxxx> wrote:
> > > > >
> > > > > Capacity Awareness refers to the fact that on heterogeneous systems
> > > > > (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> > > > > when placing tasks we need to be aware of this difference of CPU
> > > > > capacities.
> > > > >
> > > > > In such scenarios we want to ensure that the selected CPU has enough
> > > > > capacity to meet the requirement of the running task. Enough capacity
> > > > > means here that capacity_orig_of(cpu) >= task.requirement.
> > > > >
> > > > > The definition of task.requirement is dependent on the scheduling class.
> > > > >
> > > > > > For CFS, utilization is used to select a CPU whose capacity is at least
> > > > > > cfs_task.util.
> > > > >
> > > > > capacity_orig_of(cpu) >= cfs_task.util
> > > > >
> > > > > > DL isn't capacity aware at the moment but can make use of the bandwidth
> > > > > > reservation to implement that in a manner similar to how CFS uses
> > > > > > utilization. The following patchset implements that:
> > > > >
> > > > > https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@xxxxxxxxxxxxxxx/
> > > > >
> > > > > > capacity_orig_of(cpu)/SCHED_CAPACITY_SCALE >= dl_runtime/dl_deadline
> > > > >
> > > > > For RT we don't have a per task utilization signal and we lack any
> > > > > information in general about what performance requirement the RT task
> > > > > needs. But with the introduction of uclamp, RT tasks can now control
> > > > > that by setting uclamp_min to guarantee a minimum performance point.
> > > > >
> > > > > > ATM the uclamp values are only used for frequency selection; but on
> > > > > > heterogeneous systems this is not enough and we need to ensure that the
> > > > > > capacity of the CPU is >= uclamp_min, which is what is implemented here.
> > > > >
> > > > > capacity_orig_of(cpu) >= rt_task.uclamp_min
> > > > >
> > > > > > Note that by default uclamp.min is 1024, which means that RT tasks will
> > > > > > always be biased towards the big CPUs, which makes for a better, more
> > > > > > predictable behavior for the default case.
> > > >
> > > > hmm... big cores are not always the best choice for rt tasks, they
> > > > generally take more time to wake up or to switch context because of
> > > > the pipeline depth and other things like branch prediction
> > >
> > > Can you quantify this into a number? I suspect this latency should be in the
> >
> > As a general rule, we pinned IRQs on little cores because of such a
> > responsiveness difference. I don't have numbers in mind as the tests
> > were run in the early days of b.L systems, a few years ago.
> > Then, if you look at some idle state definitions in DT, you will see
> > that the exit latency of the cluster down state of the big cores of
> > hikey960 is 2900us vs 1600us for the little ones
>
> I don't think hikey960 is a good system to use as a reference. SD845 shows more
> sensible numbers

It is no worse than any other

>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/qcom/sdm845.dtsi?h=v5.4-rc5#n407
>
> >
> > > 200-300us range. And the difference between little and big should be much
> > > > smaller than that, no? We can't give guarantees of that order in Linux in
> > > > general, and serious real-time users have to do extra tweaks like disabling
> > > > power management, which can introduce latency and hinder determinism, besides
> > > > enabling PREEMPT_RT.
> > >
> > > For generic systems a few ms is the best we can give and we can easily fall out
> > > of this without any tweaks.
> > >
> > > The choice of going to the maximum performance point in the system for RT tasks
> > > by default goes beyond this patch anyway. I'm just making it consistent here
> > > since we have different performance levels and RT didn't understand this
> > > before.
> > >
> > > So what I'm doing here is just make things consistent rather than change the
> > > default.
> > >
> > > What do you suggest?
> >
> > Making big cores the default CPUs for all RT tasks is not a minor
> > change and IMO locality should stay the default behavior when there is
> > no uclamp constraint
>
> How is this affecting locality? The task will always go to the big core, so it
> should be local.

local with the waker
You will force the rt task to run on the big cluster although the waker,
data and interrupts can be on the little one.
So making big cores the default is far from always being the best choice

>
> And before introducing uclamp the default was going to the maximum frequency
> anyway - which is the highest performance point. So what this does is basically
> make sure that if we asked for a 1024 capacity, we get that.
>
> Besides, the decision is taken by the setup of the system-wide uclamp.min. We
> can change this to be something smaller but I don't think we can come up with
> a better value by default. Admin should tune this to something smaller if the
> performance of their little cores is good for their needs.
>
> What this patch says is: if I want the uclamp.min of my RT task to be 1024, then
> we give better guarantees it'll get the 1024 performance it asked for. And the
> default of 1024 is consistent with what Linux has always done for RT out of the
> box.
>
> --
> Qais Yousef
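
To make the effect of the default concrete, here is a small sketch of which
CPUs satisfy capacity >= uclamp_min for different uclamp.min values. The
capacity table (four little CPUs at 462, four big CPUs at 1024) and the helper
name are hypothetical; this only illustrates the comparison discussed in the
thread, not the kernel's actual CPU selection path.

```c
#include <stddef.h>

/* Hypothetical big.LITTLE layout: CPUs 0-3 little, CPUs 4-7 big. */
static const unsigned int example_capacity[] = {
	462, 462, 462, 462, 1024, 1024, 1024, 1024,
};

#define EXAMPLE_NR_CPUS \
	(sizeof(example_capacity) / sizeof(example_capacity[0]))

/* Return a bitmask of the CPUs whose capacity is >= uclamp_min. */
static unsigned int rt_fitting_cpus(unsigned int uclamp_min)
{
	unsigned int mask = 0;

	for (size_t cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++) {
		if (example_capacity[cpu] >= uclamp_min)
			mask |= 1u << cpu;
	}
	return mask;
}
```

In this layout a uclamp.min of 1024 leaves only the big CPUs (mask 0xf0) as
fitting candidates, whereas any value of 462 or below makes every CPU a
candidate (mask 0xff), in which case the capacity check no longer restricts
placement.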