Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller

From: Vincent Guittot
Date: Mon Apr 29 2019 - 08:25:12 EST


Hi Song,

On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@xxxxxx> wrote:
>
> Hi Morten and Vincent,
>
> > On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@xxxxxx> wrote:
> >
> > Hi Vincent,
> >
> >> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> >>
> >> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@xxxxxx> wrote:
> >>>
> >>> Hi Morten,
> >>>
> >>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> >>>>
> >>
> >>>>
> >>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
> >>>> workload. I'm not convinced that adding headroom gives any latency
> >>>> improvements beyond watering down the impact of your side jobs. AFAIK,
> >>>
> >>> We think the latency improvements actually come from watering down the
> >>> impact of side jobs. Headroom does not just statistically improve average
> >>> latency numbers; it also reduces resource contention caused by the side
> >>> workload. I don't know whether it is from reducing contention of ALUs,
> >>> memory bandwidth, CPU caches, or something else, but we saw reduced
> >>> latencies when headroom is used.
> >>>
> >>>> the throttling mechanism effectively removes the throttled tasks from
> >>>> the schedule according to a specific duty cycle. When the side job is
> >>>> not throttled the main workload is experiencing the same latency issues
> >>>> as before, but by dynamically tuning the side job throttling you can
> >>>> achieve a better average latency. Am I missing something?
> >>>>
> >>>> Have you looked at your distribution of main job latency and tried to
> >>>> compare with when throttling is active/not active?
> >>>
> >>> cfs_bandwidth adjusts the allowed runtime for each task_group each period
> >>> (configurable, 100ms by default). The cpu.headroom logic applies gentle
> >>> throttling, so that the side workload gets some runtime in every period.
> >>> Therefore, if we look at a time window equal to or bigger than 100ms, we
> >>> don't really see "throttling active time" vs. "throttling inactive time".
> >>>
> >>>>
> >>>> I'm wondering if the headroom solution is really the right solution for
> >>>> your use-case or if what you are really after is something which is
> >>>> lower priority than just setting the weight to 1. Something that
> >>>
> >>> The experiments show that cpu.weight handles priority properly: the
> >>> main workload gets priority to use the CPU, while the side workload only
> >>> fills the idle CPU. However, this is not sufficient, as the side workload
> >>> creates enough contention to impact the main workload.
> >>>
> >>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
> >>>> SCHED_IDLE might not be enough). If your main job consists
> >>>> of lots of relatively short wake-ups, things like the min_granularity
> >>>> could have significant latency impact.
> >>>
> >>> cpu.headroom gives benefits in addition to optimizations on the
> >>> pre-emption side. By maintaining some idle time, fewer pre-emptions are
> >>> necessary, thus the main workload will get better latency.
> >>
> >> I agree with Morten's proposal: SCHED_IDLE should help your latency
> >> problem because the side job will be preempted immediately, unlike a
> >> normal cfs task, even one at the lowest priority.
> >> In addition to min_granularity, sched_period also has an impact on the
> >> time that a task has to wait before preempting the running task. Some
> >> sched_features like GENTLE_FAIR_SLEEPERS can also impact the latency of
> >> a task.
> >>
> >> It would be nice to know if the latency problem comes from contention
> >> on cache resources or if it's mainly because your main load waits
> >> before running on a CPU.
> >>
> >> Regards,
> >> Vincent
> >
> > Thanks for these suggestions. Here are some more tests to show the impact
> > of scheduler knobs and cpu.headroom.
> >
> > side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
> > ----------+--------------+-----------------+----------+----------+-------------
> > none      | 0            | n/a             | 1 ms     | 45.20%   | 1.00
> > ffmpeg    | 0            | 1               | 10 ms    |  3.38%   | 1.46
> > ffmpeg    | 0            | SCHED_IDLE      | 1 ms     |  5.69%   | 1.42
> > ffmpeg    | 20%          | SCHED_IDLE      | 1 ms     | 19.00%   | 1.13
> > ffmpeg    | 30%          | SCHED_IDLE      | 1 ms     | 27.60%   | 1.08
> >
> > In all these cases, the main workload is loaded with the same level of
> > traffic (requests per second). Main workload latency numbers are
> > normalized to the baseline (first row).
> >
> > For the baseline, the main workload runs without any side workload; the
> > system has about 45.20% idle CPU.
> >
> > The next two rows compare the impact of scheduling knobs cpu.weight and
> > sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
> > we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
> > see a latency of 1.42. So SCHED_IDLE and min_granularity help protect
> > the main workload. However, they are not sufficient, as the latency
> > overhead is still high (>40%).
> >
> > The last two rows show the benefit of cpu.headroom. With 20% headroom,
> > the latency is 1.13; while with 30% headroom, the latency is 1.08.
> >
> > We can also see a clear correlation between latency and global idle CPU:
> > more idle CPU yields lower latency.
> >
> > Overall, these results show that cpu.headroom provides an effective
> > mechanism to control the latency impact of side workloads. Other knobs
> > could also help the latency, but they are not as effective and flexible
> > as cpu.headroom.
> >
> > Does this analysis address your concern?
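
As a rough check of the correlation described above, here is a minimal
Python sketch that recomputes the latency overhead and a Pearson
coefficient from the four ffmpeg rows of the table. The numbers are copied
from the table; the function and variable names are made up for
illustration and are not part of the patch set or of any tool used for
the measurements.

```python
from math import sqrt

# (label, cpu-idle %, normalized main latency) for the four ffmpeg rows.
# Values are copied from the table in this thread; nothing is measured here.
FFMPEG_ROWS = [
    ("headroom 0, cpu.weight 1", 3.38, 1.46),
    ("headroom 0, SCHED_IDLE", 5.69, 1.42),
    ("headroom 20%, SCHED_IDLE", 19.00, 1.13),
    ("headroom 30%, SCHED_IDLE", 27.60, 1.08),
]


def latency_overhead_percent(normalized_latency):
    """Overhead relative to the baseline row, whose latency is 1.00."""
    return round((normalized_latency - 1.0) * 100.0, 1)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


idle_pct = [row[1] for row in FFMPEG_ROWS]
latency = [row[2] for row in FFMPEG_ROWS]
r = pearson(idle_pct, latency)

for label, idle, lat in FFMPEG_ROWS:
    print(f"{label:26s} idle={idle:5.2f}%  overhead={latency_overhead_percent(lat)}%")
print(f"Pearson r (idle vs latency) = {r:.3f}")
```

With only four data points this is an illustration of the trend, not a
statistical result, and it says nothing about whether the gain comes from
less contention or from less waiting for a CPU.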

So, your results show that the sched_idle class doesn't provide the
intended behavior because it still delays the scheduling of sched_other
tasks. In fact, the wakeup path of the scheduler doesn't make any
distinction between a cpu running a sched_other task and a cpu running a
sched_idle task when looking for the idlest cpu, and this can create
contention between sched_other tasks while another cpu is running a
sched_idle task.
Viresh (cc'ed on this email) is working on improving this behavior at
wake up and has sent a patch related to the subject:
https://lkml.org/lkml/2019/4/25/251
I'm curious if this would improve the results.
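
To make the wakeup point above concrete, here is a toy Python model. It is
only a sketch under strong assumptions: it counts runnable tasks per cpu
instead of using the real load tracking, it is not kernel code, and it is
not taken from the patch linked above. All names (Cpu, pick_cpu_policy_blind,
pick_cpu_idle_aware) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    cpu_id: int
    # Scheduling policies of the tasks currently runnable on this cpu.
    policies: list = field(default_factory=list)


def pick_cpu_policy_blind(cpus):
    """Pick the cpu with the fewest runnable tasks, ignoring their policy.

    Toy stand-in for a wakeup path that treats a cpu running a sched_idle
    task the same as a cpu running a sched_other task.
    """
    return min(cpus, key=lambda c: len(c.policies)).cpu_id


def pick_cpu_idle_aware(cpus):
    """Pick the cpu with the fewest non-SCHED_IDLE tasks.

    Hypothetical alternative: a cpu running only sched_idle tasks looks
    as attractive as an idle cpu to a waking sched_other task.
    """
    return min(
        cpus,
        key=lambda c: sum(p != "SCHED_IDLE" for p in c.policies),
    ).cpu_id


cpus = [
    Cpu(0, ["SCHED_OTHER"]),  # main workload task
    Cpu(1, ["SCHED_IDLE"]),   # side workload task
    Cpu(2, ["SCHED_OTHER"]),  # main workload task
]

print("policy-blind choice:", pick_cpu_policy_blind(cpus))
print("idle-aware choice:  ", pick_cpu_idle_aware(cpus))
```

In this toy setup the policy-blind choice can stack the waking task behind
another sched_other task even though cpu 1 is only running side work,
which is the kind of contention I mean.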

Regards,
Vincent

> >
> > Thanks,
> > Song
> >
>
> Could you please share your comments and suggestions on this work? Did
> the results address your questions/concerns?
>
> Thanks again,
> Song
>
> >>
> >>>
> >>> Thanks,
> >>> Song
> >>>
> >>>>
> >>>> Morten
>