Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
From: Song Liu
Date: Tue May 14 2019 - 17:00:41 EST

Hi Vincent,
> On May 10, 2019, at 11:22 AM, Song Liu <songliubraving@xxxxxx> wrote:
>
>
>
>> On Apr 30, 2019, at 9:54 AM, Song Liu <songliubraving@xxxxxx> wrote:
>>
>>
>>
>>> On Apr 30, 2019, at 12:20 PM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>>>
>>> Hi Song,
>>>
>>> On Tue, 30 Apr 2019 at 08:11, Song Liu <songliubraving@xxxxxx> wrote:
>>>>
>>>>
>>>>
>>>>> On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Song,
>>>>>
>>>>> On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>
>>>>>> Hi Morten and Vincent,
>>>>>>
>>>>>>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>>
>>>>>>> Hi Vincent,
>>>>>>>
>>>>>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi Morten,
>>>>>>>>>
>>>>>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>>>>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>>>>>>
>>>>>>>>> We think the latency improvements actually come from watering down the
>>>>>>>>> impact of side jobs. It is not just statistically improving average
>>>>>>>>> latency numbers, but also reduces resource contention caused by the side
>>>>>>>>> workload. I don't know whether it is from reducing contention of ALUs,
>>>>>>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>>>>>>>> latencies when headroom is used.
>>>>>>>>>
>>>>>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>>>>>> not throttled the main workload is experiencing the same latency issues
>>>>>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>>>>>> achieve a better average latency. Am I missing something?
>>>>>>>>>>
>>>>>>>>>> Have you looked at your distribution of main job latency and tried to
>>>>>>>>>> compare with when throttling is active/not active?
>>>>>>>>>
>>>>>>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>>>>>>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>>>>>>>>> throttling, so that the side workload gets some runtime in every period.
>>>>>>>>> Therefore, if we look at time window equal to or bigger than 100ms, we
>>>>>>>>> don't really see "throttling active time" vs. "throttling inactive time".
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>>>>>> your use-case or if what you are really after is something which is
>>>>>>>>>> lower priority than just setting the weight to 1. Something that
>>>>>>>>>
>>>>>>>>> The experiments show that cpu.weight works properly for priority: the
>>>>>>>>> main workload gets priority to use the CPU, while the side workload only
>>>>>>>>> fills the idle CPU. However, this is not sufficient, as the side workload
>>>>>>>>> creates enough contention to impact the main workload.
>>>>>>>>>
>>>>>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>>>>>> SCHED_IDLE might not be enough). If your main job consists
>>>>>>>>>> of lots of relatively short wake-ups things like the min_granularity
>>>>>>>>>> could have significant latency impact.
>>>>>>>>>
>>>>>>>>> cpu.headroom gives benefits in addition to optimizations on the
>>>>>>>>> preemption side. By maintaining some idle time, fewer preemptions are
>>>>>>>>> necessary, thus the main workload will get better latency.
>>>>>>>>
>>>>>>>> I agree with Morten's proposal, SCHED_IDLE should help your latency
>>>>>>>> problem because the side job will be preempted directly, unlike a
>>>>>>>> normal cfs task, even one with the lowest priority.
>>>>>>>> In addition to min_granularity, sched_period also has an impact on the
>>>>>>>> time that a task has to wait before preempting the running task. Also,
>>>>>>>> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
>>>>>>>> latency of a task.
>>>>>>>>
>>>>>>>> It would be nice to know if the latency problem comes from contention
>>>>>>>> on cache resources or if it's mainly because your main load waits
>>>>>>>> before running on a CPU.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vincent
>>>>>>>
>>>>>>> Thanks for these suggestions. Here are some more tests to show the impact
>>>>>>> of scheduler knobs and cpu.headroom.
>>>>>>>
>>>>>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>>>>>> -------------------------------------------------------------------------------
>>>>>>> none      | 0            | n/a             | 1 ms     | 45.20%   | 1.00
>>>>>>> ffmpeg    | 0            | 1               | 10 ms    | 3.38%    | 1.46
>>>>>>> ffmpeg    | 0            | SCHED_IDLE      | 1 ms     | 5.69%    | 1.42
>>>>>>> ffmpeg    | 20%          | SCHED_IDLE      | 1 ms     | 19.00%   | 1.13
>>>>>>> ffmpeg    | 30%          | SCHED_IDLE      | 1 ms     | 27.60%   | 1.08
>>>>>>>
>>>>>>> In all these cases, the main workload is loaded with same level of
>>>>>>> traffic (request per second). Main workload latency numbers are normalized
>>>>>>> based on the baseline (first row).
>>>>>>>
>>>>>>> For the baseline, the main workload runs without any side workload, the
>>>>>>> system has about 45.20% idle CPU.
>>>>>>>
>>>>>>> The next two rows compare the impact of scheduling knobs cpu.weight and
>>>>>>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
>>>>>>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
>>>>>>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protect
>>>>>>> the main workload. However, it is not sufficient, as the latency overhead
>>>>>>> is high (>40%).
>>>>>>>
>>>>>>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
>>>>>>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
>>>>>>>
>>>>>>> We can also see a clear correlation between latency and global idle CPU:
>>>>>>> more idle CPU yields lower latency.
>>>>>>>
>>>>>>> Overall, these results show that cpu.headroom provides an effective
>>>>>>> mechanism to control the latency impact of side workloads. Other knobs
>>>>>>> could also help the latency, but they are not as effective and flexible
>>>>>>> as cpu.headroom.
>>>>>>>
>>>>>>> Does this analysis address your concern?
>>>>>
>>>>> So, your results show that the sched_idle class doesn't provide the
>>>>> intended behavior because it still delays the scheduling of sched_other
>>>>> tasks. In fact, the wakeup path of the scheduler doesn't make any
>>>>> difference between a cpu running a sched_other task and a cpu running a
>>>>> sched_idle task when looking for the idlest cpu, and it can create some
>>>>> contention between sched_other tasks while another cpu runs a sched_idle
>>>>> task.
>>>>
>>>> I don't think scheduling delay is the only (or dominating) factor of
>>>> extra latency. Here are some data to show it.
>>>>
>>>> I measured IPC (instructions per cycle) of the main workload under
>>>> different scenarios:
>>>>
>>>> side-load | cpu.headroom | side/cpu.weight | IPC
>>>> ------------------------------------------------
>>>> none      | 0%           | N/A             | 0.66
>>>> ffmpeg    | 0%           | SCHED_IDLE      | 0.53
>>>> ffmpeg    | 20%          | SCHED_IDLE      | 0.58
>>>> ffmpeg    | 30%          | SCHED_IDLE      | 0.62
>>>>
>>>> These data show that the side workload has a negative impact on the
>>>> main workload's IPC. And cpu.headroom could help reduce this impact.
>>>>
>>>> Therefore, while optimizations in the wakeup path should help the
>>>> latency, cpu.headroom would add _significant_ benefit on top of that.
>>>
>>> It seems normal that the side workload has a negative impact on IPC
>>> because of resource sharing, but your previous results showed a 42%
>>> regression of latency with sched_idle, which can't be linked only to
>>> resource access contention.
>>
>> Agreed. I think both scheduling latency and resource contention
>> contribute noticeable latency overhead to the main workload. The
>> scheduler optimization by Viresh would help reduce the scheduling
>> latency, but it won't help the resource contention. Hopefully, with
>> optimizations in the scheduler, we can meet the latency target with
>> smaller cpu.headroom. However, I don't think scheduler optimizations
>> will eliminate the need of cpu.headroom, as the resource contention
>> always exists, and the impact could be significant.
>>
>> Do you have further concerns with this patchset?
>>
>> Thanks,
>> Song
>
> Here are some more results with both Viresh's patch and the cpu.headroom
> set. In these tests, the side job runs with SCHED_IDLE, so we get benefit
> of Viresh's patch.
>
> We collected another metric here, the average "cpu time" used by the requests.
> We also present "wall time" and "wall - cpu" time. "wall time" is the
> same as "latency" in previous results. Basically, "wall time" includes cpu
> time, scheduling latency, and time spent waiting for data (from database,
> memcache, etc.). We don't have good data that separates scheduling latency
> and time spent waiting for data, so we present "wall - cpu" time, which is
> the sum of the two. Time spent waiting for data should not change in these
> tests, so changes in "wall - cpu" mostly come from scheduling latency.
> All the latency numbers are normalized based on the "wall time" of the
> first row.
>
> side job | cpu.headroom | cpu-idle | wall time | cpu time | wall - cpu
> ----------------------------------------------------------------------
> none     | n/a          | 42.4%    | 1.00      | 0.31     | 0.69
> ffmpeg   | 0            | 10.8%    | 1.17      | 0.38     | 0.79
> ffmpeg   | 25%          | 22.8%    | 1.08      | 0.35     | 0.73
>
> From these results, we can see that Viresh's patch reduces the latency
> overhead of the side job, from 42% (in previous results) to 17%. And
> a 25% cpu.headroom further reduces the latency overhead to 8%.
> cpu.headroom reduces time spent in "cpu time" and "wall - cpu" time,
> which means cpu.headroom yields better IPC and lower scheduling latency.
>
> I think these data demonstrate that
>
> 1. Viresh's work is helpful in reducing scheduling latency introduced
> by SCHED_IDLE side jobs.
> 2. cpu.headroom work provides a mechanism to further reduce scheduling
> latency on top of Viresh's work.
>
> Therefore, the combination of the two efforts would give us mechanisms to
> control the latency overhead of side workloads.
>
> @Vincent, do these data and analysis make sense from your point of view?

Do you have further questions or concerns about this set?

As the data show, scheduling latency is not the only source of high
latency here. In fact, with hyper-threading and other shared system
resources (cache, memory, etc.), the side workload will always negatively
impact the latency of the main workload. Scheduler optimizations alone
cannot eliminate this impact. cpu.headroom, on the other hand, provides
a mechanism to limit it.
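
To make that split concrete, here is a minimal Python sketch that breaks
the extra wall time from the last table quoted above into its "cpu time"
and "wall - cpu" parts. The numbers are copied from that table; the
function and key names are made up for illustration only.

```python
# Normalized per-request times from the quoted table:
# configuration -> (wall time, cpu time)
rows = {
    "none": (1.00, 0.31),
    "ffmpeg, headroom 0": (1.17, 0.38),
    "ffmpeg, headroom 25%": (1.08, 0.35),
}


def overhead_breakdown(rows, baseline="none"):
    """Return the extra wall, cpu, and wall - cpu time versus the baseline."""
    base_wall, base_cpu = rows[baseline]
    base_wait = base_wall - base_cpu
    result = {}
    for name, (wall, cpu) in rows.items():
        if name == baseline:
            continue
        wait = wall - cpu
        result[name] = {
            "wall": round(wall - base_wall, 2),
            "cpu": round(cpu - base_cpu, 2),
            "wall_minus_cpu": round(wait - base_wait, 2),
        }
    return result


breakdown = overhead_breakdown(rows)
for name, parts in breakdown.items():
    print(name, parts)
```

With no headroom, 0.07 of the 0.17 extra wall time is additional cpu
time and the remaining 0.10 is "wall - cpu"; with 25% headroom both
parts drop to 0.04.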

Optimization and protection are two sides of the problem. While we
spend a lot of time optimizing the workload (which is why Viresh's work
is really interesting to us), cpu.headroom works on the protection side.
There are multiple reasons behind the high latencies, and cpu.headroom
provides universal protection against all of them.
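
One of those reasons is lower IPC. As a rough sketch, the IPC numbers
from the earlier table in this thread can be turned into a relative loss
versus running the main workload alone. The dictionary keys and the
function name below are illustrative, not part of the patch set.

```python
# IPC of the main workload, copied from the IPC table quoted earlier.
ipc = {
    "none": 0.66,
    "ffmpeg, headroom 0%": 0.53,
    "ffmpeg, headroom 20%": 0.58,
    "ffmpeg, headroom 30%": 0.62,
}


def ipc_loss_percent(ipc, baseline="none"):
    """Return the IPC drop relative to the baseline, in percent."""
    base = ipc[baseline]
    return {
        name: round(100.0 * (base - value) / base, 1)
        for name, value in ipc.items()
        if name != baseline
    }


loss = ipc_loss_percent(ipc)
for name, percent in loss.items():
    print(f"{name}: {percent}% lower IPC")
```

That works out to roughly a 19.7% IPC loss with no headroom, 12.1% with
20% headroom, and 6.1% with 30% headroom.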

With the protection of cpu.headroom, we can actually do optimizations
more efficiently, as we can safely start with a high headroom and
then try to lower it.
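
A minimal sketch of that "start high, then lower" approach, assuming we
already have a latency measurement for each headroom setting we are
willing to try. The measurements below are the SCHED_IDLE rows from the
first table in this thread (taken before Viresh's patch); the function
name and the latency targets are hypothetical.

```python
# cpu.headroom (percent) -> normalized main workload latency,
# from the SCHED_IDLE rows of the first table in this thread.
measured = {30: 1.08, 20: 1.13, 0: 1.42}


def lowest_safe_headroom(measured, target):
    """Walk from the highest headroom down and return the lowest setting
    that still meets the latency target, or None if even the highest
    measured headroom misses it."""
    chosen = None
    for headroom in sorted(measured, reverse=True):
        if measured[headroom] > target:
            break  # lowering the headroom further violates the target
        chosen = headroom
    return chosen


print(lowest_safe_headroom(measured, target=1.15))
```

With a hypothetical target of 1.15 this picks 20% headroom; tightening
the target to 1.10 keeps it at 30%.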

Please let me know your thoughts on this.

Thanks,
Song