Re: [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization

From: Vincent Guittot
Date: Tue Mar 14 2023 - 03:55:47 EST


On Thu, 9 Mar 2023 at 15:22, Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx> wrote:
>
>
>
> >> The CPU cfs bandwidth controller uses an hrtimer. Currently no initial
> >> value is set, so all period timers align at expiry. This happens when
> >> there are multiple CPU cgroups.
> >>
> >> There is a performance gain to be had here if the timers are interleaved,
> >> provided the utilization of each CPU cgroup is low and the total
> >> utilization of all the CPU cgroups is less than 50%. If the timers are
> >> interleaved, the unthrottled cgroup can run freely with fewer context
> >> switches and can also benefit from SMT folding. This effect is further
> >> amplified in an SPLPAR environment.
> >>
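[ To picture why this helps, here is an idealized single-CPU timeline (not
  from the changelog): two groups, each with a 25ms quota per 100ms period,
  where each group burns its whole quota as soon as it is unthrottled.

      aligned timers:   CG1  |XXXXX---------------|XXXXX---------------|
                        CG2  |XXXXX---------------|XXXXX---------------|
                              both runnable at once: contention, busy SMT

      ~50ms offset:     CG1  |XXXXX---------------|XXXXX---------------|
                        CG2  |----------XXXXX-----|----------XXXXX-----|
                              each group mostly runs alone

  With the offset, each group usually finds the CPU and its SMT siblings
  idle when its own period starts. ]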
> >> This commit adds a random offset after initializing each hrtimer. This
> >> interleaves the timers at expiry, which helps achieve the performance
> >> gain described above.
> >>
> >> This was tested on a powerpc platform with 8 cores and SMT=8. Socket power
> >> was measured while running the workload, and stress-ng was benchmarked
> >> along with the power information. Throughput-oriented benchmarks show a
> >> significant gain of up to 25%, while power consumption increases by up
> >> to 15%.
> >>
> >> Workload: stress-ng --cpu=32 --cpu-ops=50000.
> >> 1CG - 1 cgroup is running.
> >> 2CG - 2 cgroups are running together.
> >> Time taken to complete stress-ng is in seconds; power is in watts.
> >> Each cgroup is throttled at 25% with 100ms as the period value.
> >>                  6.2-rc6          |        with patch
> >> 8 core  1CG   power  2CG   power |  1CG   power  2CG   power
> >>         27.5  80.6   40    90    |  27.3  82     32.3  104
> >>         27.5  81     40.2  91    |  27.5  81     38.7  96
> >>         27.7  80     40.1  89    |  27.6  80     29.7  106
> >>         27.7  80.1   40.3  94    |  27.6  80     31.5  105
> >>
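For reference, the throttling setup above can be reproduced with something
like the following (a minimal sketch assuming cgroup v2 mounted at
/sys/fs/cgroup, a 64-CPU machine, and a hypothetical group name CG1; "25%"
is read here as 25% of the whole machine):

    # quota = 25% * 64 CPUs * 100000us period = 1600000us
    mkdir /sys/fs/cgroup/CG1
    echo "1600000 100000" > /sys/fs/cgroup/CG1/cpu.max
    echo $$ > /sys/fs/cgroup/CG1/cgroup.procs
    stress-ng --cpu=32 --cpu-ops=50000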
> >> Latency might be affected by this change, since interleaving the timers
> >> makes it more likely that a CPU enters a deep idle state. schbench was
> >> used to measure the latency. Each cgroup is throttled at 25% with the
> >> period value set to 100ms. The numbers below are from runs where both
> >> cgroups are running simultaneously. Latency values don't degrade much,
> >> and some improvement is seen in the tail latencies.
> >>
> >>            6.2-rc6   with patch
> >> Groups: 16
> >> 50.0th:       39.5         42.5
> >> 75.0th:      924.0        922.0
> >> 90.0th:      972.0        968.0
> >> 95.0th:     1005.5        994.0
> >> 99.0th:     4166.0       2287.0
> >> 99.5th:     7314.0       7448.0
> >> 99.9th:    15024.0      13600.0
> >>
> >> Groups: 32
> >> 50.0th:      819.0        463.0
> >> 75.0th:     1596.0        918.0
> >> 90.0th:     5992.0       1281.5
> >> 95.0th:    13184.0       2765.0
> >> 99.0th:    21792.0      14240.0
> >> 99.5th:    25696.0      18920.0
> >> 99.9th:    33280.0      35776.0
> >>
> >> Groups: 64
> >> 50.0th:     4806.0       3440.0
> >> 75.0th:    31136.0      33664.0
> >> 90.0th:    54144.0      58752.0
> >> 95.0th:    66176.0      67200.0
> >> 99.0th:    84736.0      91520.0
> >> 99.5th:    97408.0     114048.0
> >> 99.9th:   136448.0     140032.0
> >>
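For reference, the schbench numbers above can be gathered with an invocation
along these lines (the exact flags are an assumption; only the group counts
and the 25%/100ms throttling are given in the changelog):

    # 16 message-thread groups, 30s runtime; repeat with -m 32 / -m 64
    schbench -m 16 -r 30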
> >> Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>
> >> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >> Suggested-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> >
> > Reviewed-by: Ben Segall <bsegall@xxxxxxxxxx>

Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>

>
> Ben, thank you very much for the review.
>
> Peter, Ingo, Vincent,
> Could you please provide your inputs? Are there any other benchmarks that
> should be run?
>
>
> Sorry about the subject line; I should have included sched/fair. I realized
> it after the mail was sent, so I sent another mail with only the subject
> line changed. The two are identical except for the subject line.
> https://lore.kernel.org/lkml/20230223185918.1500132-1-sshegde@xxxxxxxxxxxxxxxxxx/
>
> Should I resend the patch with a better changelog and a cover letter
> detailing the test results instead?
>
> >
> >>
> >> Initial RFC PATCH, discussions and details on the problem:
> >> Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@xxxxxxxxxxxxxxxxxx/
> >> Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@xxxxxxxxxxxxxxxxxx/
> >>
> >> ---
> >> kernel/sched/fair.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ff4dbbae3b10..2a4a0969e04f 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5923,6 +5923,10 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> >>  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> >>  	cfs_b->period_timer.function = sched_cfs_period_timer;
> >> +
> >> +	/* Add a random offset so that timers interleave */
> >> +	hrtimer_set_expires(&cfs_b->period_timer,
> >> +			    get_random_u32_below(cfs_b->period));
> >>  	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >>  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
> >>  	cfs_b->slack_started = false;
> >> --
> >> 2.31.1