Re: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms

From: Aaron Lu
Date: Tue Aug 01 2023 - 03:24:25 EST


On Wed, Jul 26, 2023 at 02:56:19PM -0400, Mathieu Desnoyers wrote:

... ...

> The updated patch:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..1c7d5bd2968b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  
> +	if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
> +	    sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
> +		return 1;
> +
>  	if (rq->curr != rq->idle)
>  		return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 81ac605b9cd5..57a49a5524f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -97,6 +97,9 @@
>  # define SCHED_WARN_ON(x)	({ (void)(x), 0; })
>  #endif
>  
> +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */
> +#define IDLE_CPU_DELAY_MAX_RUNNING 4
> +
>  struct rq;
>  struct cpuidle_state;
>

I gave this patch a run on an Intel SPR machine (2 sockets/112 cores/224 cpus)
and I also noticed a huge improvement when running hackbench, especially
for the group=32/fds=20 case:

when group=10/fds=20 (400 tasks):
             time   wakeups/migrations    tg->load_avg%
base:         43s   27874246/13953871     25%
this patch:   32s   33200766/244457        2%
my patch:     37s   29186608/16307254      2%

when group=20/fds=20 (800 tasks):
             time   wakeups/migrations    tg->load_avg%
base:         65s   27108751/16238701     27%
this patch:   45s   35718552/1691220       3%
my patch:     48s   37506974/24797284      2%

when group=32/fds=20 (1280 tasks):
             time   wakeups/migrations    tg->load_avg%
base:        150s   36902527/16423914     36%
this patch:   57s   30536830/6035346       6%
my patch:     73s   45264605/21595791      3%

One thing I noticed is that, after this patch, migrations on the wakeup
path are dramatically reduced (see the wakeups/migrations numbers above;
they were captured for 5s during the run). I think this makes sense
because a cpu is now more likely to be considered idle, so a waking task
is more likely to stay on its prev_cpu. And when migrations are reduced,
the cost of accessing tg->load_avg is also reduced (tg->load_avg% is the
sum of update_cfs_group()% + update_load_avg()% as reported by perf). I
think this is part of the reason why performance improved on this machine.
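
To be concrete about where idle_cpu() enters the picture: the wakeup fast
path keeps the task on prev when prev still reports idle. A paraphrased
sketch of that branch in select_idle_sibling() (kernel/sched/fair.c; not
verbatim, details such as the asym capacity check are omitted and vary by
kernel version):

	/*
	 * available_idle_cpu() is basically idle_cpu() plus a
	 * vcpu_is_preempted() check, so with the patch above
	 * idle_cpu(prev) keeps returning 1 for up to 1ms after prev
	 * went idle and this branch is taken much more often, i.e.
	 * no migration on wakeup.
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
		return prev;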

Since I've been working on reducing the cost of accessing tg->load_avg[1],
I also gave my patch a run. According to the results, even though the cost
of accessing tg->load_avg is smaller with my patch, Mathieu's patch is
still faster. It's not clear to me why; maybe it has something to do
with cache reuse, since my patch doesn't inhibit migration? I suppose IPC
could reflect this?
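
For reference, the tg->load_avg cost mentioned above comes from a single
shared atomic counter per task group that every cpu's cfs_rq folds its
load delta into; paraphrased from update_tg_load_avg() in
kernel/sched/fair.c, not verbatim:

	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/* only bother when the delta is large enough */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		/* shared by all 224 cpus / 2 sockets -> cacheline bouncing */
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}

More migrations mean more cfs_rq load_avg changes and thus more of these
atomic updates, which is why tg->load_avg% tracks the migration numbers
above.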

[1]: https://lore.kernel.org/lkml/20230718134120.81199-1-aaron.lu@xxxxxxxxx/

Thanks,
Aaron