Re: [PATCH] block/blk-mq: Don't complete locally if capacities are different

From: Qais Yousef
Date: Fri Feb 02 2024 - 18:20:19 EST


On 01/22/24 22:42, Qais Yousef wrote:
> The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
> CPUs have equal capacities and only LLC cache can make a different on
> perceived performance. But this assumption falls apart on HMP systems
> where LLC is shared, but the CPUs have different capacities. Staying
> local then can have a big performance impact if the IO request was done
> from a CPU with higher capacity but the interrupt is serviced on a lower
> capacity CPU.
>
> Introduce new cpus_gte_capacity() function to enable do the additional
> check.

As I was preparing to send a new version. I thought it is worth noting here
that I initially had cpus_equal_capacity() to check if the performance on the
two cpus is equal, but then opted to change it to >= so that we favour perf.

But now I am having 2nd thoughts again.

If the interrupt was on a bigger core but the request comes from a little core,
is it better to move it to little core (which can save power) to match the
requester, or better honour the fact that someone has put the interrupt on
a bigger core and maybe they prefer to keep requests there by default?

I am leaning back towards cpus_equal_capacity() so we match the performance
level the requester is running at. irqbalancer can end up shuffling interrupts
potentially and I believe it is better for the scheduler to handle the
performance level the requester should be running at, and then here our job
should be to match that level without making more assumptions.


--
Qais Yousef

>
> Without the patch I see the BLOCK softirq always running on little cores
> (where the hardirq is serviced). With it I can see it running on all
> cores.
>
> This was noticed after the topology change [1] where now on a big.LITTLE
> we truly get that the LLC is shared between all cores where as in the
> past it was being misrepresented for historical reasons. The logic
> exposed a missing dependency on capacities for such systems where there
> can be a big performance difference between the CPUs.
>
> This of course introduced a noticeable change in behavior depending on
> how the topology is presented. Leading to regressions in some workloads
> as the performance of the BLOCK softirq on littles can be noticeably
> worse.
>
> [1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
>
> Signed-off-by: Qais Yousef (Google) <qyousef@xxxxxxxxxxx>
> ---
> block/blk-mq.c | 5 +++--
> include/linux/sched/topology.h | 6 ++++++
> kernel/sched/core.c | 8 ++++++++
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ac18f802c027..9b2d278a7ae7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1163,10 +1163,11 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
> if (force_irqthreads())
> return false;
>
> - /* same CPU or cache domain? Complete locally */
> + /* same CPU or cache domain and capacity? Complete locally */
> if (cpu == rq->mq_ctx->cpu ||
> (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
> - cpus_share_cache(cpu, rq->mq_ctx->cpu)))
> + cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> + cpus_gte_capacity(cpu, rq->mq_ctx->cpu)))
> return false;
>
> /* don't try to IPI to an offline CPU */
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a6e04b4a21d7..31cef5780ba4 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -176,6 +176,7 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
> void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>
> +bool cpus_gte_capacity(int this_cpu, int that_cpu);
> bool cpus_share_cache(int this_cpu, int that_cpu);
> bool cpus_share_resources(int this_cpu, int that_cpu);
>
> @@ -226,6 +227,11 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> {
> }
>
> +static inline bool cpus_gte_capacity(int this_cpu, int that_cpu)
> +{
> + return true;
> +}
> +
> static inline bool cpus_share_cache(int this_cpu, int that_cpu)
> {
> return true;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index db4be4921e7f..db5ab4b3cee7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3954,6 +3954,14 @@ void wake_up_if_idle(int cpu)
> }
> }
>
> +bool cpus_gte_capacity(int this_cpu, int that_cpu)
> +{
> + if (this_cpu == that_cpu)
> + return true;
> +
> + return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu);
> +}
> +
> bool cpus_share_cache(int this_cpu, int that_cpu)
> {
> if (this_cpu == that_cpu)
> --
> 2.34.1
>