Re: [RFC PATCH] sched: Queue task on wakelist in the same llc if the wakee cpu is idle

From: Tianchen Ding
Date: Fri May 13 2022 - 03:05:37 EST


On 2022/5/13 14:37, Peter Zijlstra wrote:
On Fri, May 13, 2022 at 02:24:27PM +0800, Tianchen Ding wrote:
We notice the commit 518cd6234178 ("sched: Only queue remote wakeups
when crossing cache boundaries") disabled queuing tasks on wakelist when
the cpus share llc. This is because, at that time, the scheduler must
send IPIs to do ttwu_queue_wakelist.

No; this was because of cache bouncing.

As I understand it, avoiding cache bouncing is the reason to do queue_wakelist across llc. The same reasoning can apply to doing queue_wakelist within the same llc now: it should be better for the wakee cpu to handle its own rq. Would there be any other side effects?


Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is not a problem now when the wakee cpu is
in idle polling.

Benefits:
Queuing the task on an idle cpu can improve performance on the waker cpu
and utilization on the wakee cpu, and further improve locality because
the wakee cpu can handle its own rq. This patch helps improve rt on
our real java workloads where wakeups happen frequently.

Does this patch bring IPI flooding?
For archs with TIF_POLLING_NRFLAG (e.g., x86), there will be no
difference if the wakee cpu is idle polling. If the wakee cpu is idle
but not polling, the later check_preempt_curr() will send IPI too.

For archs without TIF_POLLING_NRFLAG (e.g., arm64), the IPI is
unavoidable, since the later check_preempt_curr() will send IPI when
wakee cpu is idle.
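
The cases above can be written down as a small toy model. This is only a user-space sketch of the argument, not kernel code: the struct and the toy_* functions are hypothetical names, and the model assumes (as argued above) that an idle wakee is kicked by an IPI unless it is idle polling.

```c
#include <stdbool.h>

/*
 * Toy user-space model of the idle-wakee cases, not kernel code.
 * All names here are hypothetical.
 */
struct toy_wakee {
	bool arch_has_polling_flag; /* arch defines TIF_POLLING_NRFLAG (e.g., x86) */
	bool idle_polling;          /* idle wakee is polling instead of sleeping */
};

/* An IPI can be skipped only when the idle wakee is polling. */
static bool toy_idle_wakee_is_polling(struct toy_wakee w)
{
	return w.arch_has_polling_flag && w.idle_polling;
}

/* Without the patch: enqueue directly, then check_preempt_curr() kicks the idle wakee. */
static bool toy_ipi_without_patch(struct toy_wakee w)
{
	return !toy_idle_wakee_is_polling(w);
}

/* With the patch: the task is queued on the wakelist of the idle wakee. */
static bool toy_ipi_with_patch(struct toy_wakee w)
{
	return !toy_idle_wakee_is_polling(w);
}
```

Under these assumptions both paths need an IPI in exactly the same cases, which is the sense in which the patch should not add IPIs for an idle wakee.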

Benchmark:
running schbench -m 2 -t 8 on 8269CY:

without patch:
Latency percentiles (usec)
50.0000th: 10
75.0000th: 14
90.0000th: 16
95.0000th: 16
*99.0000th: 17
99.5000th: 20
99.9000th: 23
min=0, max=28

with patch:
Latency percentiles (usec)
50.0000th: 6
75.0000th: 8
90.0000th: 9
95.0000th: 9
*99.0000th: 10
99.5000th: 10
99.9000th: 14
min=0, max=16

We've also tested unixbench and see about 10% improvement on Pipe-based
Context Switching, and no performance regression on other test cases.

For arm64, we've tested schbench and unixbench on Kunpeng920, the
results show that,

What is a kunpeng and how does its topology look?

It's an arm64 processor produced by Huawei. Its topology has NUMA nodes and clusters. See the commit log of c5e22feffdd7 ("topology: Represent clusters of CPUs within a die") for details.
In fact I also tried to test on Ampere, but there may be something wrong with my machine: the kernel only gets cache info up to L2. (This means each cpu has a different sd_llc_id, so the patch takes no effect.) :-(
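
To illustrate why the patch is a no-op there, here is a toy user-space sketch (not kernel code; the toy_* names and the plain array of llc ids are hypothetical). It assumes the "same llc" test boils down to comparing the sd_llc_id of the two cpus:

```c
#include <stdbool.h>

/* Toy model: llc_id[cpu] stands in for the per-cpu sd_llc_id. */
static bool toy_cpus_share_cache(const int *llc_id, int a, int b)
{
	return llc_id[a] == llc_id[b];
}

/* Count pairs of distinct cpus that the toy model treats as sharing an llc. */
static int toy_count_sharing_pairs(const int *llc_id, int ncpus)
{
	int pairs = 0;

	for (int a = 0; a < ncpus; a++)
		for (int b = a + 1; b < ncpus; b++)
			if (toy_cpus_share_cache(llc_id, a, b))
				pairs++;
	return pairs;
}
```

With one id shared by all cpus every pair qualifies, but when each cpu has its own id (as on my Ampere machine) no pair of distinct cpus does, so the new path can never be taken.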


the improvement is not as obvious as on x86, and
there's no performance regression.

x86 is wide and varied; what x86 did you test?

I've tested on Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz. Do you need more info on other machines?