Re: [PATCH] sched/fair: no sync wakeup from interrupt context

From: Libo Chen
Date: Wed Jul 13 2022 - 15:35:28 EST


linux-kernel@xxxxxxxxxxxxxxx complains about html subpart... I am resending it in plain-text only
so everybody on the mailing list can see.

On 7/13/22 12:17, Libo Chen wrote:


On 7/13/22 09:40, Tim Chen wrote:
On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
Barry Song first pointed out that replacing sync wakeup with regular wakeup
seems to reduce overeager wakeup pulling and shows noticeable performance
improvement.[1]

This patch argues that allowing sync for wakeups from interrupt context
is a bug and fixing it can improve performance even when irq/softirq is
evenly spread out.

For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
waker can be any random task that happens to be running on that CPU when the
interrupt comes in. This is completely different from other wakups where the
task running on the waking CPU is the actual waker. For example, two tasks
communicate through a pipe or mutiple tasks access the same critical section,
etc. This difference is important because with sync we assume the waker will
get off the runqueue and go to sleep immedately after the wakeup. The
assumption is built into wake_affine() where it discounts the waker's presence
from the runqueue when sync is true. The random waker from interrupts bears no
relation to the wakee and don't usually go to sleep immediately afterwards
unless wakeup granularity is reached. Plus the scheduler no longer enforces the
preepmtion of waker for sync wakeup as it used to before
patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
wakeup preemption for wakeups from interrupt contexts doesn't seem to be
appropriate too but at least sync wakeup will do what it's supposed to do.
Will there be scenarios where you do want the task being woken up be pulled
to the CPU where the interrupt happened, as the data that needs to be accessed is
on local CPU/NUMA that interrupt happened? For example, interrupt associated with network
packets received. Sync still seems desirable, at least if there is no task currently
running on the CPU where interrupt happened. So maybe we should have some consideration
of the load on the CPU/NUMA before deciding whether we should do sync wake for such
interrupt.

There are only two places where sync wakeup matters: wake_affine_idle() and wake_affine_weight().
In wake_affine_idle(), it considers pulling if there is one runnable on the waking CPU because
of the belief that this runnable will voluntarily get off the runqueue. In wake_affine_weight(),
it basically takes off the waker's load again assuming the waker goes to sleep after the wakeup.
My argument is that this assumption doesn't really hold for wakeups from the interrupt contexts
when the waking CPU is non-idle. Wakeups from task context? sure, it seems to be a reasonable
assumption. For your idle case, I totally agree but I don't think having sync or not will actually
have any impacts here giving what the code does. Real impact comes fromMel's patch 7332dec055f2457c3
which makes it less likely to pull tasks when the waking CPU is idle. I believe we should consider
reverting 7332dec055f2 because a significant RDS latency regression has been spotted recently on our
system due to this patch.

Add a check to make sure that sync can only be set when in_task() is true. If
a wakeup is from interrupt context, sync flag will be 0 because in_task()
returns 0.

Tested in two scenarios: wakeups from 1) task contexts, expected to see no
performance changes; 2) interrupt contexts, expected to see better performance
under low/medium load and no regression under heavy load.

Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests on
a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel
with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance
daemon is active.

Hackbench (config-scheduler-unbound)
=========
process-pipes:
Baseline Patched
Amean 1 0.4300 ( 0.00%) 0.4420 ( -2.79%)
Amean 4 0.8757 ( 0.00%) 0.8774 ( -0.20%)
Amean 7 1.3712 ( 0.00%) 1.3789 ( -0.56%)
Amean 12 2.3541 ( 0.00%) 2.3714 ( -0.73%)
Amean 21 4.2229 ( 0.00%) 4.2439 ( -0.50%)
Amean 30 5.9113 ( 0.00%) 5.9451 ( -0.57%)
Amean 48 9.3873 ( 0.00%) 9.4898 ( -1.09%)
Amean 79 15.9281 ( 0.00%) 16.1385 ( -1.32%)
Amean 110 22.0961 ( 0.00%) 22.3433 ( -1.12%)
Amean 141 28.2973 ( 0.00%) 28.6209 ( -1.14%)
Amean 172 34.4709 ( 0.00%) 34.9347 ( -1.35%)
Amean 203 40.7621 ( 0.00%) 41.2497 ( -1.20%)
Amean 234 47.0416 ( 0.00%) 47.6470 ( -1.29%)
Amean 265 53.3048 ( 0.00%) 54.1625 ( -1.61%)
Amean 288 58.0595 ( 0.00%) 58.8096 ( -1.29%)

process-sockets:
Baseline Patched
Amean 1 0.6776 ( 0.00%) 0.6611 ( 2.43%)
Amean 4 2.6183 ( 0.00%) 2.5769 ( 1.58%)
Amean 7 4.5662 ( 0.00%) 4.4801 ( 1.89%)
Amean 12 7.7638 ( 0.00%) 7.6201 ( 1.85%)
Amean 21 13.5335 ( 0.00%) 13.2915 ( 1.79%)
Amean 30 19.3369 ( 0.00%) 18.9811 ( 1.84%)
Amean 48 31.0724 ( 0.00%) 30.6015 ( 1.52%)
Amean 79 51.1881 ( 0.00%) 50.4251 ( 1.49%)
Amean 110 71.3399 ( 0.00%) 70.4578 ( 1.24%)
Amean 141 91.4675 ( 0.00%) 90.3769 ( 1.19%)
Amean 172 111.7463 ( 0.00%) 110.3947 ( 1.21%)
Amean 203 131.6927 ( 0.00%) 130.3270 ( 1.04%)
Amean 234 151.7459 ( 0.00%) 150.1320 ( 1.06%)
Amean 265 171.9101 ( 0.00%) 169.9751 ( 1.13%)
Amean 288 186.9231 ( 0.00%) 184.7706 ( 1.15%)

thread-pipes:
Baseline Patched
Amean 1 0.4523 ( 0.00%) 0.4535 ( -0.28%)
Amean 4 0.9041 ( 0.00%) 0.9085 ( -0.48%)
Amean 7 1.4111 ( 0.00%) 1.4146 ( -0.25%)
Amean 12 2.3532 ( 0.00%) 2.3688 ( -0.66%)
Amean 21 4.1550 ( 0.00%) 4.1701 ( -0.36%)
Amean 30 6.1043 ( 0.00%) 6.2391 ( -2.21%)
Amean 48 10.2077 ( 0.00%) 10.3511 ( -1.40%)
Amean 79 16.7922 ( 0.00%) 17.0427 ( -1.49%)
Amean 110 23.3350 ( 0.00%) 23.6522 ( -1.36%)
Amean 141 29.6458 ( 0.00%) 30.2617 ( -2.08%)
Amean 172 35.8649 ( 0.00%) 36.4225 ( -1.55%)
Amean 203 42.4477 ( 0.00%) 42.8332 ( -0.91%)
Amean 234 48.7117 ( 0.00%) 49.4042 ( -1.42%)
Amean 265 54.9171 ( 0.00%) 55.6551 ( -1.34%)
Amean 288 59.5282 ( 0.00%) 60.2903 ( -1.28%)

thread-sockets:
Baseline Patched
Amean 1 0.6917 ( 0.00%) 0.6892 ( 0.37%)
Amean 4 2.6651 ( 0.00%) 2.6017 ( 2.38%)
Amean 7 4.6734 ( 0.00%) 4.5637 ( 2.35%)
Amean 12 8.0156 ( 0.00%) 7.8079 ( 2.59%)
Amean 21 14.0451 ( 0.00%) 13.6679 ( 2.69%)
Amean 30 20.0963 ( 0.00%) 19.5657 ( 2.64%)
Amean 48 32.4115 ( 0.00%) 31.6001 ( 2.50%)
Amean 79 53.1104 ( 0.00%) 51.8395 ( 2.39%)
Amean 110 74.0929 ( 0.00%) 72.4391 ( 2.23%)
Amean 141 95.1506 ( 0.00%) 93.0992 ( 2.16%)
Amean 172 116.1969 ( 0.00%) 113.8307 ( 2.04%)
Amean 203 137.4413 ( 0.00%) 134.5247 ( 2.12%)
Amean 234 158.5395 ( 0.00%) 155.2793 ( 2.06%)
Amean 265 179.7729 ( 0.00%) 176.1099 ( 2.04%)
Amean 288 195.5573 ( 0.00%) 191.3977 ( 2.13%)

Pgbench (config-db-pgbench-timed-ro-small)
=======
Baseline Patched
Hmean 1 68.54 ( 0.00%) 69.72 ( 1.71%)
Hmean 6 27725.78 ( 0.00%) 34119.11 ( 23.06%)
Hmean 12 55724.26 ( 0.00%) 63158.22 ( 13.34%)
Hmean 22 72806.26 ( 0.00%) 73389.98 ( 0.80%)
Hmean 30 79000.45 ( 0.00%) 75197.02 ( -4.81%)
Hmean 48 78180.16 ( 0.00%) 75074.09 ( -3.97%)
Hmean 80 75001.93 ( 0.00%) 70590.72 ( -5.88%)
Hmean 110 74812.25 ( 0.00%) 74128.57 ( -0.91%)
Hmean 142 74261.05 ( 0.00%) 72910.48 ( -1.82%)
Hmean 144 75375.35 ( 0.00%) 71295.72 ( -5.41%)

For hackbench, +-3% fluctuation is normal on this two-socket box, it's safe to
conclude that there are no performance differences for wakeups from task context.
For pgbench, after many runs, 10~30% gains are very consistent at lower
client counts (< #cores per socket).
Can you provide some further insights on why pgebench is helped at low load
case? Is it because the woken tasks tend to stay put and not get moved around with interrupts
and maintain cache warmth?
Yes, and for read-only database workloads, the cache (whether it's incoming packet or not) on the interrupt
CPU isn't as performance critical as cache from its previous CPU where the db task run to process data.
To give you an example, consider a db client/server case, a client sends a request for a select query
through the network, the server accepts the query request and does all the heavy lifting and sends the result
back. For the server, the incoming packet is just a line of query whereas the CPU and its L3 db process previously
on has all the warm db caches, pulling it away from them is a crime :)  This may seem to be a little contradictory
to what I said earlier about the idle case and Mel's patch, but ¯\_(ツ)_/¯ it's hard to make all the workloads out
there happy. Anyway like I said earlier, this patch doesn't affect the idle case

At higher load, sync in wake_affine_idle() doesn't really matter because the waking CPU could easily have more than
1 runnable tasks. Sync in wake_affine_weight() also doesn't matter much as both sides have work to do, and cache
benefit of not pulling decreases simply because there are a lot more db processes under the same L3, they can compete
for the same cachelines.

Hope my explanation helps!

Libo
Tim

For higher client counts, both kernels are
very close, +-5% swings are quite common. Also NET_TX|RX softirq load
does spread out across both NUMA nodes in this test so there is no need to do
any explicit RPS/RFS.

[1]:https://urldefense.com/v3/__https://lkml.org/lkml/2021/11/5/234__;!!ACWV5N9M2RV99hQ!PTDVWQJ1pb2KldrRnXA3EJ2CPKyHAolke1QRjVLadEl2ctd3xYxTo5uZ5Dp91L3ExImjbrrqbNM27uEA-E9do7LM$
Signed-off-by: Libo Chen<libo.chen@xxxxxxxxxx>
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..59b210d2cdb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
- int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ /* Don't set sync for wakeup from irq/soft ctx */
+ int sync = in_task() && (wake_flags & WF_SYNC)
+ && !(current->flags & PF_EXITING);
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
--
2.31.1