Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Mel Gorman
Date: Fri May 15 2020 - 04:47:49 EST


On Thu, May 14, 2020 at 05:31:22PM +0200, Peter Zijlstra wrote:
> On Wed, May 13, 2020 at 04:30:23PM +0100, Mel Gorman wrote:
> > Complete shot in the dark but restore adjust_numa_imbalance() and try
> > this
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..0b31f4468d5b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> > struct rq_flags rf;
> >
> > #if defined(CONFIG_SMP)
> > - if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > + if (sched_feat(TTWU_QUEUE)) {
>
> just saying that this has the risk of regressing other workloads, see:
>
> 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
>

Sure, I didn't really think that this was appropriate but it was the best I
had at the time. This can be a lot more targetted but it took me a while
to put together a picture as it was difficult for me to analyse. I was
going to post this as a seperate RFC but given this thread, what do you
think of this?

---8<---
sched: Wake cache-local tasks via wake_list if wakee CPU is polling

There are two separate method for waking a task from idle, one for
tasks running on CPUs that share a cache and one for when the caches are
separate. The methods can loosely be called local and remote even though
this is not directly related to NUMA and instead is due to the expected
cost of accessing data that is cache-hot on another CPU. The "local" costs
are expected to be relatively cheap as they share the LLC in comparison to
a remote IPI that is potentially required when using the "remote" wakeup.
The problem is that the local wakeup is not always cheaper and it appears
to have degraded even further around the 4.19 mark.

There appears to be a couple of reasons why it can be slower.

The first is specific to pairs of tasks where one or both rapidly enters
idle. For example, on netperf UDP_STREAM, the client sends a bunch of
packets, wakes the server, the server processes some packets and goes
back to sleep. There is a relationship between the tasks but it's not
strictly synchronous. The timing is different if the client/server are on
separate NUMA nodes and netserver is more likely to enter idle (measured
as server entering idle 10% more times when tasks are local vs remote
but machine-specific). However, the wakeups are so rapid that the wakeup
happens while the server is descheduling. That forces the waker to spin
on smp_cond_load_acquire for longer. In this case, it can be cheaper to
add the task to the rq->wake_list even if that potentially requires an IPI.

The second is that the local wakeup path is simply not always
that fast. Using ftrace, the cost of the locks, update_rq_clock and
ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
a single instance, the cost of the "remote" wakeup for try_to_wake_up
was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
wakeup. When there are tens of thousands of wakeups per second, these costs
accumulate and manifest as a throughput regression in netperf UDP_STREAM.

The essential difference in costs comes down to whether the CPU is fully
idle, a task is descheduling or polling in poll_idle(). This patch
special cases ttwu_queue() to use the "remote" method if the CPUs
task is polling as it's generally cheaper to use the wake_list in that
case than the local method because an IPI should not be required. As it is
race-prone, a reschedule IPI may still be sent but in that case the local
wakeup would also have to send a reschedule IPI so it should be harmless.

The benefit is not universal as it requires the specific pattern of wakeups
that happen when the wakee is likely to be descheduling. netperf UDP_STREAM
is one of the more obvious examples and it is more obvious since the
load balancer was reconciled and the client and server are more likely to
be running on the same NUMA node which is why it was missed for so long.

netperf-udp
5.7.0-rc5 5.7.0-rc5
vanilla wakelist-v1r1
Hmean send-64 211.25 ( 0.00%) 228.61 * 8.22%*
Hmean send-128 413.49 ( 0.00%) 458.00 * 10.77%*
Hmean send-256 786.31 ( 0.00%) 879.30 * 11.83%*
Hmean send-1024 3055.75 ( 0.00%) 3429.45 * 12.23%*
Hmean send-2048 6052.79 ( 0.00%) 6878.99 * 13.65%*
Hmean send-3312 9208.09 ( 0.00%) 10832.19 * 17.64%*
Hmean send-4096 11268.45 ( 0.00%) 12968.39 * 15.09%*
Hmean send-8192 17943.12 ( 0.00%) 19886.07 * 10.83%*
Hmean send-16384 29732.94 ( 0.00%) 32297.44 * 8.63%*
Hmean recv-64 211.25 ( 0.00%) 228.61 * 8.22%*
Hmean recv-128 413.49 ( 0.00%) 458.00 * 10.77%*
Hmean recv-256 786.31 ( 0.00%) 879.30 * 11.83%*
Hmean recv-1024 3055.75 ( 0.00%) 3429.45 * 12.23%*
Hmean recv-2048 6052.79 ( 0.00%) 6878.99 * 13.65%*
Hmean recv-3312 9208.09 ( 0.00%) 10832.19 * 17.64%*
Hmean recv-4096 11268.45 ( 0.00%) 12968.39 * 15.09%*
Hmean recv-8192 17943.12 ( 0.00%) 19886.06 * 10.83%*
Hmean recv-16384 29731.75 ( 0.00%) 32296.08 * 8.62%*

It's less obvious on something like TCP_STREAM as there is a stricter
relationship between the client and server but with the patch, it's
much less variable

netperf-tcp
5.7.0-rc5 5.7.0-rc5
vanilla wakelist-v1r1
Hmean 64 769.28 ( 0.00%) 779.96 * 1.39%*
Hmean 128 1497.54 ( 0.00%) 1509.33 * 0.79%*
Hmean 256 2806.00 ( 0.00%) 2787.59 ( -0.66%)
Hmean 1024 9700.80 ( 0.00%) 9720.02 ( 0.20%)
Hmean 2048 17071.28 ( 0.00%) 16968.92 ( -0.60%)
Hmean 3312 22976.20 ( 0.00%) 23012.80 ( 0.16%)
Hmean 4096 25885.74 ( 0.00%) 26152.49 * 1.03%*
Hmean 8192 34014.83 ( 0.00%) 33878.67 ( -0.40%)
Hmean 16384 39482.14 ( 0.00%) 40307.22 ( 2.09%)
Stddev 64 2.60 ( 0.00%) 0.17 ( 93.31%)
Stddev 128 3.69 ( 0.00%) 1.52 ( 58.97%)
Stddev 256 31.06 ( 0.00%) 14.76 ( 52.47%)
Stddev 1024 96.50 ( 0.00%) 46.43 ( 51.89%)
Stddev 2048 115.98 ( 0.00%) 62.59 ( 46.03%)
Stddev 3312 293.28 ( 0.00%) 41.92 ( 85.71%)
Stddev 4096 173.45 ( 0.00%) 123.19 ( 28.98%)
Stddev 8192 783.59 ( 0.00%) 129.62 ( 83.46%)
Stddev 16384 1012.37 ( 0.00%) 252.08 ( 75.10%)

Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
---
kernel/sched/core.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..59077c7c6660 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq_flags rf;

#if defined(CONFIG_SMP)
- if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+ if (sched_feat(TTWU_QUEUE)) {
+ /*
+ * A remote wakeup is often expensive as can require
+ * an IPI and the wakeup path is slow. However, in
+ * the specific case where the target CPU is idle
+ * and polling, the CPU is active and rapidly checking
+ * if a reschedule is needed. In this case, the idle
+ * task just needs to be marked for resched and p
+ * will rapidly be requeued which is less expensive
+ * than the direct wakeup path.
+ */
+ if (cpus_share_cache(smp_processor_id(), cpu)) {
+ struct thread_info *ti = task_thread_info(p);
+ typeof(ti->flags) val = READ_ONCE(ti->flags);
+
+ if (val & _TIF_POLLING_NRFLAG)
+ goto activate;
+ }
+
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
ttwu_queue_remote(p, cpu, wake_flags);
return;
}
#endif

+activate:
rq_lock(rq, &rf);
update_rq_clock(rq);
ttwu_do_activate(rq, p, wake_flags, &rf);