Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
From: Peter Zijlstra
Date: Fri May 15 2020 - 07:29:00 EST
On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> sched: Wake cache-local tasks via wake_list if wakee CPU is polling
>
> There are two separate methods for waking a task from idle, one for
> tasks running on CPUs that share a cache and one for when the caches are
> separate. The methods can loosely be called local and remote even though
> this is not directly related to NUMA and instead is due to the expected
> cost of accessing data that is cache-hot on another CPU. The "local" costs
> are expected to be relatively cheap as they share the LLC in comparison to
> a remote IPI that is potentially required when using the "remote" wakeup.
> The problem is that the local wakeup is not always cheaper and it appears
> to have degraded even further around the 4.19 mark.
>
> There appear to be a couple of reasons why it can be slower.
>
> The first is specific to pairs of tasks where one or both rapidly enters
> idle. For example, on netperf UDP_STREAM, the client sends a bunch of
> packets, wakes the server, the server processes some packets and goes
> back to sleep. There is a relationship between the tasks but it's not
> strictly synchronous. The timing is different if the client/server are on
> separate NUMA nodes and netserver is more likely to enter idle (measured
> as server entering idle 10% more times when tasks are local vs remote
> but machine-specific). However, the wakeups are so rapid that the wakeup
> happens while the server is descheduling. That forces the waker to spin
> on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> add the task to the rq->wake_list even if that potentially requires an IPI.
>
> The second is that the local wakeup path is simply not always
> that fast. Using ftrace, the cost of the locks, update_rq_clock and
> ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
> a single instance, the cost of the "remote" wakeup for try_to_wake_up
> was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
> wakeup. When there are tens of thousands of wakeups per second, these costs
> accumulate and manifest as a throughput regression in netperf UDP_STREAM.
>
> The essential difference in costs comes down to whether the CPU is fully
> idle, a task is descheduling or polling in poll_idle(). This patch
> special cases ttwu_queue() to use the "remote" method if the CPU's
> task is polling as it's generally cheaper to use the wake_list in that
> case than the local method because an IPI should not be required. As it is
> race-prone, a reschedule IPI may still be sent but in that case the local
> wakeup would also have to send a reschedule IPI so it should be harmless.
We don't in fact send a wakeup IPI when polling. So this might end up
with an extra IPI.
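
A minimal user-space sketch of the point above, assuming C11 atomics and
hypothetical SK_* flag values (the real bit positions are per-architecture).
It is loosely modelled on the idea behind set_nr_if_polling() and is not the
kernel implementation: the waker sets the resched flag only while the target
is still advertising that it polls, and falls back to an IPI otherwise.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch only. */
#define SK_TIF_NEED_RESCHED   (1UL << 0)
#define SK_TIF_POLLING_NRFLAG (1UL << 1)

/*
 * Set NEED_RESCHED only if the target is still polling.
 * Returns true when the flag write alone is enough to wake the target,
 * false when the target is not polling and the caller must send an IPI.
 */
static bool sk_set_nr_if_polling(_Atomic unsigned long *flags)
{
	unsigned long old = atomic_load(flags);

	for (;;) {
		if (!(old & SK_TIF_POLLING_NRFLAG))
			return false;	/* not polling: IPI required */
		if (old & SK_TIF_NEED_RESCHED)
			return true;	/* someone else already set it */
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(flags, &old,
						 old | SK_TIF_NEED_RESCHED))
			return true;
	}
}
```

A caller would send the reschedule IPI only when this returns false, which is
why a path that sends one unconditionally can end up with an extra IPI.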
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..59077c7c6660 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> struct rq_flags rf;
>
> #if defined(CONFIG_SMP)
> - if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> + if (sched_feat(TTWU_QUEUE)) {
> + /*
> + * A remote wakeup is often expensive as it can require
> + * an IPI and the wakeup path is slow. However, in
> + * the specific case where the target CPU is idle
> + * and polling, the CPU is active and rapidly checking
> + * if a reschedule is needed.
Not strictly true; MWAIT can be very deep idle, it's just that with
POLLING we indicate we do not have to send an IPI to wake up. Just
setting the TIF_NEED_RESCHED flag is sufficient to wake up -- the
monitor part of monitor-wait.
> In this case, the idle
> + * task just needs to be marked for resched and p
> + * will rapidly be requeued which is less expensive
> + * than the direct wakeup path.
> + */
> + if (cpus_share_cache(smp_processor_id(), cpu)) {
> + struct thread_info *ti = task_thread_info(p);
> + typeof(ti->flags) val = READ_ONCE(ti->flags);
> +
> + if (val & _TIF_POLLING_NRFLAG)
> + goto activate;
I'm completely confused... the result here is that if you're polling you
do _NOT_ queue on the wake_list, but instead immediately enqueue.
(which kinda makes sense, since if the remote CPU is idle, it doesn't
have these lines in its cache anyway)
But the subject and comments all seem to suggest the opposite !?
Also, this will fail compilation when !defined(TIF_POLLING_NRFLAG).
> + }
> +
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> ttwu_queue_remote(p, cpu, wake_flags);
> return;
> }
> #endif
>
> +activate:
The label wants to go inside the ifdef, otherwise GCC will complain
about unused labels etc.
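
A stand-alone sketch of the control flow with both build issues addressed,
assuming hypothetical SK_* stand-ins for CONFIG_SMP and TIF_POLLING_NRFLAG and
a plain flags word in place of thread_info. It mirrors what the posted hunk
does (polling plus shared cache goes to direct activation), not what the
subject line describes. The goto and its label sit under the same
preprocessor conditions, so neither is left dangling in any configuration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; remove either define to model the other builds. */
#define SK_CONFIG_SMP 1
#define SK_TIF_POLLING_NRFLAG (1UL << 1)

enum sk_path { SK_DIRECT_ACTIVATE, SK_WAKE_LIST };

static enum sk_path sk_ttwu_queue(bool feat_ttwu_queue, bool share_cache,
				  unsigned long ti_flags)
{
#if defined(SK_CONFIG_SMP)
	if (feat_ttwu_queue) {
#if defined(SK_TIF_POLLING_NRFLAG)
		/* As in the hunk: polling + shared cache skips the wake_list. */
		if (share_cache && (ti_flags & SK_TIF_POLLING_NRFLAG))
			goto activate;
#endif
		return SK_WAKE_LIST;	/* ttwu_queue_remote() in the patch */
	}
#if defined(SK_TIF_POLLING_NRFLAG)
activate:	/* only emitted when the goto above exists */
#endif
#endif
	return SK_DIRECT_ACTIVATE;	/* rq_lock + ttwu_do_activate in the patch */
}
```

For example, sk_ttwu_queue(true, true, SK_TIF_POLLING_NRFLAG) takes the direct
path, while sk_ttwu_queue(true, true, 0) goes to the wake_list.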
> rq_lock(rq, &rf);
> update_rq_clock(rq);
> ttwu_do_activate(rq, p, wake_flags, &rf);