> On Wed, Jul 13, 2022 at 12:17:33PM -0700, Libo Chen wrote:
>> On 7/13/22 09:40, Tim Chen wrote:
>>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>>>> seems to reduce overeager wakeup pulling and shows noticeable performance
>>>> improvement. [1]
>>>>
>>>> This patch argues that allowing sync for wakeups from interrupt context
>>>> is a bug and fixing it can improve performance even when irq/softirq is
>>>> evenly spread out.
>>>>
>>>> For wakeups from an ISR, the waking CPU is just the CPU of the ISR, and
>>>> the so-called waker can be any random task that happens to be running on
>>>> that CPU when the interrupt comes in. This is completely different from
>>>> other wakeups, where the task running on the waking CPU is the actual
>>>> waker: for example, two tasks communicating through a pipe, or multiple
>>>> tasks accessing the same critical section. This difference is important
>>>> because with sync we assume the waker will get off the runqueue and go
>>>> to sleep immediately after the wakeup. The assumption is built into
>>>> wake_affine(), where it discounts the waker's presence from the runqueue
>>>> when sync is true. The random waker from an interrupt bears no relation
>>>> to the wakee and doesn't usually go to sleep immediately afterwards
>>>> unless wakeup granularity is reached. Plus, the scheduler no longer
>>>> enforces preemption of the waker for sync wakeups as it did before
>>>> commit f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing
>>>> sync wakeup preemption for wakeups from interrupt contexts doesn't seem
>>>> appropriate either, but at least then sync wakeup would do what it is
>>>> supposed to do.
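
[For reference, the "discount" the changelog describes lives in
wake_affine_weight(). An abridged excerpt, close to but not verbatim the
mainline code of that era:]

static int
wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
		   int this_cpu, int prev_cpu, int sync)
{
	s64 this_eff_load, prev_eff_load;
	unsigned long task_load;

	this_eff_load = cpu_load(cpu_rq(this_cpu));

	if (sync) {
		unsigned long current_load = task_h_load(current);

		/*
		 * The waker ("current") is assumed to be about to sleep, so
		 * its load is subtracted from the waking CPU. For a wakeup
		 * from interrupt context, "current" is whatever unrelated
		 * task the interrupt happened to land on.
		 */
		if (current_load > this_eff_load)
			return this_cpu;

		this_eff_load -= current_load;
	}

	/* ... load comparison against prev_cpu follows, excerpt ends ... */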

>>> Will there be scenarios where you do want the task being woken up to be
>>> pulled to the CPU where the interrupt happened, as the data that needs to
>>> be accessed is on the local CPU/NUMA node where the interrupt happened?
>>> For example, an interrupt associated with received network packets. Sync
>>> still seems desirable, at least if there is no task currently running on
>>> the CPU where the interrupt happened. So maybe we should give some
>>> consideration to the load on the CPU/NUMA node before deciding whether we
>>> should do a sync wake for such an interrupt.
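
[A minimal sketch of the kind of gate Tim is suggesting; the helper name and
its placement are hypothetical, not existing kernel code:]

/*
 * Hypothetical: treat an interrupt-time wakeup as sync-like only when
 * the CPU the interrupt landed on has nothing else to run, so pulling
 * the wakee there cannot crowd out an unrelated running task.
 */
static bool irq_sync_wake_ok(int this_cpu)
{
	return available_idle_cpu(this_cpu);
}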

>> There are only two places where sync wakeup matters: wake_affine_idle() and
>> wake_affine_weight().
>>
>> In wake_affine_idle(), it considers pulling if there is exactly one runnable
>> task on the waking CPU, in the belief that this runnable will voluntarily
>> get off the runqueue. In wake_affine_weight(), it basically takes off the
>> waker's load, again assuming the waker goes to sleep after the wakeup.
>>
>> My argument is that this assumption doesn't really hold for wakeups from
>> interrupt contexts when the waking CPU is non-idle. Wakeups from task
>> context? Sure, that seems a reasonable assumption. For your idle case, I
>> totally agree, but I don't think having sync or not will actually have any
>> impact here, given what the code does. The real impact comes from Mel's
>> patch 7332dec055f2457c3, which makes it less likely to pull tasks when the
>> waking CPU is idle. I believe we should consider reverting 7332dec055f2,
>> because a significant RDS latency regression has been spotted recently on
>> our system due to that patch.
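
[To make the idle case concrete: in wake_affine_idle() as of 7332dec055f2,
abridged from mainline of that era, the cache-affinity check runs before the
sync check, and an idle waking CPU has nr_running == 0 anyway, so sync cannot
change the outcome there:]

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/*
	 * 7332dec055f2: if this_cpu is idle (i.e. the wakeup came from an
	 * interrupt), decide purely on cache affinity and return -- the
	 * sync test below is never reached in that case.
	 */
	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

	/* Only meaningful when this_cpu is busy running the sync waker. */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	return nr_cpumask_bits;
}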

> The intent of 7332dec055f2 was to prevent harmful cross-node accesses. It
> still allowed cache-local migrations on the assumption that the incoming
> data was critical enough to justify losing any other cache-hot data.

I am not too against cache-local migrations here because they are still under the

> You state explicitly that "the interrupt CPU isn't as performance critical
> as cache from its previous CPU" so that assumption was incorrect, at least
> in your case. I don't have a counter example where the interrupt data *is*
> more important than any other cache-hot data so the check can go.

This is the tricky part, I didn't explain it well. For rds-stress, it's a

> I think a revert would not achieve what you want, as a plain revert would
> still allow an interrupt to pull a task from an arbitrary location because
> sync is not checked. A follow-up to your patch or an updated version should
> not check available_idle_cpu at all in wake_affine_idle, as it's only idle
> if the wake is from interrupt context, and vcpu_is_preempted is not
> necessarily justification for pulling a task due to an interrupt.
>
> Something like this but needs testing with your target loads, particularly
> the RDS (Relational Database Service?) latency regression;

Sorry, it's Reliable Datagram Sockets. We run rds-stress to measure rds
latencies.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7b275672c89..e55a3a67a442 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5975,8 +5975,8 @@ static int wake_wide(struct task_struct *p)
>   * soonest. For the purpose of speed we only consider the waking and previous
>   * CPU.
>   *
> - * wake_affine_idle() - only considers 'now', it check if the waking CPU is
> - * cache-affine and is (or will be) idle.
> + * wake_affine_idle() - only considers 'now', it checks if the waker task is a
> + * sync wakeup from a CPU that should be idle soon.
>   *
>   * wake_affine_weight() - considers the weight to reflect the average
>   *                        scheduling latency of the CPUs. This seems to work
> @@ -5985,21 +5985,6 @@ static int wake_wide(struct task_struct *p)
>  static int
>  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>  {
> -	/*
> -	 * If this_cpu is idle, it implies the wakeup is from interrupt
> -	 * context. Only allow the move if cache is shared. Otherwise an
> -	 * interrupt intensive workload could force all tasks onto one
> -	 * node depending on the IO topology or IRQ affinity settings.
> -	 *
> -	 * If the prev_cpu is idle and cache affine then avoid a migration.
> -	 * There is no guarantee that the cache hot data from an interrupt
> -	 * is more important than cache hot data on the prev_cpu and from
> -	 * a cpufreq perspective, it's better to have higher utilisation
> -	 * on one CPU.
> -	 */
> -	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> -
>  	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>  		return this_cpu;
> 
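
[Assuming the remainder of the function is unchanged from mainline of that
era, wake_affine_idle() after this hunk would reduce to roughly:]

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/*
	 * Pull only for a genuine sync wakeup: the waker is the sole
	 * runnable task on the waking CPU and is expected to go to
	 * sleep imminently.
	 */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	/* Otherwise prefer the wakee's cache-hot previous CPU if idle. */
	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	return nr_cpumask_bits;
}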