Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

From: Mike Galbraith
Date: Mon Sep 28 2015 - 14:22:46 EST


On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:

> Mike, one more moment. wake_wide() and the current logic confuse me a bit.
> It makes us decide whether we want an affine wakeup or not, but
> select_idle_sibling() is not a function for choosing within this_cpu's LLC
> domain only. We use it for searching prev_cpu's LLC domain too, and it
> seems we are not interested in current's flips in that case.

We're always interested in "flips", as the point is to try to identify
N:M load components, and when they may overload a socket. The hope is
to get it more right than wrong, as making the tracking really accurate
is too expensive for the fast path.
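
For reference, the flip tracking in question lives in kernel/sched/fair.c;
a condensed sketch of the 4.3-era code (paraphrased, comments added, not a
verbatim quote):

static void record_wakee(struct task_struct *p)
{
	/* Decay the waker's flip count roughly once per second. */
	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
		current->wakee_flips >>= 1;
		current->wakee_flip_decay_ts = jiffies;
	}

	/* A "flip" means waking a different task than last time. */
	if (current->last_wakee != p) {
		current->last_wakee = p;
		current->wakee_flips++;
	}
}

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = this_cpu_read(sd_llc_size);

	if (master < slave)
		swap(master, slave);
	/*
	 * Treat the pair as part of an N:M load component, wide enough
	 * to overload one LLC domain, only when both sides flip often
	 * relative to the LLC size; then NAK the affine wakeup.
	 */
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}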

> Imagine a situation where we share a mutex with a task on another NUMA
> node. When the task releases the mutex it wakes us, but we definitely
> shouldn't use the affine logic in this case.

Why not? A wakeup is a wakeup is a wakeup; they all do the same thing.
If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
its opinion, then look for an idle CPU near the waker's CPU if it says
OK, or near the wakee's previous CPU if it says go away.
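
In code terms, the current flow in select_task_rq_fair() boils down to
roughly this (condensed, not verbatim; new_cpu starts out as prev_cpu):

	if (sd_flag & SD_BALANCE_WAKE)
		want_affine = !wake_wide(p) &&
			      cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

	/* ... walk the domains, setting affine_sd if want_affine ... */

	if (affine_sd) {
		/* A wake_affine() ACK pulls the target to the waker's CPU. */
		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
			new_cpu = cpu;
	}

	/*
	 * select_idle_sibling() scans for an idle CPU in the LLC around
	 * whichever target it was handed: the waker's CPU on an ACK,
	 * the wakee's prev_cpu otherwise.
	 */
	new_cpu = select_idle_sibling(p, new_cpu);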

> We wake the wakee anywhere and lose the hot cache.

Yeah, sometimes we'll make tasks drag their data to them when we could
have dragged the task to the data in the name of trying to crank up CPU
utilization. At some point, _somebody_ has to drag their data across the
interconnect, but we really don't know if/when the data transport cost
will pay off in better utilization.

-Mike

(I'll take a peek at the patch below when the damn futexes get done
kicking my a$$)

> I changed the logic and tried pgbench 1:8. The results are below (I threw
> away the first 3 iterations, because they differ a lot from iterations >= 4;
> it looks like the reason is uncached disk IO).
>
>
> Before:
>
> trans.   | tps (incl. conn.) | tps (excl. conn.)
> --------------------------------------------------
> 12098226 | 60491.067392 | 60500.886373
> 12030184 | 60150.874285 | 60160.654295
> 11882977 | 59414.829150 | 59424.830637
> 12020125 | 60100.579023 | 60111.600176
> 12161917 | 60809.547906 | 60827.321639
> 12154660 | 60773.249254 | 60783.085165
>
> After:
>
> trans.   | tps (incl. conn.) | tps (excl. conn.)
> --------------------------------------------------
> 12770407 | 63849.883578 | 63860.310019
> 12635366 | 63176.399769 | 63187.152569
> 12676890 | 63384.396440 | 63400.930755
> 12639949 | 63199.526330 | 63210.460753
> 12670626 | 63353.079951 | 63363.274143
> 12647001 | 63209.613698 | 63219.812331
>
> I'm going to test other cases, but could you tell me (if you remember)
> whether there are reasons we skip prev_cpu, as I described above? Some
> types of workloads, etc.?
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> -	if (sd_flag & SD_BALANCE_WAKE)
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = 1;
> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			goto want_affine;
> +		if (wake_wide(p))
> +			goto want_affine;
> +	}
>  
>  	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  	}
>  
> -	if (affine_sd) {
> +want_affine:
> +	if (want_affine) {
>  		sd = NULL; /* Prefer wake_affine over balance flags */
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>  			new_cpu = cpu;
> -	}
> -
> -	if (!sd) {
> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> -			new_cpu = select_idle_sibling(p, new_cpu);
> -
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  	} else while (sd) {
>  		struct sched_group *group;
>  		int weight;
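
For contrast, the net effect of the quoted patch, condensed (paraphrased
from the diff above, not verbatim):

	/* For every SD_BALANCE_WAKE wakeup ... */
	if (want_affine) {
		/* ... consult wake_affine() only when an affine_sd was
		 * found (wake_wide() didn't NAK, cpu != prev_cpu) ... */
		if (affine_sd && wake_affine(affine_sd, p, sync))
			new_cpu = cpu;
		/* ... and always scan an LLC for an idle CPU: the
		 * waker's LLC on an ACK, prev_cpu's LLC otherwise. */
		new_cpu = select_idle_sibling(p, new_cpu);
	}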

