Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

From: Kirill Tkhai
Date: Tue Sep 29 2015 - 12:00:46 EST

Next message: Javi Merino: "Re: [PATCH 3/3] Thermal: do thermal zone update after a cooling device registered"
Previous message: Steven Rostedt: "Re: [RFC][PATCH 11/11] sched: More notrace"
In reply to: Mike Galbraith: "Re: [PATCH] sched/fair: Skip wake_affine() for core siblings"
Next in thread: Kirill Tkhai: "Re: [PATCH] sched/fair: Skip wake_affine() for core siblings"

On 29.09.2015 17:55, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  	int want_affine = 0;
>>  	int sync = wake_flags & WF_SYNC;
>>
>> -	if (sd_flag & SD_BALANCE_WAKE)
>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = 1;
>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> +			goto want_affine;
>> +		if (wake_wide(p))
>> +			goto want_affine;
>> +	}
>
> That blew wake_wide() right out of the water.
>
> It's not only about things like pgbench. Drive multiple tasks in a Xen
> guest (single event channel dom0 -> domu, and no select_idle_sibling()
> to save the day) via network, and watch workers fail to be all they can
> be because they keep being stacked up on the irq source. Load balancing
> yanks them apart, next irq stacks them right back up. I met that in
> enterprise land, thought wake_wide() should cure it, and indeed it did.

1) Hm. The patch makes select_task_rq_fair() prefer the old cpu over the
current one, doesn't it? We now leave affine_sd unset more often, so the
part of the patch that was skipped in the quote selects prev_cpu.

2) I thought about wakeups from irq handlers, and was even going to ask why
we use affine logic for such wakeups. Device handlers usually aren't
bound, and timers may migrate now that the NO_HZ logic is present. The only
explanation I found is that unbound timers are a very unlikely case (I added
a statistics printk to my local sched_debug to check that). But if we have
situations like the one you described above, shouldn't we disable affine
logic for in_interrupt() cases?

3) I'm asking only because (being outside of scheduler history) it seems a
little strange that we prefer smp_processor_id()'s sd_llc so much. The
benefit for sync wakeups is more or less clear: smp_processor_id()'s sd_llc
may hold data that is interesting to the wakee, and this minimizes cache
misses. But we do the same in other cases too, and at every migration we lose
itlb, dtlb... Of course, this requires more careful patches than the one
posted (not such crude patches).
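
To make points 1) and 2) concrete, here is a small userspace model of the
decision, not kernel code. It assumes SD_BALANCE_WAKE is set, and it assumes
that the goto in the posted hunk skips the search for affine_sd (the label
itself is outside the quoted hunk). The struct wake_ctx and all function names
are made up for illustration; the irq-aware variant only sketches the question
from point 2) and is not something the kernel does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened inputs; none of these names exist in the kernel. */
struct wake_ctx {
	int cpu;          /* waking CPU (smp_processor_id() in the kernel) */
	int prev_cpu;     /* CPU the wakee last ran on */
	bool cpu_allowed; /* models cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) */
	bool wide;        /* models the result of wake_wide(p) */
	bool in_irq;      /* models in_interrupt() at wakeup time */
};

/* Removed lines of the hunk: want_affine = !wake_wide(p) && cpu allowed. */
static bool affine_search_old(const struct wake_ctx *c)
{
	return !c->wide && c->cpu_allowed;
}

/*
 * Added lines of the hunk. ASSUMPTION: taking the goto means the search
 * for affine_sd is skipped, so affine_sd stays unset.
 */
static bool affine_search_patched(const struct wake_ctx *c)
{
	if (c->cpu == c->prev_cpu || !c->cpu_allowed)
		return false;
	if (c->wide)
		return false;
	return true;
}

/* Idea from point 2) only: never use affine logic for irq wakeups. */
static bool affine_search_irq_aware(const struct wake_ctx *c)
{
	if (c->in_irq)
		return false;
	return affine_search_old(c);
}

/* Usage example: the cases where the three variants differ. */
static void affine_model_self_check(void)
{
	struct wake_ctx same = { .cpu = 1, .prev_cpu = 1, .cpu_allowed = true };
	struct wake_ctx irq = { .cpu = 0, .prev_cpu = 1, .cpu_allowed = true,
				.in_irq = true };

	/* cpu == prev_cpu: only the old code searches for affine_sd. */
	assert(affine_search_old(&same) && !affine_search_patched(&same));
	/* irq wakeup on another cpu: old and patched code both search. */
	assert(affine_search_old(&irq) && affine_search_patched(&irq));
	/* The irq-aware variant skips the search for the same wakeup. */
	assert(!affine_search_irq_aware(&irq));
}
```

In this model the old and the patched variants differ only when
cpu == prev_cpu; wake_wide() still suppresses the search in both.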

Thanks,
Kirill

