Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

From: Atish Patra
Date: Thu Sep 21 2017 - 12:52:48 EST


On 09/21/2017 12:50 AM, Joel Fernandes wrote:
On Wed, Sep 20, 2017 at 2:17 PM, Atish Patra <atish.patra@xxxxxxxxxx> wrote:
On 09/20/2017 03:23 PM, Joel Fernandes wrote:

On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
<brendan.jackman@xxxxxxx> wrote:


On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:

On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
<brendan.jackman@xxxxxxx> wrote:

On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:

[..]

IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the
presence


Actually wake_affine doesn't check for balance if previous/next cpu
are within the same shared cache domain. The difference is some time
ago it would return true for shared cache but now it returns false
as
of 4.14-rc1:


http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466

Since it would return false in the above wake up cases for task 1
and
2, it would then run select_idle_sibling on the previous CPU which
is
also within the big cluster, so I don't think it will make a
difference in this case... Infact what it returns probably doesn't
matter.


So my paragraph here was making a leap in reasoning, let me try to
fill
the gap: On SMP these tasks never need to move around. If by some
chance
they did get coscheduled, the first load balance would spread them
out
and
then every time they wake up from then on, prev_cpu is the sensible
choice. So it will look something like:

v CPU v ->time->

-------------
{ 0 (SAME) 11111111111
cache { -------------
{ 1 (SAME) 222222222222|
-------------
{ 2 (SAME) 33333333333
cache { -------------
{ 3 (SAME) 44444444444
-------------

So here, task 2 wakes up the other guys and when it's doing tasks 3
and
4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
basically wake_affine's job to decide between prev_cpu and
smp_processor_id(). So "if wake_affine is behaving correctly" the
problem that this patch aims to solve (i.e. the fact that we overload
the waker's LLC domain because of bias towards prev_cpu) does not
arise
on SMP.


Yes SMP, but your patch is for solving a problem for non-SMP. So your
original statement about wake_affine solving any problem for SMP is
not relevant I feel :-P. I guess you can just kill this para from the
commit message to prevent confusion.


Ok I take that back, you were talking about guarding this feature by
the SD_ASYM_CPUCAPACITY flag.

I don't think that protection would be helpful because you can have
the same issue if the tasks do different amount of work on SMP. So in
that case some threads might still complete before the others and you
run into the same thing.


Well assuming we're still talking about one task per CPU, if you have
tasks doing different amount of work there's still no reason to move the
longer-running threads around. The only reason that happens in my
example is because of the asym capacity.


Yes but you can very well have RT pressure and things that temporarily
change the capacity equality. Also this is a simple benchmark and for
any reason you have more than 1 task running on those other CPUs and
then the idle CPUs run some of the tasks and you run into a similar
situation that might need your patch..

The patch would be helpful only if it doesn't cross NUMA boundary. right ?

If NUMA comes into picture, not sure searching across NUMA may hurt more
than help, especially in this case.

I don't understand what you mean by "searching across NUMA", :-(, do you
mean the slow path?

As I said, if the SD_BALANCE_WAKE flag for the sched domain flag is not
set, then a full wide search isn't done anyway. You have this code that
sets the sd variable:

if (tmp->flags & sd_flag)
sd = tmp;

Since sd = NULL if the sched domain (tmp->flags) isn't set, you will
always have select_idle_sibling running and not doing the full search
if I understand correctly.

Correct. I was under the impression that the above logic is going
to changed to fix the possible issue raised by Brendan for big SMP server as well. I guess that was not the case.


Further adding the ASYM protection isn't sensible if capacities are
affected by RT and IRQ time etc anyway. Does that make sense?

Yup. We won't need ASYM protection.

Regards,
Atish
I am glad I understand the code a bit better now after staring at it
for quite some time but I think some more staring is needed.

- Joel