Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

From: K Prateek Nayak
Date: Thu Aug 31 2023 - 16:23:36 EST


Hello David,

On 9/1/2023 12:41 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>
> Hi Prateek,
>
>> Even with the two patches, I still observe the following lock
>> contention when profiling the tbench 128-clients run with IBS:
>>
>> - 12.61% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>> - 10.94% native_queued_spin_lock_slowpath
>> - 10.73% _raw_spin_lock
>> - 9.57% __schedule
>> schedule_idle
>> do_idle
>> + cpu_startup_entry
>> - 0.82% task_rq_lock
>> newidle_balance
>> pick_next_task_fair
>> __schedule
>> schedule_idle
>> do_idle
>> + cpu_startup_entry
>>
>> Since David mentioned rq->avg_idle check is probably not the right step
>> towards the solution, this experiment introduces a per-shard
>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>> notifies of the possibility of one or more rq covered in the shard's
>> domain having a queued task. shard's overload flag is set at the same
>> time as "rq->rd->overload", and is cleared when shard's list is found
>> to be empty.
>
> I think this is an interesting idea, but I feel that it's still working
> against the core proposition of SHARED_RUNQ, which is to enable work
> conservation.

I don't think so! Work conservation only comes into play when there is
an imbalance. Consider the case where we have 15 tasks in the
shared_runq but 16 CPUs, 15 of which are running those 15 tasks while
one goes idle. Work is still conserved. What we need to worry about are
tasks sitting in the shared_runq that are also queued (waiting) on
their respective CPUs. That can only happen if at least one of the rqs
has nr_running >= 2, which is exactly the point at which we set
"shard->overload".

The situation can of course change later: all the tasks in the
shared_runq might end up running on their respective CPUs while
"shard->overload" remains set, since it is only cleared when the
shared_runq becomes empty. If that turns out to be too late, maybe we
can also clear it when periodic load balancing finds no queuing
(somewhere around the time we update nr_idle_scan).

So the window in which we skip popping a task from the
shared_runq_shard->list runs from the list going empty to at least one
of the CPUs associated with the shard reporting nr_running >= 2, which
is when work conservation is needed.
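
To make the set/clear points concrete, below is a rough sketch of what
I have in mind. The structure layout and helper names are only
illustrative stand-ins, not the actual hunks from the snipped patch:

struct shared_runq_shard {
        struct list_head list;  /* tasks that can be pulled */
        int overload;           /* hints that some rq covered by the
                                 * shard may have a queued (waiting)
                                 * task, i.e. nr_running >= 2 */
};

/*
 * Set at the same points where "rq->rd->overload" is set, i.e. when
 * an rq in the shard's domain is seen with nr_running >= 2.
 */
static inline void shard_set_overload(struct shared_runq_shard *shard)
{
        if (!shard->overload)
                shard->overload = 1;    /* plain store; not atomic */
}

/*
 * Cleared when the shard's list is found to be empty (and possibly,
 * as suggested above, when periodic load balancing finds no queuing
 * around the nr_idle_scan update).
 */
static inline void shard_clear_overload(struct shared_runq_shard *shard)
{
        if (shard->overload)
                shard->overload = 0;
}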

>
>> With these changes, following are the results for tbench 128-clients:
>
> Just to make sure I understand, this is to address the contention we're
> observing on tbench with 64 - 256 clients, right? That's my
> understanding from Gautham's reply in [0].
>
> [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@xxxxxxxxxxxxxxxxxxxxxx/

I'm not sure if Gautham saw the contention with IBS, but he did see an
abnormal blowup in newidle_balance() counts which he suspected was the
cause of the regression. I only noticed the rq lock contention after
doing a fair bit of surgery. Let me go check if that was the case with
vanilla v3 too.

>
> If so, are we sure this change won't regress other workloads that would
> have benefited from the work conservation?

I don't think we'll regress any workloads because, as explained above,
the "overload" flag being 0 almost always (almost, since the
update/access is not atomic) indicates a case where no task can be
pulled anyway. However, that needs to be tested, since there is a small
behavior change in shared_runq_pick_next_task(): previously, if the
task was running on a CPU, we would have popped it from the
shared_runq, done some lock fiddling before finding out it was
running, and some more lock fiddling before the function returned
"-1"; with the changes here, it simply returns "0". Although that is
correct, we have seen some interesting cases in the past [1] where a
random lock contention actually helps certain benchmarks ¯\_(ツ)_/¯

[1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@xxxxxxx/
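
For clarity, the behavior change amounts to roughly the following
(only a sketch of how I understand the flow; the function signature,
the shard accessor, and the pop helper name are stand-ins, not the
literal hunks):

static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
{
        struct shared_runq_shard *shard = rq_shard(rq); /* stand-in accessor */
        struct task_struct *p;

        /*
         * New: if no rq covered by the shard was seen with
         * nr_running >= 2, any task still on the list is almost
         * certainly running on its CPU, so skip the pop and the lock
         * fiddling altogether.
         */
        if (!shard->overload)
                return 0;       /* nothing pulled, no retry */

        p = shared_runq_pop_task(rq);   /* stand-in for the pop + locking */
        if (!p)
                return 0;

        /*
         * Old behavior: only here, after the lock fiddling, would we
         * notice the task is already running and return -1, which
         * makes pick_next_task_fair() return RETRY_TASK.
         */
        if (task_on_cpu(rq, p))
                return -1;

        /* ... otherwise move p over to this rq and report success ... */
        return 1;
}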

>
> Also, I assume that you don't see the improved contention without this,
> even if you include your fix to the newidle_balance() that has us skip
> over the <= LLC domain?

No improvements! The lock is still very much contended. I wonder if it
could be because of the unlocking and re-locking of the rq in
shared_runq_pick_next_task() even when the task is on a CPU. Also,
since it returns -1 for this case, pick_next_task_fair() will return
RETRY_TASK, which can have further implications.
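
That is, roughly (again only how I read the flow, not the actual
hunks; the exact guard in pick_next_task_fair() may differ):

        /* in pick_next_task_fair(), before the regular pick: */
        if (sched_feat(SHARED_RUNQ)) {
                if (shared_runq_pick_next_task(rq, rf) < 0)
                        /*
                         * The rq lock was dropped and re-acquired on
                         * the way, so the caller must restart the pick.
                         */
                        return RETRY_TASK;
        }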

>
> Thanks,
> David
>
> P.S. Taking off on vacation now, so any replies will be very delayed.
> Thanks again for working on this!

Hope you have a great vacation :)

>
>>
>> tip : 1.00 (var: 1.00%)
>> tip + v3 + series till patch 2 : 0.41 (var: 1.15%) (diff: -58.81%)
>> tip + v3 + full series : 1.01 (var: 0.36%) (diff: +00.92%)
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>> ---
>> [..snip..]
>>

--
Thanks and Regards,
Prateek