Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

From: Ze Gao
Date: Mon Apr 15 2024 - 07:05:22 EST


On Mon, Apr 15, 2024 at 6:25 PM Madadi Vineeth Reddy
<vineethr@xxxxxxxxxxxxx> wrote:
>
> Hi Ze Gao,
>
> On 15/04/24 07:33, Ze Gao wrote:
> > On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
> > <vineethr@xxxxxxxxxxxxx> wrote:
> >>
> >> Hi Ze Gao,
> >>
> >> On 13/03/24 14:28, Ze Gao wrote:
> >>> We observed that select_idle_sibling() is likely to return the *target* cpu
> >>> early, which is often the previous cpu this task was running on, even when
> >>> that cpu is not within the newly set affinity mask. From there on, we can
> >>> only rely on select_fallback_rq() to choose a cpu for us at its will
> >>> (mostly the first valid one for now).
> >>>
> >>> However, the cpu chosen by select_fallback_rq() is highly likely not a
> >>> good enough candidate; sometimes the task has to rely on the load balancer
> >>> to kick in and place it on a better cpu, which no doubt adds one or more
> >>> unnecessary migrations. For example, this is what I get when I move task
> >>> 3964 to cpus 23-24, where cpu 23 already has a cpu-bound task pinned:
> >>>
> >>> swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
> >>> kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24
> >>>
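> >>> To make the flow concrete, here is a rough user-space model of the wakeup
> >>> path (hand-simplified from select_task_rq()/select_idle_sibling()/
> >>> select_fallback_rq(); the helper names are illustrative, not kernel code):
> >>>
> >>> 	#include <stdbool.h>
> >>> 	#include <stdio.h>
> >>>
> >>> 	#define NR_CPUS 64
> >>>
> >>> 	static unsigned long long cpus_allowed;	/* model of p->cpus_ptr */
> >>> 	static unsigned long long cpus_idle;	/* model of per-cpu idle state */
> >>>
> >>> 	static bool cpu_allowed(int cpu) { return cpus_allowed >> cpu & 1; }
> >>> 	static bool cpu_is_idle(int cpu) { return cpus_idle >> cpu & 1; }
> >>>
> >>> 	/* Mainline behaviour: the fast paths and the LLC scan (elided here)
> >>> 	 * never test *target* against the new mask, and when nothing better
> >>> 	 * is found the possibly-disallowed *target* is returned. */
> >>> 	static int select_idle_sibling_model(int prev, int target)
> >>> 	{
> >>> 		return target;
> >>> 	}
> >>>
> >>> 	/* Fallback: the first allowed cpu, idle or not. */
> >>> 	static int select_fallback_rq_model(void)
> >>> 	{
> >>> 		for (int cpu = 0; cpu < NR_CPUS; cpu++)
> >>> 			if (cpu_allowed(cpu))
> >>> 				return cpu;
> >>> 		return -1;
> >>> 	}
> >>>
> >>> 	int main(void)
> >>> 	{
> >>> 		int prev = 13;
> >>>
> >>> 		cpus_allowed = 3ULL << 23;	/* taskset to cpus 23-24 */
> >>> 		cpus_idle = 1ULL << 24;		/* cpu 23 runs a pinned hog */
> >>>
> >>> 		int cpu = select_idle_sibling_model(prev, prev);
> >>> 		if (!cpu_allowed(cpu))		/* is_cpu_allowed() in core.c */
> >>> 			cpu = select_fallback_rq_model();
> >>> 		printf("woken on cpu %d\n", cpu);	/* 23: busy, not idle 24 */
> >>> 		return 0;
> >>> 	}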
> >>
> >> I am able to reproduce this scenario of an extra migration through load balance:
> >> swapper 0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
> >> ksoftirqd/0 13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
> >>
> >> I wrote a simple C program that blocks for a few seconds; while it is blocked, I taskset it to CPUs 33,34 while I already have a
> >> busy task running on CPU 33.
> >>
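> >> For reference, a minimal version of such a program could look like this
> >> (a sketch of the idea, not the exact program; the sleep just needs to be
> >> long enough to run taskset -p from another shell, and the busy loop makes
> >> the post-wakeup placement and any load-balance migration visible):
> >>
> >> 	#include <stdio.h>
> >> 	#include <unistd.h>
> >>
> >> 	int main(void)
> >> 	{
> >> 		printf("pid %d sleeping\n", getpid());
> >> 		sleep(10);	/* blocked: change the affinity now */
> >>
> >> 		for (volatile unsigned long i = 0; i < (1UL << 31); i++)
> >> 			;	/* burn cpu after waking up */
> >> 		return 0;
> >> 	}
> >>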
> >>> The thing is we can actually do better if we do the checks early and take
> >>> more advantage of *target* in select_idle_sibling(). That is, we continue
> >>> the idle cpu selection if *target* fails the cpumask_test_cpu(*target*,
> >>> p->cpus_ptr) test. By doing so, we are likely to pick a good candidate,
> >>> especially when the newly allowed cpu set shares some cpu resources with
> >>> *target*.
> >>>
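> >>> In terms of the model above, the proposed behaviour is roughly the
> >>> following (reusing cpu_allowed()/cpu_is_idle(); a sketch of the idea,
> >>> not the literal diff):
> >>>
> >>> 	static int select_idle_sibling_patched(int prev, int target)
> >>> 	{
> >>> 		/* take the fast path only on a still-allowed cpu */
> >>> 		if (cpu_allowed(target) && cpu_is_idle(target))
> >>> 			return target;
> >>>
> >>> 		/* stand-in for the scan near *target*: prefer an allowed
> >>> 		 * idle cpu, ideally one sharing resources with it */
> >>> 		for (int cpu = 0; cpu < NR_CPUS; cpu++)
> >>> 			if (cpu_allowed(cpu) && cpu_is_idle(cpu))
> >>> 				return cpu;
> >>>
> >>> 		return target;	/* may still be disallowed; fallback handles it */
> >>> 	}
> >>>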
> >>> And with this change, we clearly see the improvement when I move task 3964
> >>> to cpus 25-26, where cpu 25 already has a cpu-bound task pinned:
> >>>
> >>> swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26
> >>
> >> But after applying this patch, the extra migration still happens, as CPU 33 is still chosen by try_to_wake_up().
> >>
> >> After placing some perf probes and testing:
> >> migration/57 304 [057] 12216.988491: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
> >> swapper 0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
> >> swapper 0 [004] 12226.989071: probe:select_fallback_rq: (c0000000001a2e38) cpu=4
> >> swapper 0 [004] 12226.989074: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
> >> swapper 0 [000] 12227.007768: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
> >>
> >> It is observed that select_fallback_rq() is still taken in this scenario, as the default target (which was CPU 4)
> >> is returned at the end of select_idle_sibling().
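> >>
> >> That is, with i, prev_aff and recent_used_cpu all -1 as in the probe output
> >> above, the tail of the function (paraphrasing mainline fair.c) falls through
> >> to the possibly-disallowed *target*:
> >>
> >> 	i = select_idle_cpu(p, sd, has_idle_core, target);
> >> 	if ((unsigned int)i < nr_cpumask_bits)
> >> 		return i;
> >> 	if ((unsigned int)prev_aff < nr_cpumask_bits)
> >> 		return prev_aff;
> >> 	if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
> >> 		return recent_used_cpu;
> >>
> >> 	return target;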
> >
> > My best guess is that you may have hit the code path for running tasks
> > (the taskset happens right after the task is woken up). Should that happen,
> > the cpu is picked via:
> >
> > dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
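> >
> > which, roughly modeled in user space, picks a cpu from the AND of the two
> > masks but starts each search after the previously returned cpu, so repeated
> > calls spread the choice around (a sketch of lib/cpumask.c behaviour, with a
> > static where the kernel keeps per-cpu state):
> >
> > 	static int any_and_distribute_model(unsigned long long valid,
> > 					    unsigned long long new_mask)
> > 	{
> > 		static int prev = -1;
> > 		unsigned long long mask = valid & new_mask;
> >
> > 		for (int i = 1; i <= NR_CPUS; i++) {
> > 			int cpu = (prev + i) % NR_CPUS;
> > 			if (mask >> cpu & 1)
> > 				return prev = cpu;
> > 		}
> > 		return -1;	/* kernel: >= nr_cpu_ids */
> > 	}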
>
> Yes, I verified that cpumask_any_and_distribute is hit.
>
> >
> > and it also makes sense that select_fallback_rq() returns 4, since that
> > happens before you change the affinities.
>
> In this case, 4 is passed as an argument to select_fallback_rq(); it's not the return
> value. It actually returns the first affined CPU, which is 33 here.
>
> Also, I see select_fallback_rq() happening after cpumask_any_and_distribute().
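>
> (For reference, the signature in kernel/sched/core.c is
>
> 	static int select_fallback_rq(int cpu, struct task_struct *p)
>
> so the first argument is the cpu the task was last running on, and the
> chosen cpu is the return value.)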
>
> >
> > you may need to rule out this case first :)
> >
>
> I am not sure how to avoid hitting cpumask_any_and_distribute(). Can you explain how you moved the tasks in the case
> of stress-ng?

Actually there is no easy way to avoid tasksetting a running task
(maybe -l with a low load expectation, which increases the chance
the task sleeps). And then it is simple and stupid: make the trace
as verbose as I can, and filter for what I pay attention to.

Regards,
Ze

> Thanks and Regards
> Madadi Vineeth Reddy
>
> > Regards,
> > Ze
> >
> >> In most of my testing, the default target is returned at the end of the function due to the initial checks. There may
> >> be cases where we can get an optimal CPU before we reach the end of select_idle_sibling(), but it would be interesting to know
> >> whether the extra time spent finding an optimal cpu has an impact, compared to returning earlier, given that most of the time
> >> we end up returning the default target at the end anyway.
> >>
> >> Thanks and Regards
> >> Madadi Vineeth Reddy
> >>
> >>>
> >>> Note we do the same check for *prev* in select_idle_sibling() as well.
> >>>
> >>> Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
> >>> ---
> >>
>