Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

From: Ze Gao
Date: Mon Apr 15 2024 - 07:06:47 EST


On Mon, Apr 15, 2024 at 6:36 PM Madadi Vineeth Reddy
<vineethr@xxxxxxxxxxxxx> wrote:
>
> On 15/04/24 11:10, Ze Gao wrote:
> > On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
> > <vineethr@xxxxxxxxxxxxx> wrote:
> >>
> >> Hi Ze Gao,
> >>
> >> On 13/03/24 14:28, Ze Gao wrote:
> >>> We observered select_idle_sibling() is likely to return the *target* cpu
> >>> early which is likely to be the previous cpu this task is running on even
> >>> when it's actually not within the affinity list newly set, from where after
> >>> we can only rely on select_fallback_rq() to choose one for us at its will
> >>> (the first valid mostly for now).
> >>>
> >>> However, the one chosen by select_fallback_rq() is highly likely not a
> >>> good enough candidate, sometimes it has to rely on load balancer to kick
> >>> in to place itself to a better cpu, which adds one or more unnecessary
> >>> migrations in no doubt. For example, this is what I get when I move task
> >>> 3964 to cpu 23-24 where cpu 23 has a cpu bound work pinned already:
> >>>
> >>> swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
> >>> kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24
> >>>
> >>
> >> I am able to reproduce this scenario of having an extra migration through load balance
> >> swapper 0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
> >> ksoftirqd/0 13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
> >>
> >> I wrote a simple c program that blocks for few seconds, meanwhile I taskset it to CPUs 33,34 while I already have a
> >> busy task running on CPU 33.
> >>
> >>> The thing is we can actually do better if we do checks early and take more
> >>> advantages of the *target* in select_idle_sibling(). That is, we continue
> >>> the idle cpu selection if *target* fails the test of cpumask_test_cpu(
> >>> *target*, p->cpus_ptr). By doing so, we are likely to pick a good candidate,
> >>> especially when the newly allowed cpu set shares some cpu resources with
> >>> *target*.
> >>>
> >>> And with this change, we clearly see the improvement when I move task 3964
> >>> to cpu 25-26 where cpu 25 has a cpu bound work pinned already.
> >>>
> >>> swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26
> >>
> >> But after applying this patch, The extra migration is still happening as CPU 33 is still chosen by try_to_wake_up.
> >>
> >> On placing some perf probes and testing,
> >> migration/57 304 [057] 12216.988491: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
> >> swapper 0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
> >> swapper 0 [004] 12226.989071: probe:select_fallback_rq: (c0000000001a2e38) cpu=4
> >> swapper 0 [004] 12226.989074: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
> >> swapper 0 [000] 12227.007768: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
> >>
> >> It is observed that, select_fallback_rq is still taken in this scenario as default target is returned at the end of select_idle_sibling
> >> which was CPU 4.
> >
> > After second thoughts, it indeed could happen if CPU 4 shares nothing
> > with CPU 33(34),
> > for example, in different numa nodes.
> >
> > IOW, it cannot benefit from select_idle_siblings() and has to rely on
> > select_fallback_rq
> > as the last resort. Just like what I said in the changelog, this patch
> > aims to improve rq
> > selection for cases where the newly allowed cpu set shares some cpu
> > resources with
> > the old cpu set.
>
> Right. In power 10(where I tested), LLC is smaller being at SMT4 small core
> level. So, I think there are less chances of affined CPUs to share resources
> with the old CPU set.
>
> I also ran schbench and hackbench just to make sure that there is no regression
> in powerpc. Luckily there is no regression.
> Tested-by: Madadi Vineeth Reddy <vineethr@xxxxxxxxxxxxx>

Thanks for the testing!

Sincerely,
Ze

> Thanks and Regards
> Madadi Vineeth Reddy
>
> >
> > Sorry for not being able to recall all details immediately, since this
> > thread has been inactive
> > for a long while and receives no feedback from sched folks.
> >
> > Best,
> > Ze
> >
> >> In most of my testing, default target is returned at the end of the function due to the initial checks. It's possible that there would
> >> be cases where we can get optimal CPU before we reach end of the select_idle_sibling function but it would be interesting to know if the
> >> extra time spent in finding an optimal cpu have an impact instead of returning it earlier if in most of the times we are returning the
> >> default target at the end.
> >>
> >> Thanks and Regards
> >> Madadi Vineeth Reddy
> >>
> >>>
> >>> Note we do the same check for *prev* in select_idle_sibling() as well.
> >>>
> >>> Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
> >>> ---
> >>
>