Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search

From: Chen Yu
Date: Sat Jul 09 2022 - 04:55:31 EST


On Thu, Jun 30, 2022 at 06:46:08PM +0800, Abel Wu wrote:
>
> On 6/30/22 12:16 PM, Chen Yu Wrote:
> > On Tue, Jun 28, 2022 at 03:58:55PM +0800, Abel Wu wrote:
> > >
> > > On 6/27/22 6:13 PM, Abel Wu Wrote:
> > > There seems to be little difference except in the hackbench pipe test
> > > at certain group counts (30~110).
> > OK, a smaller LLC domain seems to not make much difference, which might
> > suggest that, by leveraging the load balance code path, the read/write
> > to the LLC shared mask might not be the bottleneck. I have a vague
> > impression that during Aubrey's work on cpumask searching for idle
> > CPUs[1], there was a concern that updating the shared mask in a large
> > LLC introduced cache contention and performance degradation. Maybe we
> > can find that regressed test case to verify.
> > [1] https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@xxxxxxxxx/
>
> I just went through Aubrey's v1-v11 patches and didn't find any
> particular tests other than hackbench/tbench/uperf. Please let
> me know if I missed something, thanks!
>
I haven't found any test case that could trigger the cache contention
issue. I think we can stick with these test cases for now; tbench in
particular has detected a cache issue, described in
https://lore.kernel.org/lkml/e000b124-afd4-28e1-fde2-393b0e38ce19@xxxxxxx
if I understand correctly.
> > > I intend to provide better scalability
> > > by applying the filter, which will be enabled when:
> > >
> > > - The LLC is large enough that simply traversing it becomes
> > > insufficient, and/or
> > >
> > > - The LLC is loaded enough that unoccupied CPUs are a minority.
> > >
> > > But it would be very nice if a more fine-grained pattern works well
> > > so we can drop the above constraints.
> > >
> > We can first try to push a simple version and optimize it later.
> > One concern about v4 is that we changed the logic of v3, which recorded
> > the overloaded CPUs, while v4 tracks unoccupied CPUs. An overloaded CPU is
> > more "stable" because there is more than one running task on that runqueue,
> > so it is likely to remain "occupied" for a while. That is to say,
> > nr_task = 1, 2, 3... are all regarded as occupied, while only nr_task = 0
> > is unoccupied. The former would bring fewer false negatives/positives.
>
> Yes, I like the 'overloaded mask' too, but the downside is the extra
> cpumask ops needed in the SIS path (the added cpumask_andnot).
> Besides, in this patch the 'overloaded mask' is also unstable, because
> the state is maintained at core level rather than per-cpu; some
> more thoughts are in the cover letter.
>
I see.
> >
> > So far I have tested hackbench/schbench/netperf on top of Peter's sched/core branch,
> > with SIS_UTIL enabled. Overall it looks good, and netperf shows an especially
> > significant improvement when the load approaches overloaded (which is aligned
> > with your comment above). I'll re-run netperf for several cycles to check the
> > standard deviation. I'm also curious about v3's performance, since it
> > tracks overloaded CPUs, so I'll also test v3 with small modifications.
>
> Thanks very much for your reviewing and testing.
>
I modified your v3 patch a little bit, and the test result shows a good
improvement on netperf and no significant regression on
schbench/tbench/hackbench with this draft patch. I would vote for your v3
version, as it seems more straightforward; what do you think of the
following change: