Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search

From: Abel Wu
Date: Thu Jun 30 2022 - 06:46:18 EST

Next message: kernel test robot: "[morimoto:sound-2022-06-30-v3 19/19] sound/soc/codecs/jz4770.c:768:10: error: 'struct snd_soc_dai_driver' has no member named 'playback'"
Previous message: Ido Schimmel: "Re: [PATCH] mellanox/mlxsw: fix repeated words in comments"
In reply to: Chen Yu: "Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 6/30/22 12:16 PM, Chen Yu Wrote:

On Tue, Jun 28, 2022 at 03:58:55PM +0800, Abel Wu wrote:

On 6/27/22 6:13 PM, Abel Wu Wrote:
There seems like not much difference except hackbench pipe test at
certain groups (30~110).

OK, smaller LLC domain seems to not have much difference, which might
suggest that by leveraging load balance code path, the read/write
to LLC shared mask might not be the bottleneck. I have an vague
impression that during Aubrey's cpumask searching for idle CPUs
work[1], there is concern that updating the shared mask in large LLC
has introduced cache contention and performance degrading. Maybe we
can find that regressed test case to verify.
[1] https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@xxxxxxxxx/

I just went through Aubrey's v1-v11 patches and didn't find any
particular tests other than hackbench/tbench/uperf. Please let
me know if I missed something, thanks!

I am intended to provide better scalability
by applying the filter which will be enabled when:

- The LLC is large enough that simply traversing becomes
in-sufficient, and/or

- The LLC is loaded that unoccupied cpus are minority.

But it would be very nice if a more fine grained pattern works well
so we can drop the above constrains.

We can first try to push a simple version, and later optimize it.
One concern about v4 is that, we changed the logic in v3, which recorded
the overloaded CPU, while v4 tracks unoccupied CPUs. An overloaded CPU is
more "stable" because there are more than 1 running tasks on that runqueue.
It is more likely to remain "occupied" for a while. That is to say,
nr_task = 1, 2, 3... will all be regarded as occupied, while only nr_task = 0
is unoccupied. The former would bring less false negative/positive.

Yes, I like the 'overloaded mask' too, but the downside is extra
cpumask ops needed in the SIS path (the added cpumask_andnot).
Besides, in this patch, the 'overloaded mask' is also unstable due
to the state is maintained at core level rather than per-cpu, some
more thoughts are in cover letter.

By far I have tested hackbench/schbench/netperf on top of Peter's sched/core branch,
with SIS_UTIL enabled. Overall it looks good, and netperf has especially
significant improvement when the load approaches overloaded(which is aligned
with your comment above). I'll re-run the netperf for several cycles to check the
standard deviation. And I'm also curious about v3's performance because it
tracks overloaded CPUs, so I'll also test on v3 with small modifications.

Thanks very much for your reviewing and testing.

Abel

Next message: kernel test robot: "[morimoto:sound-2022-06-30-v3 19/19] sound/soc/codecs/jz4770.c:768:10: error: 'struct snd_soc_dai_driver' has no member named 'playback'"
Previous message: Ido Schimmel: "Re: [PATCH] mellanox/mlxsw: fix repeated words in comments"
In reply to: Chen Yu: "Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]