Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core

From: Abel Wu
Date: Mon Aug 29 2022 - 10:12:01 EST


Hi Mel, thanks for reviewing!

On 8/29/22 9:08 PM, Mel Gorman Wrote:
On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
When SIS_UTIL is enabled, SIS domain scan will be skipped if
the LLC is overloaded. Since the overloaded status is checked
in the load balancing at LLC level, the interval is llc_size
miliseconds. The duration might be long enough to affect the
overall system throughput if idle cores are out of reach in
SIS domain scan.

Signed-off-by: Abel Wu <wuyun.abel@xxxxxxxxxxxxx>

Split this patch to move the this_sd lookup into the SIS_PROP section.

OK


Otherwise, this is the most controversial patch in the series and the
most likely to cause problems where it wins on some machines and
workloads and loses on others.

The corner case to worry about is a workload that is rapidly idling and
the has_idle_core hint is often wrong. This can happen with hackbench for
example, or at least this was true when I last looked which is quite some
time ago. If this hint is often wrong, then there will be full scan cost
incurred regardless of SIS_UTIL that often fails to find a core.

Yes, I can't agree more. And the situation can become worse when LLC
gets bigger as you mentioned below. I will exclude this part from v2.


So, first suggestion is to move this patch to the end of the series as
the other patches are relatively harmless. They could even be merged in
isolation as a cleanup.

Second, using the other patches as your baseline, include in the
changelog what you tested that showed a benefit, what type of machine
it was and in particular include the number of cores, nodes and the
span of the LLC. If you measured any regressions, include that as well
and make a call on whether you think the patch wins more than it loses.
The reason to include that information is because the worst corner case
(all CPUs scanned uselessly) costs more the larger the span of LLC is.
If all the testing was done on a 2-core SMT-2 machine, the overhead of the
patch would be negligible but very different if the LLC span is 20 cores.
While the patch is not obviously wrong, it definitely needs better data,
Even if you do not have a large test machine available, it's still helpful
to have it in the changelog because a reviewer like me can decide "this
needs testing on a larger machine".

Thanks for your detailed suggestions. I will attach benchmark results
along with some analysis next time when posting performance related
patches.


I did queue this up the entire series for testing and while it sometimes
benefitted, there were large counter-examples. tbench4 on Zen3 showed some
large regressions (e.g. 41% loss on 64 clients with a massive increase in
variability) which has multiple small L3 caches per node. tbench4 (slightly
modified in mmtests to produce a usable metric) in general showed issues
across multiple x86-64 machines both AMD and Intel, multiple generations
with a noticable increase in system CPU usage when the client counts reach
the stage where the machine is close to saturated. perfpipe for some
weird reason showed a large regression apparent regresion on Broadwell
but the variability was also crazy so probably can be ignored. hackbench
overall looked ok so it's possible I'm wrong about the idle_cores hint
being often wrong on that workload and I should double check that. It's
possible the hint is wrong some of the times but right often enough to
either benefit from using an idle core or by finding an idle sibling which
may be preferable to stacking tasks on the same CPU.

I attached my test result in one of my replies[1]: netperf showed ~3.5%
regression, hackbench improved a lot, and tbench4 drew. I tested several
times and the results didn't seem vary much.


The lack of data and the lack of a note on the potential downside is the
main reason why I'm not acking patch. tbench4 was a particular concern on
my own tests and it's possible a better patch would be a hybrid approach
where a limited search of two cores (excluding target) is allowed even if
SIS_UTIL indicates overload but has_idle_cores is true.


Agreed. And the reason I will exclude this part in v2 is that I plan to
make it part of another feature, SIS filter [2]. The latest version of
SIS filter (not posted yet but soon) will contain all the idle cpus so
we don't need a full scan when has_idle_core. Scanning the filter then
is enough. While it may still cost too much if too many false positive
bits in the filter. Does this direction make sense to you?

[1] https://lore.kernel.org/lkml/eaa543fa-421d-2194-be94-6a5e24a33b37@xxxxxxxxxxxxx/
[2] https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@xxxxxxxxxxxxx/

Thanks & Best Regards,
Abel