Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

From: Gautham R. Shenoy
Date: Wed Aug 02 2023 - 02:32:47 EST


Hello David,


On Tue, Jul 25, 2023 at 03:22:55PM -0500, David Vernet wrote:
> On Fri, Jul 21, 2023 at 02:42:57PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
>
> Hello Gautham,
>
> Thank you for taking the time to run these benchmarks. Apologies for the
> delayed response -- I've been traveling this week.

No issues. As you can see, there has been a delay from my end as well.


>
> > On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> > > Changes
> > > -------
> > >
> > > This is v2 of the shared wakequeue (now called shared runqueue)
> > > patchset. The following are changes from the RFC v1 patchset
> > > (https://lore.kernel.org/lkml/20230613052004.2836135-1-void@xxxxxxxxxxxxx/).
> > >
> > > v1 -> v2 changes:
> > > - Change name from swqueue to shared_runq (Peter)
> > >
> > > - Sharded per-LLC shared runqueues to avoid contention on
> > > scheduler-heavy workloads (Peter)
> > >
> > > - Pull tasks from the shared_runq in newidle_balance() rather than in
> > > pick_next_task_fair() (Peter and Vincent)
> > >
> > > - Rename a few functions to reflect their actual purpose. For example,
> > > shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
> > >
> > > - Expose move_queued_task() from core.c rather than migrate_task_to()
> > > (Peter)
> > >
> > > - Properly check is_cpu_allowed() when pulling a task from a shared_runq
> > > to ensure it can actually be migrated (Peter and Gautham)
> > >
> > > - Dropped RFC tag
> > >
> > > This patch set is based off of commit ebb83d84e49b ("sched/core: Avoid
> > > multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") on the
> > > sched/core branch of tip.git.
> >
> > I have evaluated this v2 patchset on AMD Zen3 and Zen4 servers.
> >
> > tldr:
> >
> > * We see non-trivial improvements on hackbench on both Zen3 and Zen4
> > until the system is super-overloaded, at which point we see
> > regressions.
>
> This makes sense to me. SHARED_RUNQ is more likely to help performance
> when the system is not over-utilized, as it has more of a chance to
> actually increase work conservation. If the system is over-utilized,
> it's likely that a core will be able to find a task regardless of
> whether it looks at the shared runq.
>
> That said, I wasn't able to reproduce the regressions (with --groups 16)
> on my 7950X, presumably because it only has 8 cores / CCX.
>
> > * tbench shows regressions on Zen3 with each client
> > configuration. tbench on Zen4 shows some improvements when the system is
> > overloaded.
>
> Hmm, I also observed tbench not performing well with SHARED_RUNQ on my
> Zen4 / 7950X, but only with heavy load. It also seems that sharding
> helps a lot for tbench on Zen3, whereas Zen4 performs fine without it.
> I'm having trouble reasoning about why Zen4 wouldn't require sharding
> whereas Zen3 would given that Zen4 has more cores per CCX.

Yes, I have been thinking about it as well. Both the Zen3 (Milan) and
Zen4 (Bergamo) servers that I ran these tests on have 8 cores per
CCX. Bergamo has 2 CCXes per CCD, while Milan has 1 CCX per CCD. We
don't currently model the CCD in the sched-domain hierarchy, so from
the point of view of the LLC domain (which is the CCX), the number of
cores per LLC is identical on the two systems.

>
> Just to verify -- these benchmarks were run with boost disabled,
> correct? Otherwise, there could be a lot of run-to-run variance
> depending on thermal throttling.


Checking my scripts: these benchmarks were run with C2 disabled, the
performance governor selected, and acpi-cpufreq as the scaling
driver. Boost was enabled, so yes, there could be run-to-run
variance. I can rerun them this weekend with boost disabled. I also
need to understand the overloaded cases of tbench and netperf where
the shared runqueue performs better.


>
> >
> > * netperf shows minor improvements on Zen3 when the system is under
> > low to moderate load. netperf regresses on Zen3 at high load, and at
> > all load-points on Zen4.
>
> netperf in general seems to regress as the size of the LLC increases due
> to it relentlessly hammering the runqueue, though it's still surprising
> to me that your Zen4 test showed regressions under low / moderate load
> as well. Was this with -t TCP_RR, or -t UDP_RR? I observed SHARED_RUNQ
> improving performance on my 7950X for -t TCP_RR as described on [0], so
> I'd be curious to better understand where the slowdowns are coming from
> (presumably it's just contending on the shard lock due to having a
> larger CCX?)


I ran netperf in TCP_RR mode with the server running on localhost. The
exact command is:

netperf -H 127.0.0.1 -t TCP_RR -l 100 -- -r 100 \
-k REQUEST_SIZE,RESPONSE_SIZE,ELAPSED_TIME,THROUGHPUT,THROUGHPUT_UNITS,MIN_LATENCY,MEAN_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MAX_LATENCY,STDDEV_LATENCY

I have yet to debug why we are seeing a performance drop in the
low-utilization cases.


>
> [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/
>
> > * Stream, SPECjbb2015 and Mongodb show no significant difference compared
> > to the current tip.
> >
> > * With netperf and tbench, using the shared-runqueue during
> > enqueue_entity performs badly.
>
> My reading of your Zen4 numbers on tbench seem to imply that it actually
> performs well under heavy load. Copying here for convenience:

>
> Zen4, 2 Sockets, 128 cores per socket, SMT2:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Clients: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
> 1 1.00 [ 0.00]( 0.19) 0.98 [ -1.72]( 0.19) 0.99 [ -1.15]( 0.28) 0.98 [ -1.79]( 0.28) 0.99 [ -1.49]( 0.10)
> 2 1.00 [ 0.00]( 0.63) 0.98 [ -2.28]( 0.63) 0.98 [ -1.91]( 0.26) 0.97 [ -3.14]( 0.25) 0.98 [ -1.77]( 0.32)
> 4 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 1.13) 0.99 [ -0.69]( 0.57) 0.98 [ -1.59]( 0.35) 0.99 [ -0.64]( 0.18)
> 8 1.00 [ 0.00]( 1.14) 0.99 [ -0.73]( 0.61) 0.98 [ -2.28]( 2.61) 0.97 [ -2.56]( 0.34) 0.98 [ -1.77]( 0.70)
> 16 1.00 [ 0.00]( 0.98) 0.97 [ -2.54]( 1.24) 0.98 [ -1.71]( 1.86) 0.98 [ -1.53]( 0.62) 0.96 [ -3.56]( 0.93)
> 32 1.00 [ 0.00]( 0.76) 0.98 [ -2.31]( 1.35) 0.98 [ -2.06]( 0.77) 0.96 [ -3.53]( 1.63) 0.88 [-11.72]( 2.77)
> 64 1.00 [ 0.00]( 0.96) 0.96 [ -4.45]( 3.53) 0.97 [ -3.44]( 1.53) 0.96 [ -3.52]( 0.89) 0.31 [-69.03]( 0.64)
> 128 1.00 [ 0.00]( 3.03) 0.95 [ -4.78]( 0.56) 0.98 [ -2.48]( 0.47) 0.92 [ -7.73]( 0.16) 0.20 [-79.75]( 0.24)
> 256 1.00 [ 0.00]( 0.04) 0.93 [ -7.21]( 1.00) 0.94 [ -5.90]( 0.63) 0.59 [-41.29]( 1.76) 0.16 [-83.71]( 0.07)
> 512 1.00 [ 0.00]( 3.08) 1.07 [ 7.07](17.78) 1.15 [ 15.49]( 2.65) 0.82 [-17.53](29.11) 0.93 [ -7.18](32.23)
> 1024 1.00 [ 0.00]( 0.60) 1.16 [ 15.61]( 0.07) 1.16 [ 15.92]( 0.06) 1.12 [ 11.57]( 1.86) 1.12 [ 11.97]( 0.21)
> 2048 1.00 [ 0.00]( 0.16) 1.15 [ 14.62]( 0.90) 1.15 [ 15.20]( 0.29) 1.08 [ 7.64]( 1.44) 1.15 [ 14.57]( 0.23)
>
> I'm also struggling to come up for an explanation for why Zen4 would
> operate well with SHARED_RUNQ under heavy load. Do you have a theory?

Yes. My theory is that with SHARED_RUNQ we delay entering the idle
state, because each idle entry first acquires the shared-runqueue lock
and checks the queue. So perhaps this is helping in an unintended
manner. I want to rerun those cases while collecting idle statistics.
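
To make that concrete, the path I have in mind is roughly the
following (only a sketch; shared_runq_pick_task() is an illustrative
name and may not match the exact helper in your patches):

static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
        struct task_struct *p;

        /*
         * Before the CPU commits to going idle, look at the per-LLC
         * shared runqueue. Taking the shard lock and scanning the
         * list delays the actual idle entry, which may be what is
         * (unintentionally) helping the overloaded tbench case.
         */
        p = shared_runq_pick_task(this_rq);
        if (p)
                return 1;       /* found work; do not enter idle */

        /* ... fall back to the regular newidle load balancing ... */
        return 0;
}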


>
> > Server configurations used:
> >
> > AMD Zen3 Server:
> > * 2 sockets,
> > * 64 cores per socket,
> > * SMT2 enabled
> > * Total of 256 threads.
> > * Configured in Nodes-Per-Socket(NPS)=1
> >
> > AMD Zen4 Server:
> > * 2 sockets,
> > * 128 cores per socket,
> > * SMT2 enabled
> > * Total of 512 threads.
> > * Configured in Nodes-Per-Socket(NPS)=1
> >
> > The trends on NPS=2 and NPS=4 are similar. So I am not posting those.
> >
> >
> > Legend:
> > tip : Tip kernel with top commit ebb83d84e49b
> > ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
> >
> > swqueue_v1 : Your v1 patches applied on top of the aforementioned tip commit.
> >
> > noshard : shared-runqueue v2 patches 1-5. This uses a shared-runqueue
> > during wakeup. No sharding.
> >
> > shard_wakeup : shared-runqueue v2 patches 1-6. This uses a
> > shared-runqueue during wakeup and has shards with
> > shard size = 6 (default)
> >
> > shard_all : v2 patches 1-7. This uses a sharded shared-runqueue during
> > enqueue_entity
>
> So, what's your overall impression from these numbers? My general
> impression so far is the following:
>
> - SHARED_RUNQ works best when the system would otherwise be
> under-utilized. If the system is going to be overloaded, it's unlikely
> to provide a significant benefit over CFS, and may even just add
> overhead with no benefit (or just cause worse cache locality).


I agree with you here. The only thing that saw a consistent benefit
was hackbench under moderate load.

>
> - SHARED_RUNQ isn't well-suited to workloads such as netperf which
> pummel the scheduler. Sharding helps a lot here, but doesn't
> completely fix the problem depending on how aggressively tasks are
> hammering the runqueue.

Yeah. Even with sharding (which I assume would definitely help on
platforms with a larger LLC domain), each idle entry still results in
searching the shared runqueue, and the probability of finding
something there is very low when the workload's tasks run for very
short durations.
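
For these short-running netperf/tbench tasks, each idle entry then
boils down to something like the sketch below (struct and field names
are made up; the real pick also dequeues the task and checks
is_cpu_allowed()):

static struct task_struct *
shared_runq_shard_pick(struct shared_runq_shard *shard)
{
        struct task_struct *p = NULL;

        raw_spin_lock(&shard->lock);
        if (!list_empty(&shard->list))
                p = list_first_entry(&shard->list, struct task_struct,
                                     shared_runq_node);
        raw_spin_unlock(&shard->lock);

        /*
         * For sub-millisecond tasks this is almost always NULL, so we
         * pay the lock and cacheline cost on every idle entry with no
         * payoff.
         */
        return p;
}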

>
> - To the point above, using SHARED_RUNQ in __enqueue_entity() /
> __dequeue_entity(), rather than just on the wakeup path, is a net
> positive. Workloads which hammer the runq such as netperf or schbench
> -L -m 52 -p 512 -r 10 -t 1 will do poorly in both scenarios, so we may
> as well get the better work conservation from __enqueue_entity() /
> __dequeue_entity(). hackbench is one example of a workload that
> benefits from this, another is kernel compile, and I strongly suspect
> that HHVM would similarly benefit.

Well, the magnitude of the performance degradation is much higher for
tbench and netperf when shared_runq is used in the
__enqueue_entity()/__dequeue_entity() path, so it is very workload
dependent. I would like to try a variant that uses shared_runq in the
__enqueue_entity()/__dequeue_entity() path but without sharding, just
to see if it makes any difference.



>
> - Sharding in general doesn't seem to regress performance by much when
> it wouldn't have otherwise been necessary to avoid contention.
> hackbench is better without sharding on Zen3, but it's also better
> with shard_all on Zen4.


>
> In general, if our goal is to better support hosts with large CCXs, I
> think we'll just need to support sharding.

I think the shard size needs to be determined as a function of the
LLC size, or the arch-specific code should pick a size that suits a
particular generation. At least on the Zen3 and Zen4 servers with 8
cores per LLC domain, creating shards did not provide any additional
benefit.
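
As a strawman, something along these lines is what I mean (purely
illustrative; the threshold and divisor are made up, DIV_ROUND_UP is
just the usual kernel helper):

static int shared_runq_nr_shards(int llc_weight)
{
        /* A single shard is enough for small LLC domains like ours. */
        if (llc_weight <= 8)
                return 1;

        /* Otherwise aim for roughly 6 CPUs per shard. */
        return DIV_ROUND_UP(llc_weight, 6);
}

The arch-specific code could then override whatever default we settle
on for a given generation.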


>
> Thoughts?
>
> I have the v3 version of the patch set which properly supports domain
> recreation and hotplug, but I still need to get updated benchmark
> numbers on it, as well as benchmark spreading a shared_runq over
> multiple CCXs per Peter's comment in [1] about the initial motivation
> behind SIS_NODE also applying to SHARED_RUNQ.

> [1]: https://lore.kernel.org/all/20230711114207.GK3062772@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Based on the various flavors of SIS_NODE that we have experimented
with on the EPYC servers, it seems to work very well when the
probability of finding an idle core/CPU in the wider sched-domain is
high. In that case, the extra time spent searching for that idle
core/CPU is justified by the fact that the task gets to run
immediately. However, as the utilization on the system increases, we
are less likely to find an idle core/CPU, and the additional time
spent searching shows up as a regression. What we need is a way to
limit the downside in the latter case without lowering the upside we
see in the low to moderate utilization cases.


>
> Given the points above, I would ideally like to just run the shard_all
> variant and compare that to the numbers I collected on v2 and shared in
> [2]. What do you think?

Would that be a fair comparison, though? SIS_NODE only widens the
search during wakeups, while shard_all would add a task to the
shared_runq even during a regular enqueue.

> There will be tradeoffs no matter what we choose
> to do, but enqueuing / dequeuing in __enqueue_entity() /
> __dequeue_entity() seems to perform the best for workloads that don't
> hammer the runqueue, and sharding seems like a given if we do decide to
> do that.

Or we could see if we can avoid using shared_runq/SIS_NODE when the
probability of reducing scheduling latency and improving utilization
is low. In such cases the default scheduling strategy should just
work fine. However, I don't know of any clean way to detect such a
situation. That quest is still on :-)

>
> [2]: https://lore.kernel.org/all/20230710200342.358255-1-void@xxxxxxxxxxxxx/
>
> Thanks,
> David

--
Thanks and Regards
gautham.