Re: [RFC][PATCH 00/16] sched: Core scheduling

From: Ingo Molnar
Date: Tue Feb 19 2019 - 10:15:43 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> >
> > It's just that people have been bugging me for this crap; and I figure
> > I'd post it now that it's not exploding anymore and let others have at.
>
> The patches didn't look disgusting to me, but I admittedly just
> scanned through them quickly.
>
> Are there downsides (maintenance and/or performance) when core
> scheduling _isn't_ enabled? I guess if it's not a maintenance or
> performance nightmare when off, it's ok to just give people the
> option.

So this bit is the main straight-line performance impact when the
CONFIG_SCHED_CORE Kconfig feature is present (which I expect distros to
enable broadly):

+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}

 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
 	return &rq->__lock;
 }

This should, at least in principle, keep the runtime overhead down to mere
NOPs and a bit bigger instruction cache footprint - modulo compiler
shenanigans.
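
To illustrate why the disabled case stays cheap, here is a minimal,
self-contained sketch of the static-key pattern used above (the key name
and helpers are illustrative, not from the patch set; the series uses
__sched_core_enabled):

#include <linux/jump_label.h>
#include <linux/types.h>

/* Illustrative key, default off: the fast path compiles to a NOP. */
static DEFINE_STATIC_KEY_FALSE(my_feature_key);

static inline bool my_feature_enabled(void)
{
	/*
	 * With the key disabled this is straight-line code: no load,
	 * no conditional branch, just a patched-in NOP.
	 */
	return static_branch_unlikely(&my_feature_key);
}

/* Runtime switch: patches the NOP into a jump (and back). May sleep. */
void my_feature_set(bool on)
{
	if (on)
		static_branch_enable(&my_feature_key);
	else
		static_branch_disable(&my_feature_key);
}

The remaining cost with CONFIG_SCHED_CORE=y but the feature switched off is
thus mostly the extra (cold) instructions emitted behind the NOP, which is
what the instruction-cache footprint numbers below reflect.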

Here's the code generation impact on x86-64 defconfig:

text data bss dec hex filename
228 48 0 276 114 sched.core.n/cpufreq.o (ex sched.core.n/built-in.a)
228 48 0 276 114 sched.core.y/cpufreq.o (ex sched.core.y/built-in.a)

4438 96 0 4534 11b6 sched.core.n/completion.o (ex sched.core.n/built-in.a)
4438 96 0 4534 11b6 sched.core.y/completion.o (ex sched.core.y/built-in.a)

2167 2428 0 4595 11f3 sched.core.n/cpuacct.o (ex sched.core.n/built-in.a)
2167 2428 0 4595 11f3 sched.core.y/cpuacct.o (ex sched.core.y/built-in.a)

61099 22114 488 83701 146f5 sched.core.n/core.o (ex sched.core.n/built-in.a)
70541 25370 508 96419 178a3 sched.core.y/core.o (ex sched.core.y/built-in.a)

3262 6272 0 9534 253e sched.core.n/wait_bit.o (ex sched.core.n/built-in.a)
3262 6272 0 9534 253e sched.core.y/wait_bit.o (ex sched.core.y/built-in.a)

12235 341 96 12672 3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
13073 917 96 14086 3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)

10293 477 1928 12698 319a sched.core.n/topology.o (ex sched.core.n/built-in.a)
10363 509 1928 12800 3200 sched.core.y/topology.o (ex sched.core.y/built-in.a)

886 24 0 910 38e sched.core.n/cpupri.o (ex sched.core.n/built-in.a)
886 24 0 910 38e sched.core.y/cpupri.o (ex sched.core.y/built-in.a)

1061 64 0 1125 465 sched.core.n/stop_task.o (ex sched.core.n/built-in.a)
1077 128 0 1205 4b5 sched.core.y/stop_task.o (ex sched.core.y/built-in.a)

18443 365 24 18832 4990 sched.core.n/deadline.o (ex sched.core.n/built-in.a)
20019 2189 24 22232 56d8 sched.core.y/deadline.o (ex sched.core.y/built-in.a)

1123 8 64 1195 4ab sched.core.n/loadavg.o (ex sched.core.n/built-in.a)
1123 8 64 1195 4ab sched.core.y/loadavg.o (ex sched.core.y/built-in.a)

1323 8 0 1331 533 sched.core.n/stats.o (ex sched.core.n/built-in.a)
1323 8 0 1331 533 sched.core.y/stats.o (ex sched.core.y/built-in.a)

1282 164 32 1478 5c6 sched.core.n/isolation.o (ex sched.core.n/built-in.a)
1282 164 32 1478 5c6 sched.core.y/isolation.o (ex sched.core.y/built-in.a)

1564 36 0 1600 640 sched.core.n/cpudeadline.o (ex sched.core.n/built-in.a)
1564 36 0 1600 640 sched.core.y/cpudeadline.o (ex sched.core.y/built-in.a)

1640 56 0 1696 6a0 sched.core.n/swait.o (ex sched.core.n/built-in.a)
1640 56 0 1696 6a0 sched.core.y/swait.o (ex sched.core.y/built-in.a)

1859 244 32 2135 857 sched.core.n/clock.o (ex sched.core.n/built-in.a)
1859 244 32 2135 857 sched.core.y/clock.o (ex sched.core.y/built-in.a)

2339 8 0 2347 92b sched.core.n/cputime.o (ex sched.core.n/built-in.a)
2339 8 0 2347 92b sched.core.y/cputime.o (ex sched.core.y/built-in.a)

3014 32 0 3046 be6 sched.core.n/membarrier.o (ex sched.core.n/built-in.a)
3014 32 0 3046 be6 sched.core.y/membarrier.o (ex sched.core.y/built-in.a)

50027 964 96 51087 c78f sched.core.n/fair.o (ex sched.core.n/built-in.a)
51537 2484 96 54117 d365 sched.core.y/fair.o (ex sched.core.y/built-in.a)

3192 220 0 3412 d54 sched.core.n/idle.o (ex sched.core.n/built-in.a)
3276 252 0 3528 dc8 sched.core.y/idle.o (ex sched.core.y/built-in.a)

3633 0 0 3633 e31 sched.core.n/pelt.o (ex sched.core.n/built-in.a)
3633 0 0 3633 e31 sched.core.y/pelt.o (ex sched.core.y/built-in.a)

3794 160 0 3954 f72 sched.core.n/wait.o (ex sched.core.n/built-in.a)
3794 160 0 3954 f72 sched.core.y/wait.o (ex sched.core.y/built-in.a)

I'd say this one is representative:

text data bss dec hex filename
12235 341 96 12672 3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
13073 917 96 14086 3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)

which ~7% text bloat ((13073 - 12235) / 12235 ≈ 6.8%) is primarily due to
the higher rq-lock inlining overhead, I believe.

This is roughly what you'd expect from a change that wraps all 350+ inlined
instantiations of rq->lock usage. I.e. it might make sense to uninline
rq_lockp().
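
For illustration only, a minimal sketch of what that uninlining could look
like, assuming the accessor is declared in kernel/sched/sched.h and callers
keep going through rq_lockp(), as in the posted series:

/* kernel/sched/sched.h: declaration only, no inline body in the header. */
extern raw_spinlock_t *rq_lockp(struct rq *rq);

/* kernel/sched/core.c: a single out-of-line copy instead of 350+ inlined ones. */
raw_spinlock_t *rq_lockp(struct rq *rq)
{
	if (sched_core_enabled(rq))
		return &rq->core->__lock;

	return &rq->__lock;
}

That trades a function call per lock operation for the smaller text/icache
footprint; whether it's a net win would need measuring on the usual
scheduler micro-benchmarks.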

In terms of long term maintenance overhead, ignoring the overhead of the
core-scheduling feature itself, the rq-lock wrappery is the biggest
ugliness, the rest is mostly isolated.

So if this actually *works*, improves the performance of some real
VMEXIT-poor SMT workloads, and allows HyperThreading to be enabled with
untrusted VMs without inviting thousands of guest-to-host compromises, then
I'm cautiously in support of it.

> That all assumes that it works at all for the people who are clamoring
> for this feature, but I guess they can run some loads on it eventually.
> It's a holiday in the US right now ("Presidents' Day"), but maybe we
> can get some numbers this week?

Such numbers would be *very* helpful indeed.

Thanks,

Ingo