Potential scheduler regression

From: Ben Guthro
Date: Wed Jul 05 2017 - 11:42:56 EST


I've been in the process of updating the kernel in our appliance VM
from an old LTS kernel (4.1.y) to something a bit more modern (4.9.y),
and ran into a performance regression while our QA team was running
some regression suites.

I bisected this behavior to the following commit, introduced in the 4.9
merge window:

commit 1b568f0aabf280555125bc7cefc08321ff0ebaba
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Mon May 9 10:38:41 2016 +0200

sched/core: Optimize SCHED_SMT

Avoid pointless SCHED_SMT code when running on !SMT hardware.

Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>

It seems that this commit can have a performance impact on virtual
machines running on VMware ESXi.
Now...this seemed strange to me, since it appears that the bulk of the
change comes down to this code in kernel/sched/core.c:


static void sched_init_smt(void)
{
        /*
         * We've enumerated all CPUs and will assume that if any CPU
         * has SMT siblings, CPU0 will too.
         */
        if (cpumask_weight(cpu_smt_mask(0)) > 1)
                static_branch_enable(&sched_smt_present);
}

I have verified that, in this environment, the vCPU presented to the
guest has hyperthreading enabled but only presents a single
hyperthread, and that cpumask_weight(cpu_smt_mask(0)) resolves to 1.
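
For anyone wanting to repeat that check from userspace, here is a
minimal sketch. It assumes the CPU0 sibling list is read from
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list, which is in
the usual sysfs "cpulist" format (for example "0", "0,4" or "0-1").
The helper names cpulist_weight() and smt_would_be_enabled() are
hypothetical and are not kernel functions; they only mirror the
"weight > 1" comparison quoted above.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Count the CPUs named in a sysfs "cpulist" string such as "0", "0,4"
 * or "0-1,8-9". A trailing newline is tolerated. Returns -1 on
 * malformed input.
 */
int cpulist_weight(const char *list)
{
    const char *p = list;
    int count = 0;

    assert(list != NULL);

    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        long last = first;

        if (end == p || first < 0)
            return -1;      /* no CPU number where one was expected */
        p = end;

        if (*p == '-') {    /* a range of the form "first-last" */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
            p = end;
        }
        count += (int)(last - first + 1);

        if (*p == ',')
            p++;            /* another entry follows */
        else if (*p != '\0' && *p != '\n')
            return -1;      /* unexpected character */
    }
    return count;
}

/* Hypothetical userspace analogue of the "weight > 1" comparison. */
int smt_would_be_enabled(const char *cpu0_siblings)
{
    return cpulist_weight(cpu0_siblings) > 1;
}
```

Passing the contents of the sysfs file to smt_would_be_enabled() gives
1 when CPU0 reports more than one thread sibling and 0 otherwise.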

The single-thread-per-core topology is also visible in the
/proc/cpuinfo and lscpu output:

Results of /proc/cpuinfo for cpu0:

~$ cat /proc/cpuinfo | head -27
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
stepping : 2
microcode : 0x2d
cpu MHz : 2599.732
cache size : 35840 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology
tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma
cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
xsave avx f16c rdrand hypervisor lahf_lm abm epb fsgsbase tsc_adjust
bmi1 avx2 smep bmi2 invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 5199.99
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:

Results of "lscpu" :

~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Stepping: 2
CPU MHz: 2599.732
BogoMIPS: 5199.99
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-3
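
The "Thread(s) per core: 1" figure can be cross-checked against the
cpuinfo output above: "siblings" is the number of logical CPUs in the
package and "cpu cores" is the number of cores, so their ratio is the
number of threads per core (4 / 4 = 1 here). A minimal sketch of that
arithmetic follows; cpuinfo_field() and threads_per_core() are
hypothetical helper names, and the parsing is deliberately simplistic.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the first line in a /proc/cpuinfo-style buffer that begins with
 * `key` and return the integer after its colon, or -1 if not found.
 */
int cpuinfo_field(const char *text, const char *key)
{
    const char *line = text;
    size_t klen;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0) {
            const char *colon = strchr(line, ':');
            const char *eol = strchr(line, '\n');
            int value;

            /* Only accept a colon that sits on this same line. */
            if (colon != NULL && (eol == NULL || colon < eol) &&
                sscanf(colon + 1, "%d", &value) == 1)
                return value;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;         /* step past the newline to the next line */
    }
    return -1;
}

/* Threads per core = siblings / cpu cores; -1 if either is unusable. */
int threads_per_core(const char *cpuinfo)
{
    int siblings = cpuinfo_field(cpuinfo, "siblings");
    int cores = cpuinfo_field(cpuinfo, "cpu cores");

    if (siblings <= 0 || cores <= 0)
        return -1;
    return siblings / cores;
}
```

Feeding the contents of /proc/cpuinfo to threads_per_core() should
agree with what lscpu reports for the first package.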

Now - I suppose that we could just carry a patch that reverts this
commit whenever we update our kernel...but I'd prefer to understand
the problem better, since this currently falls into the category of
"being able to have progress, or understanding, but not necessarily
both".

In advance of the question - the tip of the tree (v4.12, at an earlier
RC version) was tested, and at that time no discernible difference
from 4.9 was noticed WRT this performance regression in our tests.
This code also remains unchanged, AFAICT, in v4.12.

This is my first dip back into LKML in probably 4 years - so apologies
if this has been previously discussed. I tried to do my research ahead
of time - but either this has not been discussed, or my google-fu was
weak when attempting the search parameters.

Do you happen to know what might be happening here?

Thank you in advance for any information that you may be able to provide.

Ben Guthro
SimpliVity / Hewlett Packard Enterprise