[Linux Kernel 5.13 GA] ESXi Performance regression

From: Abdul Anshad Azeez
Date: Fri Jul 30 2021 - 08:27:32 EST


As part of VMware's performance regression testing for upstream Linux
kernel releases, we evaluated the performance of Linux kernel 5.13
against the 5.12 release. Our evaluation revealed performance
regressions of up to 3x in ESXi Compute workloads and up to 40% in
ESXi Networking workloads.

After bisecting between kernel 5.12 and 5.13, we identified the root
cause as a scheduler-related commit from Peter Zijlstra:
8a99b6833c884fa0e7919030d93fecedc69fc625 ("sched: Move SCHED_DEBUG
sysctl to debugfs"). The issue appears to arise because this commit
changes the default value of "sched_wakeup_granularity_ns". More
details are below.

Impacted test case details:

1. Compute:
- VM Config - RHEL 8.1 - 1VM with 8vCPU & 16G Memory
- Benchmark - kernel compile
- Measures time taken to compile Linux kernel source code (Linux
kernel version used - 4.9.24)
- make -j 2xVCPU - This uses all the available CPU threads to achieve
100% CPU utilization

2. Networking:
- VM Config - RHEL 8.1 - 1VM with 8vCPU & 16G Memory and 8VM with
4vCPU & 8G Memory
- Benchmark - Netperf
- Netperf TCP_STREAM RECV, small packets (8K socket & 256B message,
TCP_NODELAY set) - Throughput (1VM)
- Netperf UDP_STREAM RECV (256K socket & 256B message) - Packet rate
(8VM)
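
The Compute benchmark above could be timed with a small harness like
the following. This is a minimal sketch, not the harness used for
these measurements: the function names and the "src_dir" argument are
hypothetical, and it assumes "make" and an already configured kernel
source tree are available in the guest.

```python
import os
import subprocess
import time


def make_command(vcpus):
    """Build the make invocation with twice as many jobs as vCPUs."""
    return ["make", "-j{}".format(2 * vcpus)]


def time_kernel_compile(src_dir, vcpus=None):
    """Return wall-clock seconds for one build in src_dir.

    src_dir is a hypothetical path to a configured kernel tree.
    """
    if vcpus is None:
        vcpus = os.cpu_count() or 1
    start = time.monotonic()
    # Discard build output so terminal I/O does not skew the timing.
    subprocess.run(make_command(vcpus), cwd=src_dir, check=True,
                   stdout=subprocess.DEVNULL)
    return time.monotonic() - start
```

For the 8 vCPU VM described above, make_command(8) yields
"make -j16".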

From our testing, the overall results indicate that the
above-mentioned commit introduced performance regressions in the
kernel compile workload in the Compute area and in the Networking
test cases with high packet rates.

We noticed that Peter Zijlstra's commit moved the scheduler tunables
to the debugfs file system. On taking a closer look, we found that
the values of two of these tunables differ before and after the
above-mentioned commit.

1. Before:
sched_min_granularity_ns - 10000000 (10ms)
sched_wakeup_granularity_ns - 15000000 (15ms)

2. After:
sched_min_granularity_ns - 3000000 (3ms)
sched_wakeup_granularity_ns - 4000000 (4ms)
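
The values above can be read on either kernel with a helper along
these lines. This is a sketch under assumptions: it expects the
pre-commit files under /proc/sys/kernel with a "sched_" prefix and
the post-commit files under /sys/kernel/debug/sched (debugfs mounted
at its usual location, typically readable only by root). The function
name and parameters are hypothetical.

```python
from pathlib import Path

# Pre-commit location (sysctl) and post-commit location (debugfs).
# The debugfs mount point is an assumption.
SYSCTL_DIR = Path("/proc/sys/kernel")
DEBUGFS_DIR = Path("/sys/kernel/debug/sched")


def read_tunable(name, sysctl_dir=SYSCTL_DIR, debugfs_dir=DEBUGFS_DIR):
    """Return the tunable in nanoseconds, or None if it cannot be read.

    name is the short name, e.g. "wakeup_granularity_ns".
    """
    candidates = [Path(sysctl_dir) / ("sched_" + name),
                  Path(debugfs_dir) / name]
    for path in candidates:
        try:
            return int(path.read_text().strip())
        except (FileNotFoundError, PermissionError):
            continue
    return None
```

Calling read_tunable("wakeup_granularity_ns") and
read_tunable("min_granularity_ns") on both kernels gives the two
pairs of values to compare.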

With further experiments, we confirmed that the value of
"sched_wakeup_granularity_ns" is what drives these performance
regressions. On setting "sched_wakeup_granularity_ns" back to
"15000000" with Peter Zijlstra's commit applied, we regain the lost
performance in our Compute & Networking workloads.
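
For anyone who wants to repeat this experiment without rebuilding the
kernel, the value can also be written at run time. The sketch below
is an assumption-laden alternative, not the method used for the
results in this report: it assumes a kernel built with
CONFIG_SCHED_DEBUG, debugfs mounted at /sys/kernel/debug, and root
privileges. The function name is hypothetical.

```python
from pathlib import Path

# Assumed debugfs mount point; adjust if debugfs is mounted elsewhere.
DEBUGFS_SCHED = Path("/sys/kernel/debug/sched")


def set_wakeup_granularity_ns(value_ns, sched_dir=DEBUGFS_SCHED):
    """Write value_ns to wakeup_granularity_ns and return the value read back."""
    if value_ns <= 0:
        raise ValueError("granularity must be positive")
    path = Path(sched_dir) / "wakeup_granularity_ns"
    path.write_text("{}\n".format(value_ns))
    # Read back so the caller can verify what the kernel accepted.
    return int(path.read_text().strip())
```

For example, set_wakeup_granularity_ns(15000000) would request the
15ms value used before the commit.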

Further, we collected guest scheduling stats during the kernel
compile workload and observed more involuntary switches forced by
the scheduler when "sched_wakeup_granularity_ns" is set to
"4000000".

1. "sched_wakeup_granularity_ns = 4000000" (3 iterations):
nr_involuntary_switches : 3
nr_involuntary_switches : 2
nr_involuntary_switches : 2

2. "sched_wakeup_granularity_ns = 15000000" (3 iterations):
nr_involuntary_switches : 0
nr_involuntary_switches : 0
nr_involuntary_switches : 0
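
A counter like the one above can be collected per task with a short
parser. This is a sketch that assumes the stats come from
/proc/<pid>/sched (available with CONFIG_SCHED_DEBUG); the report
does not state which interface was used, and the function names are
hypothetical.

```python
import re

# Matches a line such as "nr_involuntary_switches   :   3".
_FIELD = re.compile(r"^\s*nr_involuntary_switches\s*:\s*(\d+)\s*$",
                    re.MULTILINE)


def involuntary_switches(sched_text):
    """Return the nr_involuntary_switches count from sched_text, or None."""
    match = _FIELD.search(sched_text)
    return int(match.group(1)) if match else None


def read_involuntary_switches(pid):
    """Read the counter for one task from /proc/<pid>/sched (assumed source)."""
    with open("/proc/{}/sched".format(pid)) as f:
        return involuntary_switches(f.read())
```

Sampling the counter before and after a run and taking the difference
gives the number of involuntary switches attributable to that run.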

So, we believe that decreasing the value of
"sched_wakeup_granularity_ns" causes more preemption of running
processes, which hurts the CPU-bound tasks: the kernel compile and
Netperf high packet rate workloads.

Also, since the Linux 5.14-rc3 kernel was recently released, we
repeated the same experiments on 5.14-rc3 and observed the same
regressions in both areas (Compute & Networking).

We would like to understand the reason behind the change in default
values for the above two scheduler tunables. Since changing the
value of "sched_wakeup_granularity_ns" from 15ms to 4ms forces more
involuntary switches, which in turn introduces the performance
regression, can this be changed back to 15ms?

Abdul Anshad Azeez
Performance Engineering
VMware, Inc.