Performance Regression in Linux Kernel 5.19

From: Manikandan Jagatheesan
Date: Fri Sep 09 2022 - 07:46:20 EST

Next message: xkernel . wang: "[PATCH v6] staging: r8188eu: fix a potential memory leak in rtw_init_cmd_priv()"
Previous message: Jonas Oberhauser: "RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models""
Next in thread: Peter Zijlstra: "Re: Performance Regression in Linux Kernel 5.19"

As part of VMware's performance regression testing for Linux
kernel upstream releases, we evaluated the performance of Linux
kernel 5.19 against the 5.18 release and noticed performance
regressions in Linux VMs on ESXi, as shown below.
- Compute (up to -70%)
- Networking (up to -30%)
- Storage (up to -13%)

After bisecting between kernel 5.18 and 5.19, we identified the
root cause as the enablement of the IBRS mitigation for the
spectre_v2 vulnerability by commit 6ad0ad2bf8a6 ("x86/bugs:
Report Intel retbleed vulnerability").

To confirm this, we disabled the above security mitigation
through the kernel boot parameter (spectre_v2=off) in 5.19,
re-ran our tests, and confirmed that performance was on par
with the 5.18 release.
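
Before comparing runs, it can help to confirm inside the guest which
spectre_v2 mitigation is actually active. The following is a minimal
sketch, assuming the standard sysfs file
/sys/devices/system/cpu/vulnerabilities/spectre_v2 is present; the
helper name ibrs_enabled is hypothetical, and the check relies on the
status line naming the primary mitigation first (for example
"Mitigation: IBRS, ...").

```python
from pathlib import Path

# Standard sysfs location of the spectre_v2 status on x86 Linux.
SPECTRE_V2_PATH = Path("/sys/devices/system/cpu/vulnerabilities/spectre_v2")


def ibrs_enabled(status: str) -> bool:
    """Return True if the status line names plain IBRS as the primary mitigation."""
    status = status.strip()
    if not status.startswith("Mitigation:"):
        # Covers "Vulnerable..." (e.g. booted with spectre_v2=off) and "Not affected".
        return False
    # Assumption: the first comma-separated item is the primary mitigation.
    primary = status[len("Mitigation:"):].split(",")[0].strip()
    return primary == "IBRS"


if __name__ == "__main__" and SPECTRE_V2_PATH.exists():
    print("IBRS active:", ibrs_enabled(SPECTRE_V2_PATH.read_text()))
```

Running it once on 5.19 with default settings and once with
spectre_v2=off should show the value flip, which guards against
comparing two runs that were booted with the same mitigation state.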

Performance data and workload details:
=========================
Used Linux VM on ESXi host: Ubuntu20.04.3

ESXi Compute workloads:
----------------------------
Server configs: 112 threads, 4 sockets Skylake with 2TB memory
1. Boot-halt test:
- Configs: Single VM with different CPU and Memory configurations
(1vCPU_32gb, 28vCPU_256gb, 56vCPU_512gb, 84vCPU_1024gb
& 112vCPU_1433gb)
- Test-desc: Measures the time taken by the Guest to boot up and
shut down itself. We have "shutdown -h now" in
rc.local for Linux. Boothalt time is calculated by
using timestamps of following patterns from vmware.log.
* Begin Pattern - " PowerOn"
* End Pattern - "VMX exit"
- Boothalt time = Timestamp(End Pattern) - Timestamp(Begin Pattern)
- Highly affected case: Lower vCPU config is affected (1vCPU_32gb
up to -12%)
- Metric: Secs
- Performance data:
* Immediately before commit: 14.844 secs
* Intel retbleed/IBRS commit: 16.29 secs (absolute diff ~1.4 secs)
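
The boothalt calculation described above can be sketched as follows.
This is only an illustration: it assumes each vmware.log line starts
with an ISO 8601 timestamp followed by "|", and the function names and
sample log lines are hypothetical rather than taken from a real log.

```python
from datetime import datetime

BEGIN_PATTERN = " PowerOn"
END_PATTERN = "VMX exit"


def parse_timestamp(line):
    """Parse the leading timestamp of a log line (assumed ISO 8601 before the first '|')."""
    stamp = line.split("|", 1)[0].strip()
    if stamp.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat().
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


def boothalt_seconds(lines):
    """Boothalt time = Timestamp(End Pattern) - Timestamp(Begin Pattern)."""
    begin = end = None
    for line in lines:
        if begin is None and BEGIN_PATTERN in line:
            begin = parse_timestamp(line)  # first power-on line
        elif END_PATTERN in line:
            end = parse_timestamp(line)    # keep the last exit line
    if begin is None or end is None:
        raise ValueError("begin or end pattern not found in log")
    return (end - begin).total_seconds()


# Hypothetical log lines, for illustration only.
sample_log = [
    "2022-09-09T07:46:20.000Z| vmx| I005: PowerOn",
    "2022-09-09T07:46:25.500Z| vmx| I005: guest running",
    "2022-09-09T07:46:34.844Z| vmx| I005: VMX exit (0).",
]
print(boothalt_seconds(sample_log))
```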

2. Kernel Compile test:
- Configs: Single VM with different CPU and Memory configurations
(1vCPU_4gb, 28vCPU_64gb, 56vCPU_64gb, 84vCPU_64gb,
112vCPU_64gb & 126vCPU_64gb)
- Test-desc: A CPU intensive benchmark. Measures time taken to compile
Linux kernel source (4.9.24).
- Highly affected case: Higher vCPU configs - 112vCPU_64gb (up to -10%)
- Command: make -j 2x$VCPU. This uses all the available CPU threads to
achieve 100% CPU utilization.
Timestamp is recorded in the vmware.log before and after
compiling the source.
* Begin Pattern - "VMQARESULT BEGIN"
* End Pattern - "VMQARESULT END"
- Metric: Secs
- Performance data:
* Immediately before commit: 21.316 secs
* Intel retbleed/IBRS commit: 23.824 secs (absolute diff ~2.5 secs)
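
The "make -j 2x$VCPU" command above is shorthand for running twice as
many make jobs as the VM has vCPUs. A small sketch of how that job
count could be derived, assuming the guest sees one logical CPU per
vCPU; the helper name make_command is hypothetical:

```python
import os


def make_command(vcpus=None):
    """Build the make invocation with two jobs per vCPU."""
    if vcpus is None:
        # Assumption: os.cpu_count() reflects the vCPU count of the guest.
        vcpus = os.cpu_count() or 1
    return ["make", f"-j{2 * vcpus}"]


# For the 112vCPU_64gb configuration this yields 224 parallel jobs.
print(" ".join(make_command(112)))
```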

3. OSbench test:
- Configs: Single VM with 1vCPU_4gb config
- Test-desc: This is a collection of benchmarks that aim to measure
the performance of operating system primitives, such as
process and thread creation and it is publicly available.
(https://www.bitsnbites.eu/benchmarking-os-primitives)
git- https://github.com/mbitsnbites/osbench#readme
To build the benchmarks, we need a C compiler, meson
and ninja.
- Highly affected case: 1vCPU_4gb (up to -70%)
- Command: To run - ./create_threads
- Metric: Milliseconds
- Performance data:
i) create_threads
* Immediately before commit: 16.46 msecs
* Intel retbleed/IBRS commit: 27.97 msecs (absolute diff ~11.5 msecs)
ii) create_processes
* Immediately before commit: 69.03 msecs
* Intel retbleed/IBRS commit: 83.20 msecs (absolute diff ~14 msecs)
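
For time-based metrics such as these, the regression percentage is the
increase in elapsed time relative to the run before the commit. A
short worked example using the two OSbench data points above; the
function name slowdown_percent is just an illustrative choice:

```python
def slowdown_percent(before, after):
    """Relative increase of a lower-is-better metric, in percent."""
    return (after - before) / before * 100.0


# OSbench results in milliseconds: (before commit, IBRS commit).
osbench_msecs = {
    "create_threads": (16.46, 27.97),
    "create_processes": (69.03, 83.20),
}

for name, (before, after) in osbench_msecs.items():
    diff = after - before
    print(f"{name}: +{diff:.2f} msecs, {slowdown_percent(before, after):.1f}% slower")
```

The create_threads case works out to roughly 70%, which matches the
"up to -70%" figure quoted for the Compute category.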

ESXi Networking workloads:
------------------------------
- Server config: 56 threads 2 sockets Skylake with 192G memory
- Benchmark: Netperf 2.7.0
- Topology: A Linux VM on an ESXi host is connected to a Bare Metal
Linux client using back to back direct connection without
involving a physical switch.
- Test-Desc: We measure bulk data transfer and request/response
performance using TCP and UDP protocols.
- Highly affected case: Single VM on 8vCPU with TCP_STREAM RECV
Large packets(256K Socket & 16K Message size)
up to -30%
- Netperf command: (TCP_STREAM_RECV large packets)
netperf -l 60 -H DestinationIP -p port -t TCP_STREAM -- -s 256K
-S 256K -m 16K -M 16K
The Linux VM on the ESXi host acts as RECEIVER and the Bare Metal
Linux host acts as SENDER.
We initiate netperf from the Bare Metal Client Linux host and start
netserver on the Linux VM on the ESXi host, with 16 parallel netperf
streams.
- Metrics: TCP_STREAM(Cpu/Gbits, Gbps), UDP_STREAM(Kilo packets per
second), TCP_RR(ResponseTime in microseconds)
TCP_STREAM_Throughput - Capture Throughput from netperf output file.
TCP_STREAM_CPU - Capture CPU/Gbits from Total CPU spent in all
of the threads in given duration divided by
respective throughput Gbps.
UDP_STREAM Msgs - Capture from netstats & netperf out files.
TCP_RR RespTimeMean - Capture output from netperf out file.
- NIC Model used: Intel(R) Ethernet Controller XL710 for 40GbE QSFP+
- Performance data:
* Immediately before commit: 11.932 Gbps
* Intel retbleed/IBRS commit: 8.56 Gbps (~3.4 Gbps of throughput drop)
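
The message does not say how the 16 parallel streams were launched. One
possible approach, shown here only as a sketch, is to start 16
identical netperf processes on the sender with the options quoted
above. The helper name, the placeholder address and the control port
below are assumptions for illustration.

```python
def netperf_stream_commands(dest_ip, port, streams=16):
    """Return one argv list per parallel TCP_STREAM run (large-packet case)."""
    base = [
        "netperf", "-l", "60", "-H", dest_ip, "-p", str(port),
        "-t", "TCP_STREAM", "--",
        "-s", "256K", "-S", "256K",   # local/remote socket buffer size
        "-m", "16K", "-M", "16K",     # send/receive message size
    ]
    return [list(base) for _ in range(streams)]


# Placeholder destination (the Linux VM running netserver) and port.
commands = netperf_stream_commands("192.0.2.10", 12865)
print(len(commands), "streams, first command:")
print(" ".join(commands[0]))
# Each argv list could then be started with subprocess.Popen(cmd) and waited on.
```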

ESXi Storage workloads:
--------------------------
- Server config: 56 threads 2 sockets Skylake with 192G memory
- Benchmark: FIO v3.20
- Test-Desc: We measure how many read/write I/O operations can be
performed in a given period of time, the average time it
takes to complete an I/O, and the total CPU cycles
spent.
- I/O Block size: 4KiB, 64KiB & 256KiB
- Read write Ratio: 100% read, 100% write & 70/30 mixed readwrite
- Access Patterns: Random & Sequential
- # of VMs: Single VM (1VM_8vCPU) & Multi VMs(16VM_4vCPU)
- Devices under test: Local device and SAN
- Local device: Local NVMe (Intel Corporation DC P3700 SSD)
- SAN connected: QLogic QLE2692 FC-16G (connected to DELL EMC
PowerStore 5000T array)
- Highly affected case: 1VM-cpucost_64K_seq_7030readwrite (up to -13%)
- Throughput and latency tests are not affected.
- Command: fio --name=fio-test --ioengine=libaio --iodepth=16 --rw=rw
--rwmixread=70 --rwmixwrite=30 --bs=65536 --thread --direct=1
--numjobs=8 --group_reporting=1 --time_based --runtime=180
--filename=/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:
/dev/sdg:/dev/sdh:/dev/sdi --significant_figures=10
- Metrics: Throughput (IOPS), Latency (milliseconds) and Cpucost
(CPIO - cycles per I/O)
The new CPIO (internal tool) is implemented simply as a
python script that uses a processor’s performance counters
to arrive at the CPU cycles used in a given duration.
- Command: python3 /usr/lib/vmware/cpio/cpio.pyc -i 25 -n 5 -D all
-v -d -o outputDir
here, 25 is the interval of collection
5 is the number of intervals
all is the device for which we intend to collect data.
- Topology: A standalone server (ESXi image) with local NVMe disks and
an FC-16G HBA is connected to a “DELL EMC PowerStore 5000T”
array for Storage I/O performance measurements.
- Performance data:
* Immediately before commit: 269928 cycles/io
* Intel retbleed/IBRS commit: 303937 cycles/io (absolute
diff 34009 cycles/io)
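
The CPIO tool itself is internal, so the following is not its
implementation. It is a minimal sketch of the metric as described
above, assuming cpucost is the CPU cycles consumed in an interval
divided by the I/Os completed in that interval, followed by the
relative increase between the two reported data points. Function names
and the first set of input numbers are hypothetical.

```python
def cycles_per_io(total_cycles, completed_ios):
    """Cpucost: CPU cycles spent per completed I/O over one interval."""
    if completed_ios <= 0:
        raise ValueError("completed_ios must be positive")
    return total_cycles / completed_ios


def cpucost_increase_percent(before, after):
    """Relative increase in cycles per I/O, in percent."""
    return (after - before) / before * 100.0


# Hypothetical interval: 1,000,000 cycles spent while 4 I/Os completed.
print(cycles_per_io(1_000_000, 4), "cycles/io")

# Reported data points: 269928 cycles/io before, 303937 cycles/io after.
print(f"{cpucost_increase_percent(269928, 303937):.1f}% more cycles per I/O")
```

The second figure is consistent with the "up to -13%" quoted for the
Storage category.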

We believe these findings would be useful to the Linux community and
wanted to document the same.

Manikandan Jagatheesan
Performance Engineering
VMware, Inc.
