[RFC 00/26] Intel Thread Director Virtualization

From: Zhao Liu
Date: Sat Feb 03 2024 - 04:00:02 EST


From: Zhao Liu <zhao1.liu@xxxxxxxxx>

Hi list,

This is our RFC to virtualize the Intel Thread Director (ITD) feature
for Guests. It is based on Ricardo's patch series that adds ITD-related
support to the HFI driver ("[PATCH 0/9] thermal: intel: hfi: Prework for
the virtualization of HFI" [1]).

In short, the purpose of this patch set is to enable the ITD-based
scheduling logic in Guest so that Guest can better schedule Guest tasks
on Intel hybrid platforms.

Currently, ITD is necessary for Windows VMs. With ITD virtualization
support, a Windows 11 Guest can see a significant performance
improvement (for example, up to 14%+ on 3DMARK on an i9-13900K).

Our ITD virtualization is not bound to the VM's hybrid topology or the
vCPUs' CPU affinity. However, in our practice, the ITD scheduling
optimization for Win11 VMs works best when combined with hybrid topology
and CPU affinity (this is related to the specific implementation of
Win11 scheduling). For more details, please see Section 1.2 "About
hybrid topology and vCPU pinning".

To enable ITD-related scheduling optimization in a Win11 VM, some other
thermal-related support is also needed (HWP, CPPC), but we can emulate
it with dummy values in the VMM (we will send out extra patches for
these in the future).

Welcome your feedback!


1. Background and Motivation
============================

1.1. Background
^^^^^^^^^^^^^^^

Our use case is running games in a client Windows VM as a cloud gaming
solution.

Gaming VMs are performance-sensitive VMs on Client, so they usually
have two characteristics to ensure interactivity and performance:

i) The number of vCPUs is equal to or close to the number of Host pCPUs.

ii) The vCPUs of the Gaming VM are often bound to pCPUs to get
exclusive resources and avoid the overhead of migration.

In this case, the Host can't provide effective scheduling for the Guest,
so we need to deliver more hardware-assisted scheduling capabilities to
the Guest to enhance its scheduling.

Windows 11 (and future Windows products) is heavily optimized for the
Intel hybrid platform. To get the best performance, we need to
virtualize hybrid scheduling features (HFI/ITD) for Windows Guest.


1.2. About hybrid topology and vCPU pinning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Our ITD virtualization can support most vCPU topologies (except multiple
packages/dies, see details in 3.5 Restrictions on Guest Topology), and
can also support non-pinned vCPUs (i.e. it can handle vCPU thread
migration).

The following is our performance measurement on an i9-13900K machine
(2995 MHz, 24 cores, 32 threads (8+16), RAM: 14GB (16GB physical)), with
iGPU passthrough, running 3DMARK in a Win11 Professional Guest:


Compared with smp topo case     smp topo     smp topo     smp topo     hybrid topo  hybrid topo  hybrid topo  hybrid topo
                                + affinity   + ITD        + ITD                     + affinity   + ITD        + ITD
                                                          + affinity                                          + affinity
Time Spy - Overall                0.179%      -0.250%       0.179%      -0.107%       0.143%      -0.179%      -0.107%
  Graphics score                  0.124%      -0.249%       0.124%      -0.083%       0.124%      -0.166%      -0.249%
  CPU score                       0.916%      -0.485%       1.149%      -0.076%       0.722%      -0.324%      11.915%
Fire Strike Extreme - Overall     0.149%       0.000%       0.224%      -1.021%      -3.361%      -1.319%      -3.361%
  Graphics score                  0.100%       0.050%       0.150%      -1.376%      -3.427%      -1.676%      -3.652%
  Physics score                   5.060%       0.759%       0.518%      -2.907%     -10.914%      -0.897%      14.638%
  Combined score                  0.120%      -0.179%       0.418%       0.060%      -2.929%      -0.179%      -2.809%
Fire Strike - Overall             0.350%      -0.085%       0.193%      -1.377%      -1.365%      -1.509%      -1.787%
  Graphics score                  0.256%      -0.047%       0.210%      -1.527%      -1.376%      -1.504%      -2.320%
  Physics score                   3.695%      -2.180%       0.629%      -1.581%      -6.846%      -1.444%      14.100%
  Combined score                  0.415%      -0.128%       0.128%      -0.957%      -1.052%      -1.594%      -0.957%
CPU Profile Max Threads           1.836%       0.298%       1.786%      -0.069%       1.545%       0.025%       9.472%
  16 Threads                      4.290%       0.989%       3.588%       0.595%       1.580%       0.848%      11.295%
  8 Threads                     -22.632%      -0.602%     -23.167%      -0.988%      -1.345%      -1.340%       8.648%
  4 Threads                     -21.598%       0.449%     -21.429%      -0.817%       1.951%      -0.832%       2.084%
  2 Threads                     -12.912%      -0.014%     -12.006%      -0.481%      -0.609%      -0.595%       1.161%
  1 Thread                       -3.793%      -0.137%      -3.793%      -0.495%      -3.189%      -0.495%       1.154%


Based on the above results, exposing only HFI/ITD to Win11 VMs without
hybrid topology or CPU affinity (case "smp topo + ITD") does not hurt
performance, but does not improve it either.

With both hybrid topology and CPU affinity set for ITD, Win11 VMs get a
significant performance improvement (up to 14%+, compared with the case
of smp topology without CPU affinity).

The improvement is not limited to the 3DMARK numbers: in practice,
there is also a significant improvement in the frame rate of games.

Also, the more powerful the machine, the more significant the
performance gains!

Therefore, the best practice for enabling ITD scheduling optimization
is to set up both CPU affinity and hybrid topology for the Win11 Guest
while enabling our ITD virtualization.

Our earlier QEMU prototype RFC [2] presented the initial hybrid
topology support for VMs. Our other proposal, "QOM topology" [3], has
since been raised in the QEMU community; it is the first step towards
a QOM-based hybrid topology implementation.


2. Introduction of HFI and ITD
==============================

Intel provides the Hardware Feedback Interface (HFI) feature to allow
hardware to guide the OS scheduler towards optimal workload scheduling
through a hardware feedback interface structure in memory [4]. This
structure is called the HFI table.

For now, the guidance includes performance and energy efficiency
hints, and it can be updated via thermal interrupt as the actual
operating conditions of the processor change during run time.

The Intel Thread Director (ITD) feature extends HFI to provide
performance and energy efficiency data for advanced classes of
instructions.

Since ITD is an extension of HFI, our ITD virtualization also
virtualizes the native HFI feature.


3. Dependencies of ITD
======================

ITD is a thermal feature that requires:
* PTM (Package Thermal Management, alias, PTS)
* HFI (Hardware Feedback Interface)

In order to support the notification mechanism of ITD/HFI dynamic
update, we also need to add thermal interrupt related support,
including the following two features:
* ACPI (Thermal Monitor and Software Controlled Clock Facilities)
* TM (Thermal Monitor, alias, TM1/ACC)

Therefore, we must also consider support for the emulation of all
the above dependencies.
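
As a rough illustration of the dependency list above, the following
user-space sketch checks the three CPUID bits that this series names in
its patch titles (ACPI and TM/ACC in CPUID.0x01.edx, PTM/PTS in
CPUID.0x06.eax). The helper name and macros are hypothetical and are not
taken from the patches; the HFI/ITD feature bits themselves are left out
because their positions are not spelled out in this cover letter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in this series' patch titles. */
#define CPUID_01_EDX_ACPI  (1u << 22)  /* thermal monitor and clock control MSRs */
#define CPUID_01_EDX_TM    (1u << 29)  /* TM/ACC: automatic thermal monitoring */
#define CPUID_06_EAX_PTM   (1u << 6)   /* PTM/PTS: package thermal management */

/*
 * Hypothetical helper: returns true only if every thermal dependency
 * listed above is present in the given CPUID register values.
 */
static bool itd_thermal_deps_present(uint32_t leaf1_edx, uint32_t leaf6_eax)
{
	const uint32_t need_edx = CPUID_01_EDX_ACPI | CPUID_01_EDX_TM;

	return (leaf1_edx & need_edx) == need_edx &&
	       (leaf6_eax & CPUID_06_EAX_PTM) != 0;
}
```

A VMM could run a check of this kind on the CPUID values it is about to
present to the Guest before deciding to enable ITD.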


3.1. ACPI emulation
^^^^^^^^^^^^^^^^^^^

For ACPI, we can support it by emulating the RDMSR/WRMSR of the
associated MSRs and adding the ability to inject thermal interrupts.
In fact, we never inject thermal interrupts into the Guest for the
thermal conditions corresponding to ACPI. The thermal interrupt
support is prepared here for the subsequent HFI/ITD emulation.
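
The series also carries a patch for handling writes to an MSR's R/O and
R/WC0 bits. A minimal sketch of how such a WRMSR could be resolved is
shown below. The function name and the mask parameters are hypothetical,
and the sketch assumes that writing 1 to an R/WC0 bit leaves it
unchanged; reserved-bit checks (which would normally raise #GP) are not
modelled.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the new value of an emulated MSR.
 *   ro_mask   - bits the guest can never change
 *   rwc0_mask - bits the guest can clear by writing 0; writing 1 is
 *               assumed here to leave the bit as it was
 * All remaining bits are treated as plain read/write.
 */
static uint64_t msr_apply_guest_write(uint64_t old, uint64_t data,
				      uint64_t ro_mask, uint64_t rwc0_mask)
{
	uint64_t rw_mask = ~(ro_mask | rwc0_mask);

	return (old & ro_mask) |           /* read-only bits keep the old value */
	       (old & data & rwc0_mask) |  /* R/WC0 bits can only go from 1 to 0 */
	       (data & rw_mask);           /* read/write bits take the new value */
}
```

With this shape, a Guest write can acknowledge a status bit by writing 0
to it, but can never set a status bit on its own.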


3.2. TM emulation
^^^^^^^^^^^^^^^^^

TM is a hardware feature and its CPUID bit only indicates the presence
of the automatic thermal monitoring facilities. For TM, there's no
interactive interface between OS and hardware, but its flag is one of
the prerequisites for the OS to enable thermal interrupt.

Therefore, to support TM, it is enough for us to expose its CPUID flag
to the Guest.


3.3. PTM emulation
^^^^^^^^^^^^^^^^^^

PTM is a package-scope feature that includes package-level MSRs and a
package-level thermal interrupt. Unfortunately, KVM currently only
supports thread-scope MSR handling and is not aware of the Guest's
topology.

Since our purpose in supporting PTM in KVM is to further support ITD,
and the current platforms with ITD all have a single package, we
emulate the package-scope MSRs provided by PTM at the VM level.

As a result, the VMM is required to set a single-package topology for
PTM. To limit the impact of this restriction, we only expose the PTM
feature bit to the Guest when ITD needs to be supported.


3.4. HFI emulation
^^^^^^^^^^^^^^^^^^

ITD is an extension of HFI, so both HFI and ITD depend on the HFI table.
HFI itself is used on the Host for power-related management control, so
we should only expose HFI to the Guest when we need to enable ITD.

HFI also relies on PTM interrupt control, so it has the same
requirement on package topology, and we also emulate HFI (including
ITD) at the VM level.

In addition, because the HFI driver allocates HFI instances per die,
a Guest running with HFI (and ITD) must also be limited to a single
die.


3.5. Restrictions on Guest Topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to KVM's incomplete support for MSR topology and the requirement for
HFI instance management in the kernel, PTM, HFI, and ITD limit the
topology of the Guest (mainly restricting the topology types created on
the VMM side).

Therefore, we only expose PTM, HFI, and ITD to userspace when we need to
support ITD. At the same time, considering that currently, ITD is only
used on the client platform with 1 package and 1 die, such temporary
restrictions will not have too much impact.


4. Overview of ITD (and HFI) virtualization
===========================================

The main tasks of ITD (including HFI) virtualization are:
* maintain a virtual HFI table for VM.
* inject thermal interrupt when HFI table updates.
* handle related MSRs' emulation and adjust HFI table based on MSR's
control bits.
* expose ITD/HFI configuration info in related CPUID leaves.

The most important of these is the maintenance of the virtual HFI table.
Although the HFI table should also be per package, since ITD/HFI related
MSRs are treated as per VM in KVM, we also treat the virtual HFI table
as per VM.


4.1. HFI table building
^^^^^^^^^^^^^^^^^^^^^^^

The HFI table contains a table header and many table entries. Each table
entry is identified by an HFI table index, and each CPU corresponds to
one of the HFI table indexes.

The ITD and HFI features both depend on the HFI table, but their HFI
tables are a little different. The HFI table provided by the ITD feature
has more classes (i.e. more columns in the table) than the HFI table of
the native HFI feature.

The virtual HFI table in KVM is built based on the actual HFI table,
which is maintained by the HFI instance in the HFI driver. We extract
the HFI data of the pCPUs on which the vCPUs are running to form the
virtual HFI table.


4.2. HFI table index
^^^^^^^^^^^^^^^^^^^^

There are many entries in the HFI table, and each vCPU is assigned an
HFI table index that specifies the entry it maps to. KVM fills the HFI
data of the pCPU that the vCPU is running on into the entry of the
virtual HFI table that corresponds to the vCPU's HFI table index.

This index is set by the VMM in CPUID.
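
To make the mapping in 4.1 and 4.2 more concrete, here is a simplified
user-space sketch of filling one virtual table entry from the Host
table. All struct and function names are hypothetical and do not match
the data structures in the patches; the 8-bit capability fields, the
MAX_CLASSES bound and the flat entry layout are assumptions made only
for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CLASSES 4   /* illustrative upper bound, not taken from the series */

/* One performance/energy-efficiency pair per class (assumed layout). */
struct hfi_class_caps {
	uint8_t perf_cap;
	uint8_t ee_cap;
};

struct hfi_entry {
	struct hfi_class_caps cls[MAX_CLASSES];
};

struct hfi_table {
	size_t nr_classes;   /* more classes when ITD is enabled */
	size_t nr_entries;
	struct hfi_entry *entries;
};

/*
 * Copy the Host entry of the pCPU a vCPU runs on into the vCPU's slot
 * in the virtual table. Returns 0 on success, -1 on a bad argument.
 */
static int vhfi_fill_entry(struct hfi_table *virt, size_t vcpu_index,
			   const struct hfi_table *host, size_t pcpu_index)
{
	if (vcpu_index >= virt->nr_entries || pcpu_index >= host->nr_entries)
		return -1;
	if (virt->nr_classes > MAX_CLASSES || virt->nr_classes > host->nr_classes)
		return -1;

	/* Clear the slot first so unused classes read as zero. */
	memset(&virt->entries[vcpu_index], 0, sizeof(struct hfi_entry));
	memcpy(virt->entries[vcpu_index].cls, host->entries[pcpu_index].cls,
	       virt->nr_classes * sizeof(struct hfi_class_caps));
	return 0;
}
```

Here vcpu_index plays the role of the HFI table index that the VMM sets
in CPUID, and pcpu_index stands for the Host entry of the pCPU the vCPU
is currently running on.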


4.3. HFI table updating
^^^^^^^^^^^^^^^^^^^^^^^

On some platforms, the HFI table is dynamically updated, and each update
is signalled by a thermal interrupt. In order to update the virtual HFI
table in time, we add a per-VM notifier to the HFI driver that notifies
KVM to update the virtual HFI table for the VM; KVM then injects a
thermal interrupt into the VM to notify the Guest.

The virtual HFI table also needs to be updated when a vCPU is migrated:
the pCPU it runs on changes, so the corresponding virtual HFI data must
be updated to the new pCPU's data. In this case, to reduce overhead, we
update only the data of that single vCPU without traversing the entire
virtual HFI table.
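
The two update paths can be sketched roughly as follows. This is a
simplified user-space model with hypothetical names and a single class
per entry; it is not the code in the patches. In particular, flagging an
interrupt after a migration only when the entry actually changed is an
assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_VCPUS 4   /* illustrative sizes */
#define NR_PCPUS 8

struct caps {
	uint8_t perf_cap;
	uint8_t ee_cap;
};

struct vm_hfi {
	struct caps virt[NR_VCPUS];     /* virtual table, one entry per vCPU */
	size_t vcpu_to_pcpu[NR_VCPUS];  /* pCPU each vCPU currently runs on */
	bool irq_pending;               /* thermal interrupt to inject */
};

/* Refresh one entry; returns true if the guest-visible data changed. */
static bool vhfi_update_one(struct vm_hfi *vm, const struct caps *host,
			    size_t vcpu)
{
	struct caps fresh = host[vm->vcpu_to_pcpu[vcpu]];
	bool changed = fresh.perf_cap != vm->virt[vcpu].perf_cap ||
		       fresh.ee_cap != vm->virt[vcpu].ee_cap;

	vm->virt[vcpu] = fresh;
	return changed;
}

/* Host table changed: walk every vCPU, then flag a thermal interrupt. */
static void vhfi_on_host_update(struct vm_hfi *vm, const struct caps *host)
{
	for (size_t i = 0; i < NR_VCPUS; i++)
		vhfi_update_one(vm, host, i);
	vm->irq_pending = true;
}

/* vCPU moved to another pCPU: touch only that vCPU's entry. */
static void vhfi_on_vcpu_migrate(struct vm_hfi *vm, const struct caps *host,
				 size_t vcpu, size_t new_pcpu)
{
	vm->vcpu_to_pcpu[vcpu] = new_pcpu;
	if (vhfi_update_one(vm, host, vcpu))
		vm->irq_pending = true;   /* assumption: notify only on change */
}
```

The point of the split is cost: the notifier path is O(number of vCPUs),
while the migration path touches a single entry.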


5. Patch Summary
================

Patch 01-03: Prepare the bit definition, the hfi helpers and hfi data
structures that KVM needs.
Patch 04-05: Add the sched_out arch hook and reset the classification
history at sched_in()/sched_out().
Patch 06-10: Add emulations of ACPI, TM and PTM, mainly about CPUID and
related MSRs.
Patch 11-20: Add the emulation support for HFI, including maintaining
the HFI table for VM.
Patch 21-23: Add the emulation support for ITD, including extending HFI
to ITD and passing through the classification MSRs.
Patch 24-25: Add HRESET emulation support, which is also used by IPC
classes feature.
Patch 26: Add the brief doc about the per-VM lock - pkg_therm_lock.


6. References
=============

[1]: [PATCH 0/9] thermal: intel: hfi: Prework for the virtualization of HFI
https://lore.kernel.org/lkml/20240203040515.23947-1-ricardo.neri-calderon@xxxxxxxxxxxxxxx/
[2]: [RFC 00/52] Introduce hybrid CPU topology,
https://lore.kernel.org/qemu-devel/20230213095035.158240-1-zhao1.liu@xxxxxxxxxxxxxxx/
[3]: [RFC 00/41] qom-topo: Abstract Everything about CPU Topology,
https://lore.kernel.org/qemu-devel/20231130144203.2307629-1-zhao1.liu@xxxxxxxxxxxxxxx/
[4]: SDM, vol. 3B, section 15.6 HARDWARE FEEDBACK INTERFACE AND INTEL
THREAD DIRECTOR


Thanks and Best Regards,
Zhao
---
Zhao Liu (17):
thermal: Add bit definition for x86 thermal related MSRs
KVM: Add kvm_arch_sched_out() hook
KVM: x86: Reset hardware history at vCPU's sched_in/out
KVM: VMX: Add helpers to handle the writes to MSR's R/O and R/WC0 bits
KVM: x86: cpuid: Define CPUID 0x06.eax by kvm_cpu_cap_mask()
KVM: VMX: Introduce HFI description structure
KVM: VMX: Introduce HFI table index for vCPU
KVM: x86: Introduce the HFI dynamic update request and kvm_x86_ops
KVM: VMX: Allow to inject thermal interrupt without HFI update
KVM: VMX: Emulate HFI related bits in package thermal MSRs
KVM: VMX: Emulate the MSRs of HFI feature
KVM: x86: Expose HFI feature bit and HFI info in CPUID
KVM: VMX: Extend HFI table and MSR emulation to support ITD
KVM: VMX: Pass through ITD classification related MSRs to Guest
KVM: x86: Expose ITD feature bit and related info in CPUID
KVM: VMX: Emulate the MSR of HRESET feature
Documentation: KVM: Add description of pkg_therm_lock

Zhuocheng Ding (9):
thermal: intel: hfi: Add helpers to build HFI/ITD structures
thermal: intel: hfi: Add HFI notifier helpers to notify HFI update
KVM: VMX: Emulate ACPI (CPUID.0x01.edx[bit 22]) feature
KVM: x86: Expose TM/ACC (CPUID.0x01.edx[bit 29]) feature bit to VM
KVM: VMX: Emulate PTM/PTS (CPUID.0x06.eax[bit 6]) feature
KVM: VMX: Support virtual HFI table for VM
KVM: VMX: Sync update of Host HFI table to Guest
KVM: VMX: Update HFI table when vCPU migrates
KVM: x86: Expose HRESET feature's CPUID to Guest

Documentation/virt/kvm/locking.rst | 13 +-
arch/arm64/include/asm/kvm_host.h | 1 +
arch/mips/include/asm/kvm_host.h | 1 +
arch/powerpc/include/asm/kvm_host.h | 1 +
arch/riscv/include/asm/kvm_host.h | 1 +
arch/s390/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/hfi.h | 28 ++
arch/x86/include/asm/kvm-x86-ops.h | 3 +-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/msr-index.h | 54 +-
arch/x86/kvm/cpuid.c | 201 +++++++-
arch/x86/kvm/irq.h | 1 +
arch/x86/kvm/lapic.c | 9 +
arch/x86/kvm/svm/svm.c | 8 +
arch/x86/kvm/vmx/vmx.c | 751 +++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 79 ++-
arch/x86/kvm/x86.c | 18 +
drivers/thermal/intel/intel_hfi.c | 212 +++++++-
drivers/thermal/intel/therm_throt.c | 1 -
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 1 +
21 files changed, 1343 insertions(+), 44 deletions(-)

--
2.34.1