[PATCH V4 0/4] KVM: x86: Make bus clock frequency for vAPIC timer configurable

From: Reinette Chatre
Date: Thu Mar 21 2024 - 12:46:26 EST


(I am helping Isaku for a bit by submitting the next versions of this
series.)

Changes from v3:
- v3: https://lore.kernel.org/all/cover.1702974319.git.isaku.yamahata@xxxxxxxxx/
- Reworked all changelogs.
- Regarding code changes: patches #1 and #2 are unchanged, patch #3 was
reworked according to Sean's guidance, and patch #4 (the test)
needed changes after rebase to kvm-x86/next and the implementation
(patch #3) changes.
- Added Reviewed-by tags to patches #1, #2, and #4, but removed
Maxim's Reviewed-by tag from patch #3 because it changed so much.
- Added a "Suggested-by" to patch #4 to reflect that it represents
Sean's guidance.
- Reworked cover to match customs (in subject line) and reflect feedback
to patches: capability renamed to KVM_CAP_X86_APIC_BUS_FREQUENCY, clarify
distinction between "core crystal clock" and "bus clock", and update
pro/con list.
- Please refer to individual patches for detailed changes.

Changes from v2:
- v2: https://lore.kernel.org/lkml/cover.1699936040.git.isaku.yamahata@xxxxxxxxx/
- Removed APIC_BUS_FREQUENCY and apic_bus_frequency of struct kvm-arch.
- Update the commit messages.
- Added reviewed-by (Maxim Levitsky)
- Use 1.5GHz instead of 1GHz as frequency for the test case.

Changes from v1:
- v1: https://lore.kernel.org/all/cover.1699383993.git.isaku.yamahata@xxxxxxxxx/
- Added a test case
- Fix a build error for i386 platform
- Add check if vcpu isn't created.
- Add check if lapic chip is in-kernel emulation.
- Updated api.rst

Summary
-------
Add KVM_CAP_X86_APIC_BUS_FREQUENCY capability to configure the APIC
bus clock frequency for APIC timer emulation.
Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_FREQUENCY) to set the
frequency in nanoseconds. When using this capability, the user space
VMM should configure CPUID leaf 0x15 to advertise the frequency.

Description
-----------
Vishal reported [1] that the TDX guest kernel expects a 25MHz APIC bus
frequency but ends up getting interrupts at a significantly higher rate.

The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.

In addition, per Intel SDM:
"The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is
enumerated in CPUID leaf 0x15) divided by the value specified in
the divide configuration register."

The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.

The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user
space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is
enumerated, the guest kernel uses it as the APIC bus frequency. If not,
the guest kernel measures the frequency based on other known timers like
the ACPI timer or the legacy PIT. As reported in [1] the TDX guest kernel
expects a 25MHz timer frequency but gets timer interrupt more frequently
due to the 1GHz frequency used by KVM.

To ensure that the guest doesn't have a conflicting view of the APIC bus
frequency, allow the userspace to tell KVM to use the same frequency that
TDX mandates instead of the default 1Ghz.

There are several options to address this:
1. Make the KVM able to configure APIC bus frequency (this series).
Pro: It resembles the existing hardware. The recent Intel CPUs
adopts 25MHz.
Con: Require the VMM to emulate the APIC timer at 25MHz.
2. Make the TDX architecture enumerate CPUID leaf 0x15 to configurable
frequency or not enumerate it.
Pro: Any APIC bus frequency is allowed.
Con: Deviates from TDX architecture.
3. Make the TDX guest kernel use 1GHz when it's running on KVM.
Con: The kernel ignores CPUID leaf 0x15.
4. Change CPUID leaf 0x15 under TDX to report the crystal clock frequency
as 1 GHz.
Pro: This has been the virtual APIC frequency for KVM guests for 13
years.
Pro: This requires changing only one hard-coded constant in TDX.
Con: It doesn't work with other VMMs as TDX isn't specific to KVM.
Con: Core crystal clock frequency is also used to calculate TSC frequency.
Con: If it is configured to value different from hardware, it will
break the correctness of INTEL-PT Mini Time Count (MTC) packets
in TDs.

[1] https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@xxxxxxxxxx/

Isaku Yamahata (4):
KVM: x86: hyper-v: Calculate APIC bus frequency for Hyper-V
KVM: x86: Make nsec per APIC bus cycle a VM variable
KVM: x86: Add a capability to configure bus frequency for APIC timer
KVM: selftests: Add test for configure of x86 APIC bus frequency

Documentation/virt/kvm/api.rst | 17 ++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/hyperv.c | 3 +-
arch/x86/kvm/lapic.c | 6 +-
arch/x86/kvm/lapic.h | 3 +-
arch/x86/kvm/x86.c | 28 +++
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/apic.h | 7 +
.../kvm/x86_64/apic_bus_clock_test.c | 166 ++++++++++++++++++
10 files changed, 228 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c


base-commit: 964d0c614c7f71917305a5afdca9178fe8231434
--
2.34.1