[PATCH 00/54] KVM: x86/mmu: Bug fixes and summer cleaning

From: Sean Christopherson
Date: Tue Jun 22 2021 - 13:58:00 EST


I missed spring by a few days...

This gigantic snowball got rolling when I hit the WARN that guards against
setting bits 63:32 in SPTEs on 32-bit builds (patch 01). The WARN is a
boneheaded mistake on my part as the whole point of EPT is to avoid the
mess that is IA32 paging.

I added a better variant to WARN if KVM attempts to set _any_ reserved bits
in its SPTEs (patch 48). Unfortunately, the WARN worked too well and fired
on variety of configurations. The patches in between are a mix of bug fixes,
cleanups, and documentation updates to get KVM to the point where the WARN
can be added without causing explosions, and to fix/document the numerous
issues/gotchas I found along the way.

The meat of this series is big refactoring of the MMU configuration code to
fix nested NPT, which I discovered after writing a test to exercise the
new reserved bit WARN. With nested NPT, vCPU state is not guaranteed to
reflect vmcb01 state (though in practice the bug is limited to
KVM_SET_NESTED_STATE, i.e. live migration). KVM passes in the L1 CR0, CR4,
and EFER values, which the MMU takes into consideration for the mmu_role
and then promptly ignores for all other calculations, e.g. reserved bits.

The approach for solving the nested NPT mess, and a variety of other minor
bugs of similar nature, is to take "all" state from the MMU context itself
instead of the vCPU. None of the refactoring patches are particularly
interesting, there's just a lot of them because so much code uses the vCPU
instead of the correct state.

I have kvm-unit-tests for the SMEP, NX, and LA57 (my personal favorite) bugs
that I'll post separately. Ditto for a selftest for recomputing the mmu_role
on CPUID updates.

I don't have a standalone test for nested NPT mmu_role changes; adding a
meaningful test mixed with KVM_SET_NESTED_STATE is a bigger lift. To test
that mess, I randomized vCPU state prior to initializing the nested NPT MMU
and ran kvm-unit-tests. E.g. Without the mmu_role changes this causes a
number of unit test failures:

vcpu->arch.cr0 = get_random_long();
vcpu->arch.cr4 = get_random_long();
vcpu->arch.efer = get_random_long();

kvm_init_shadow_npt_mmu(vcpu, X86_CR0_PG, svm->vmcb01.ptr->save.cr4,
svm->vmcb01.ptr->save.efer,
svm->nested.ctl.nested_cr3);


Patch 01 is the only patch that is remotely 5.13 worthy, and even then
only because it's about as safe as a patch can be. Everything else is far
from urgent as these bugs have existed for quite some time.

I labeled the "sections" of this mess in the shortlog below.

P.S. Does anyone know how PKRU interacts with NPT? I assume/hope NPT
accesses, which are always "user", ignore PKRU, but the APM doesn't
say a thing. If PKRU is ignored, KVM has some fixing to do. If PKRU
isn't ignored, AMD has some fixing to do :-)

P.S.S. This series pulled in one patch from my vCPU RESET/INIT series,
"Properly reset MMU context at vCPU RESET/INIT", as that was needed
to fix a root_level bug on VMX. My goal is to get the RESET/INIT
series refreshed later this week and thoroughly bombard everyone.


Sean Christopherson (54):

-- bug fixes --
KVM: x86/mmu: Remove broken WARN that fires on 32-bit KVM w/ nested
EPT
KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUs
KVM: x86: Properly reset MMU context at vCPU RESET/INIT
KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT
walk
Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"
KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified
KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is
broken
Revert "KVM: MMU: record maximum physical address width in
kvm_mmu_extended_role"
KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at
GFN

-- cleanups --
KVM: x86/mmu: Replace EPT shadow page shenanigans with simpler check
KVM: x86/mmu: WARN and zap SP when sync'ing if MMU role mismatches
KVM: x86/mmu: Drop the intermediate "transient" __kvm_sync_page()
KVM: x86/mmu: Rename unsync helper and update related comments

-- bug fixes --
KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER
KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU
state
KVM: x86/mmu: Drop smep_andnot_wp check from "uses NX" for shadow MMUs
KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper

-- nested NPT / mmu_role refactoring --
KVM: x86/mmu: Move nested NPT reserved bit calculation into MMU proper
KVM: x86/mmu: Grab shadow root level from mmu_role for shadow MMUs
KVM: x86/mmu: Add struct and helpers to retrieve MMU role bits from
regs
KVM: x86/mmu: Consolidate misc updates into shadow_mmu_init_context()
KVM: x86/mmu: Ignore CR0 and CR4 bits in nested EPT MMU role
KVM: x86/mmu: Use MMU's role_regs, not vCPU state, to compute mmu_role
KVM: x86/mmu: Rename "nxe" role bit to "efer_nx" for macro shenanigans
KVM: x86/mmu: Add helpers to query mmu_role bits
KVM: x86/mmu: Do not set paging-related bits in MMU role if CR0.PG=0
KVM: x86/mmu: Set CR4.PKE/LA57 in MMU role iff long mode is active
KVM: x86/mmu: Always Set new mmu_role immediately after checking old
role
KVM: x86/mmu: Don't grab CR4.PSE for calculating shadow reserved bits
KVM: x86/mmu: Use MMU's role to get CR4.PSE for computing rsvd bits
KVM: x86/mmu: Drop vCPU param from reserved bits calculator
KVM: x86/mmu: Use MMU's role to compute permission bitmask
KVM: x86/mmu: Use MMU's role to compute PKRU bitmask
KVM: x86/mmu: Use MMU's roles to compute last non-leaf level
KVM: x86/mmu: Use MMU's role to detect EFER.NX in guest page walk
KVM: x86/mmu: Use MMU's role/role_regs to compute context's metadata
KVM: x86/mmu: Use MMU's role to get EFER.NX during MMU configuration
KVM: x86/mmu: Drop "nx" from MMU context now that there are no readers
KVM: x86/mmu: Get nested MMU's root level from the MMU's role
KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
KVM: x86/mmu: Add helper to update paging metadata
KVM: x86/mmu: Add a helper to calculate root from role_regs
KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
KVM: x86/mmu: Use MMU's role to determine PTTYPE

-- finally, the new WARN!
KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic
MMU
KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE

-- more cleanups --
KVM: x86: Enhance comments for MMU roles and nested transition
trickiness
KVM: x86/mmu: Optimize and clean up so called "last nonleaf level"
logic
KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault

-- RFC-ish "fix" --
KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP
is on


Sean Christopherson (54):
KVM: x86/mmu: Remove broken WARN that fires on 32-bit KVM w/ nested
EPT
KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUs
KVM: x86: Properly reset MMU context at vCPU RESET/INIT
KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT
walk
Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"
KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified
KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is
broken
Revert "KVM: MMU: record maximum physical address width in
kvm_mmu_extended_role"
KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at
GFN
KVM: x86/mmu: Replace EPT shadow page shenanigans with simpler check
KVM: x86/mmu: WARN and zap SP when sync'ing if MMU role mismatches
KVM: x86/mmu: Drop the intermediate "transient" __kvm_sync_page()
KVM: x86/mmu: Rename unsync helper and update related comments
KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER
KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU
state
KVM: x86/mmu: Drop smep_andnot_wp check from "uses NX" for shadow MMUs
KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper
KVM: x86/mmu: Move nested NPT reserved bit calculation into MMU proper
KVM: x86/mmu: Grab shadow root level from mmu_role for shadow MMUs
KVM: x86/mmu: Add struct and helpers to retrieve MMU role bits from
regs
KVM: x86/mmu: Consolidate misc updates into shadow_mmu_init_context()
KVM: x86/mmu: Ignore CR0 and CR4 bits in nested EPT MMU role
KVM: x86/mmu: Use MMU's role_regs, not vCPU state, to compute mmu_role
KVM: x86/mmu: Rename "nxe" role bit to "efer_nx" for macro shenanigans
KVM: x86/mmu: Add helpers to query mmu_role bits
KVM: x86/mmu: Do not set paging-related bits in MMU role if CR0.PG=0
KVM: x86/mmu: Set CR4.PKE/LA57 in MMU role iff long mode is active
KVM: x86/mmu: Always Set new mmu_role immediately after checking old
role
KVM: x86/mmu: Don't grab CR4.PSE for calculating shadow reserved bits
KVM: x86/mmu: Use MMU's role to get CR4.PSE for computing rsvd bits
KVM: x86/mmu: Drop vCPU param from reserved bits calculator
KVM: x86/mmu: Use MMU's role to compute permission bitmask
KVM: x86/mmu: Use MMU's role to compute PKRU bitmask
KVM: x86/mmu: Use MMU's roles to compute last non-leaf level
KVM: x86/mmu: Use MMU's role to detect EFER.NX in guest page walk
KVM: x86/mmu: Use MMU's role/role_regs to compute context's metadata
KVM: x86/mmu: Use MMU's role to get EFER.NX during MMU configuration
KVM: x86/mmu: Drop "nx" from MMU context now that there are no readers
KVM: x86/mmu: Get nested MMU's root level from the MMU's role
KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
KVM: x86/mmu: Add helper to update paging metadata
KVM: x86/mmu: Add a helper to calculate root from role_regs
KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
KVM: x86/mmu: Use MMU's role to determine PTTYPE
KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic
MMU
KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
KVM: x86: Enhance comments for MMU roles and nested transition
trickiness
KVM: x86/mmu: Optimize and clean up so called "last nonleaf level"
logic
KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP
is on

Documentation/virt/kvm/api.rst | 11 +-
Documentation/virt/kvm/mmu.rst | 7 +-
arch/x86/include/asm/kvm_host.h | 71 ++-
arch/x86/kvm/cpuid.c | 6 +-
arch/x86/kvm/mmu.h | 18 +-
arch/x86/kvm/mmu/mmu.c | 648 +++++++++++-----------
arch/x86/kvm/mmu/mmu_internal.h | 3 +-
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 68 ++-
arch/x86/kvm/mmu/spte.c | 22 +-
arch/x86/kvm/mmu/spte.h | 32 ++
arch/x86/kvm/svm/nested.c | 10 +-
arch/x86/kvm/vmx/nested.c | 1 +
arch/x86/kvm/x86.c | 26 +-
arch/x86/kvm/x86.h | 10 -
tools/lib/traceevent/plugins/plugin_kvm.c | 4 +-
16 files changed, 530 insertions(+), 409 deletions(-)

--
2.32.0.288.g62a8d224e6-goog