Trouble with perf_guest_get_msrs() and pebs_no_isolation

From: Jim Mattson
Date: Tue Feb 09 2021 - 20:25:49 EST


On a host that suffers from pebs_no_isolation, perf_guest_get_msrs()
adds an entry to cpuc->guest_switch_msrs for
MSR_IA32_PEBS_ENABLE. Kvm's atomic_switch_perf_msrs() is the only
caller of perf_guest_get_msrs(). If atomic_switch_perf_msrs() finds an
entry for MSR_IA32_PEBS_ENABLE in cpuc->guest_switch_msrs, it puts the
"host" value of this entry on the VM-exit MSR-load list for the
current VMCS. At the next VM-exit, that "host" value will be written
to MSR_IA32_PEBS_ENABLE.

The problem is that by the next VM-exit, that "host" value may be
stale. Though maskable interrupts are blocked from well before
atomic_switch_perf_msrs() to the next VM-entry, PMIs are delivered as
NMIs. Moreover, due to PMI throttling, the PMI handler may clear bits
in MSR_IA32_PEBS_ENABLE. See the comment to that effect in
handle_pmi_common(). In fact, by the time that perf_guest_get_msrs()
returns to its caller, the "host" value that it has recorded for
MSR_IA32_PEBS_ENABLE could already be stale.

What happens if a VM-exit sets a bit in MSR_IA32_PEBS_ENABLE that the
perf subsystem thinks should be clear? In the short term, nothing
happens at all. But note that this situation may not get better at the
next VM-exit, because kvm won't add MSR_IA32_PEBS_ENABLE to the
VM-exit MSR-load list if perf_guest_get_mrs() reports that the "host"
value of the MSR is 0. So, if the new MSR_IA32_PEBS_ENABLE "host"
value is 0, the stale bits can actually persist for a long time.

If, at some point in the future, the perf subsystem programs a counter
overflow interrupt on the same PMC for a PEBS-capable event, we are in
for some nasty surprises. (Note that the perf subsystem never
*intentionally* programs a PMC for both PEBS and counter overflow
interrupts at the same time.)

If a PEBS assist is triggered while in VMX non-root operation, the CPU
will try to access the host's DS_AREA using the guest's page tables,
and a page fault is likely (specifically on a read of the PEBS
interrupt threshold at offset 0x38 in the DS_AREA).

If the PEBS interrupt threshold is met while in VMX root operation,
two separate PMIs are generated: one for the PEBS interrupt threshold
and one for the counter overflow. This results in a message from
unknown_nmi_error(): "Uhhuh. NMI received for unknown reason <xx> on
CPU <n>."