Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

From: Nadav Amit
Date: Mon Oct 03 2022 - 20:20:00 EST


On Oct 3, 2022, at 4:51 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:

> ⚠ External Email
>
> On October 3, 2022 2:28:40 PM PDT, Nadav Amit <namit@xxxxxxxxxx> wrote:
>> On Oct 3, 2022, at 2:06 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>>
>>> ⚠ External Email
>>>
>>> On October 3, 2022 10:34:15 AM PDT, Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
>>>>
>>>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>>>> better, introducing a new global var is our 'last resort' and should be
>>>>> avoided whenever possible. Alternatively, you can add a
>>>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>>>> hypervisor_x86' but I'm unsure if it's better.
>>>>>
>>>>> Also, please check Alex' question/suggestion.
>>>>
>>>> Here is my take (and Ajay knows probably more than me):
>>>>
>>>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>>>> The two options are either to use a reserved field (which who knows, might
>>>> be used one day) or some OEM ID. I am also not familiar with
>>>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>>>>
>>>> Anyhow, I understand (although not relate) to the objection for a new global
>>>> variable. How about explicitly calling this hardware bug a “bug” and using
>>>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>>>> can to resolve it.
>>>>
>>>> IOW, how about doing something along the lines of (not tested):
>>>>
>>>>
>>>> -- >8 --
>>>>
>>>> Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor
>>>>
>>>> ---
>>>> arch/x86/include/asm/cpufeatures.h | 1 +
>>>> arch/x86/kernel/cpu/common.c | 2 ++
>>>> arch/x86/kernel/cpu/vmware.c | 2 ++
>>>> arch/x86/pci/common.c | 6 ++++--
>>>> 4 files changed, 9 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>>>> index ef4775c6db01..216b6f357b6d 100644
>>>> --- a/arch/x86/include/asm/cpufeatures.h
>>>> +++ b/arch/x86/include/asm/cpufeatures.h
>>>> @@ -460,5 +460,6 @@
>>>> #define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
>>>> #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
>>>> #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
>>>> +#define X86_BUG_ECAM_MMIO X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */
>>>>
>>>> #endif /* _ASM_X86_CPUFEATURES_H */
>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>>> index 3e508f239098..c94175fa304b 100644
>>>> --- a/arch/x86/kernel/cpu/common.c
>>>> +++ b/arch/x86/kernel/cpu/common.c
>>>> @@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
>>>> {
>>>> u64 ia32_cap = x86_read_arch_cap_msr();
>>>>
>>>> + setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
>>>> +
>>>> /* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
>>>> if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
>>>> !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
>>>> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
>>>> index 02039ec3597d..8903776284a6 100644
>>>> --- a/arch/x86/kernel/cpu/vmware.c
>>>> +++ b/arch/x86/kernel/cpu/vmware.c
>>>> @@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
>>>> setup_force_cpu_cap(X86_FEATURE_VMCALL);
>>>> else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
>>>> setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
>>>> +
>>>> + setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
>>>> }
>>>>
>>>> static void __init vmware_platform_setup(void)
>>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>>> index ddb798603201..bc81cf4c1014 100644
>>>> --- a/arch/x86/pci/common.c
>>>> +++ b/arch/x86/pci/common.c
>>>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>>> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>> int reg, int len, u32 *val)
>>>> {
>>>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>>>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>>>> + (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>>> return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>>> if (raw_pci_ext_ops)
>>>> return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>> int reg, int len, u32 val)
>>>> {
>>>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>>>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>>>> + (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>>> return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>>> if (raw_pci_ext_ops)
>>>> return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>>
>>> Also... any reason we can't just set raw_pci_ops == raw_ext_pci_ops for the case when the latter is preferred, and dispense with the conditionals in the use path? Similarly, raw_ext_pci_ops could be pointed to error routines instead of left at NULL.
>>
>> I understood from Ajay that the initialization of raw_ext_pci_ops can be
>> done after the hypervisor initialization takes place, so doing what exactly
>> what you proposed by is not possible. It can probably be resolved, but I do
>> not think the end result would be simpler or cleaner. I think that the “bug”
>> solution really conveys the behavior.
>>
>> IIUC performance would not be noticeable affected by 2 more conditional
>> branches.
>
> Isn't that exactly what you would want?!?

Two branches (which are probably mostly predicted correctly) are
inexpensive (nanoseconds?)

Causing a VM-exit, and the whole mess of handling it in the hypervisor
(potentially the userspace part of the hypervisor) is expensive
(microseconds). IIUC, Ajay wants to let reads to pass through to memory,
avoiding these VM-exits.