Re: [PATCH 0/3] KVM: x86: KVM_MEM_PCI_HOLE memory

From: Vitaly Kuznetsov
Date: Thu Aug 06 2020 - 05:22:45 EST


"Michael S. Tsirkin" <mst@xxxxxxxxxx> writes:

> On Tue, Jul 28, 2020 at 04:37:38PM +0200, Vitaly Kuznetsov wrote:
>> This is a continuation of "[PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES
>> memory" work:
>> https://lore.kernel.org/kvm/20200514180540.52407-1-vkuznets@xxxxxxxxxx/
>> and pairs with Julia's "x86/PCI: Use MMCONFIG by default for KVM guests":
>> https://lore.kernel.org/linux-pci/20200722001513.298315-1-jusual@xxxxxxxxxx/
>>
>> PCIe config space can (depending on the configuration) be quite big but
>> usually is sparsely populated. Guest may scan it by accessing individual
>> device's page which, when device is missing, is supposed to have 'pci
>> hole' semantics: reads return '0xff' and writes get discarded.
>>
>> When testing Linux kernel boot with QEMU q35 VM and direct kernel boot
>> I observed 8193 accesses to PCI hole memory. When such exit is handled
>> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
>> Handling the same exit in userspace is six times slower (0.000006 sec) so
>> the overal; difference is 0.04 sec. This may be significant for 'microvm'
>> ideas.
>>
>> Note, the same speed can already be achieved by using KVM_MEM_READONLY
>> but doing this would require allocating real memory for all missing
>> devices and e.g. 8192 pages gives us 32mb. This will have to be allocated
>> for each guest separately and for 'microvm' use-cases this is likely
>> a no-go.
>>
>> Introduce special KVM_MEM_PCI_HOLE memory: userspace doesn't need to
>> back it with real memory, all reads from it are handled inside KVM and
>> return '0xff'. Writes still go to userspace but these should be extremely
>> rare.
>>
>> The original 'KVM_MEM_ALLONES' idea had additional optimizations: KVM
>> was mapping all 'PCI hole' pages to a single read-only page stuffed with
>> 0xff. This is omitted in this submission as the benefits are unclear:
>> KVM will have to allocate SPTEs (either on demand or aggressively) and
>> this also consumes time/memory.
>
> Curious about this: if we do it aggressively on the 1st fault,
> how long does it take to allocate 256 huge page SPTEs?
> And the amount of memory seems pretty small then, right?

Right, this could work but we'll need a 2M region (one per KVM host of
course) filled with 0xff-s instead of a single 4k page.

Generally, I'd like to reach an agreement on whether this feature (and
the corresponding Julia's patch addding PV feature bit) is worthy. In
case it is (meaning it gets merged in this simplest form), we can
suggest further improvements. It would also help if firmware (SeaBIOS,
OVMF) would start recognizing the PV feature bit too, this way we'll be
seeing even bigger improvement and this may or may not be a deal-breaker
when it comes to the 'aggressive PTE mapping' idea.

--
Vitaly