Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

From: Yan Zhao
Date: Mon Sep 04 2023 - 03:31:34 EST


...
> > Actually, I don't even completely understand how you're seeing CoW behavior in
> > the first place. No sane guest should blindly read (or execute) uninitialized
> > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > QEMU doesn't support Hyper-V's enlightenment that lets the guest assume memory has
> > been zeroed by the hypervisor. If KSM is to blame, then my answer is to turn off
> > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > that KSM is wildly insecure for the guest, especially given the number of speculative
> > execution attacks these days).
> I'm running a Linux guest.
> KSM is not turned on in either the guest or the host.
> Both guest and host have transparent huge pages turned on.
>
> The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> which causes
> (1) KVM to first send a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a
> zero-pfn mapped, and
> (2) KVM to call get_user_page_fast_only() with FOLL_WRITE (as the memslot is
> writable), which fails.
>
> The guest then writes the GFN.
> This step triggers a huge PMD split (in the huge page case) and .change_pte().
>
> My guest is surely a sane guest, but currently I can't find out why certain
> pages are read before being written.
> I'll get back to you with the reason after figuring it out, after my long vacation.
Finally I figured out the reason.

Except for 4 pages that were read before being written by the vBIOS (I'll skip
figuring out why the vBIOS does this), the remaining thousands of pages were
read before being written by the guest Linux kernel.

If the guest kernel is configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
"CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or is booted with the parameter
"init_on_alloc=1" or "init_on_free=1", this read-before-write problem goes away.

However, turning on those configs has side effects, as stated in the Kconfig
help text:
"all page allocator and slab allocator memory will be zeroed when allocated,
eliminating many kinds of "uninitialized heap memory" flaws, especially
heap content exposures. The performance impact varies by workload, but most
cases see <1% impact. Some synthetic workloads have measured as high as 7%."

Without the above two configs, or with init_on_alloc=0 and init_on_free=0, the
root cause of all the reads of uninitialized heap memory is the handling of
page cache pages for the guest's virtual devices (specifically the virtual IDE
device in my case).

When the guest kernel triggers readahead on a guest block device, pages are
allocated as page cache pages, and requests to read disk data into the page
cache are issued.
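
The rough readahead call path in the guest is (simplified; function names vary
by kernel version):

	page_cache_ra_unbounded()
	    --> filemap_alloc_folio()  /* allocates the page; content uninitialized */
	    --> read_pages() --> ... --> submit_bio()  /* issues the disk read */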

The disk read requests cause dma_direct_map_page() to be called if vIOMMU is
not enabled. Then, because the virtual IDE device can only directly access
32-bit DMA addresses (equal to GPAs) at maximum, swiotlb is used as a DMA
bounce buffer for page cache pages whose GPAs are above 32 bits.

Then the call path is
dma_direct_map_page() --> swiotlb_map() --> swiotlb_tbl_map_single()
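
The bounce decision is roughly the below (gist of dma_direct_map_page();
simplified, not verbatim):

	static inline dma_addr_t dma_direct_map_page(struct device *dev,
			struct page *page, unsigned long offset, size_t size,
			enum dma_data_direction dir, unsigned long attrs)
	{
		phys_addr_t phys = page_to_phys(page) + offset;
		dma_addr_t dma_addr = phys_to_dma(dev, phys);

		/* the virtual IDE device's DMA mask is 32 bits, so page
		 * cache pages with GPA above 4G fail this check and get
		 * bounced through swiotlb */
		if (unlikely(!dma_capable(dev, dma_addr, size, true)))
			return swiotlb_map(dev, phys, size, dir, attrs);
		return dma_addr;
	}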

In swiotlb_tbl_map_single(), even though the DMA direction is DMA_FROM_DEVICE,
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE) is called to copy
the page cache content into the bounce buffer first.
Then, during device DMA, the device content is DMAed into the bounce buffer.
After that, the bounce buffer data is copied back to the page cache page via
dma_direct_unmap_page() --> swiotlb_tbl_unmap_single().
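
The copy-back on the unmap side is roughly (gist of swiotlb_tbl_unmap_single(),
simplified):

	/* the device DMAed data into the bounce buffer; copy it back
	 * to the original (page cache) page */
	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);

This copy-back is the write that the host sees right after the read.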


IOW, before device data is read ahead into the page cache, the page cache page
is read and copied into the bounce buffer, even though that read is immediately
followed by a write copying the bounce buffer data back to the page cache.

This explains why the host observes that most pages are written immediately
after they are read, and why .change_pte() fires heavily during guest boot-up
and nested guest boot-up -- when disk readahead happens abundantly.

The reason for this unconditional read of the page into the bounce buffer
(the "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)" call)
is explained in the code:

/*
 * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
 * to the tlb buffer, if we knew for sure the device will
 * overwrite the entire current content. But we don't. Thus
 * unconditional bounce may prevent leaking swiotlb content (i.e.
 * kernel memory) to user-space.
 */

If we neglect this risk and make a change like

-	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
+	if (dir != DMA_FROM_DEVICE)
+		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);

the issue of pages being read before written by the guest kernel goes away.

I don't think it's a swiotlb bug: to prevent leaking swiotlb content, if the
target page content is not first copied into the swiotlb bounce buffer, then
the bounce buffer would need to be initialized to 0. However,
swiotlb_tbl_map_single() does not know whether the target page is initialized
or not, and zero-initializing would clobber page content whenever the device
does not overwrite the entire buffer.
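
I.e. the hypothetical alternative (not something I'm proposing) would be:

	/* hypothetical: zero the bounce buffer instead of bouncing the
	 * original page content (tlb_vaddr stands in for the kernel
	 * mapping of tlb_addr) */
	memset(tlb_vaddr, 0, mapping_size);

but then, if the device overwrites only part of the bounce buffer, the
zero-filled remainder is copied back over valid page content at unmap time --
data loss instead of a data leak.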

>
> >
> > If there's something else going on, i.e. if your VM really is somehow generating
> > reads before writes, and if we really want to optimize use cases that can't use
> > hugepages for whatever reason, I would much prefer to do something like add a
> > memslot flag to state that the memslot should *always* be mapped writable. Because
> Will check if this flag is necessary after figuring out the reason.
As explained above, I think it's a valid and not-rare practice for a guest
kernel to cause reads of uninitialized heap memory. And the host admin may not
know exactly when it's appropriate to apply the memslot flag.
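
For reference, by the "always write_fault = true" solution I mean roughly the
below (a simplified sketch of the fault-path change, not the exact patch;
memslot_is_readonly() is the existing helper in kvm_main.c):

	/* In the fault path, before __gfn_to_pfn_memslot(): request a
	 * writable mapping whenever the memslot allows writes, even for
	 * read faults, so the first guest read doesn't install a
	 * read-only zero-page mapping that .change_pte() has to fix up.
	 */
	if (!memslot_is_readonly(slot))
		write_fault = true;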

Do you think it would be good to enable the "always write_fault = true" solution by default?