Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

From: David Hildenbrand
Date: Wed Mar 20 2024 - 08:56:59 EST


On 20.03.24 01:50, Sean Christopherson wrote:
> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> i.e. when creating mappings for KVM guests, instead of when zapping or
> modifying SPTEs, e.g. when dropping mappings.
> 
> The motivation is twofold:
> 
> 1. Marking folios dirty and accessed when zapping can be extremely
>    expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
>    512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
>    dirty and accessed for all 512*512 SPTEs.
> 
> 2. x86 diverges from literally every other architecture, which updates
>    folios when mappings are created. AFAIK, x86 is unique in that it's
>    the only KVM arch that prefetches PTEs, so it's not quite an apples-
>    to-apples comparison, but I don't see any reason for the dirty logic
>    in particular to be different.

Already sorry for the lengthy reply.


On "ordinary" process page tables on x86, it behaves as follows:

1) A page might be mapped writable but the PTE might not be dirty. Once
written to, HW will set the PTE dirty bit.

2) A page might be mapped but the PTE might not be young. Once accessed,
HW will set the PTE young bit.

3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
the folio (folio_mark_accessed()). The latter is done conditionally
only (vma_has_recency()).

BUT, when zapping an anon folio, we don't do that, because for anon folios zapping implies "gone for good" and not "content must go to a file".

4) When temporarily unmapping a folio for migration/swapout, we
primarily only move the dirty PTE bit to the folio.
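Points 3 and 4 can be sketched as a toy model. This is not kernel code: the structs and the `zap_present_pte()` helper are simplified stand-ins for the real `zap_present_folio_ptes()` path, and `vma_has_recency()` is modeled as a plain predicate.

```c
/* Toy model (not kernel code) of the zap-time transfer described above:
 * PTE dirty/young bits move to the folio when a mapping is torn down,
 * with the young transfer gated on vma_has_recency(), and anon folios
 * skipped entirely because their contents are gone for good. */
#include <assert.h>
#include <stdbool.h>

struct pte   { bool present, dirty, young; };
struct folio { bool dirty, accessed; };

/* Stand-in for vma_has_recency(); assume a VMA with recency here. */
static bool vma_has_recency(void) { return true; }

static void zap_present_pte(struct pte *pte, struct folio *folio,
                            bool folio_is_anon)
{
    if (!folio_is_anon) {
        if (pte->dirty)
            folio->dirty = true;        /* folio_mark_dirty() */
        if (pte->young && vma_has_recency())
            folio->accessed = true;     /* folio_mark_accessed() */
    }
    pte->present = false;
}
```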


GUP is different, because the PTEs might change after we pinned the page and wrote to it. We don't modify the PTEs and expect the GUP user to do the right thing (set dirty/accessed). For example, unpin_user_pages_dirty_lock() would mark the page dirty when unpinning, where the PTE might long be gone.

So GUP does not really behave like HW access.
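The GUP pattern above can be modeled in miniature. Again not kernel code: `unpin_user_page_dirty_lock()` here is a single-page toy modeled on the real `unpin_user_pages_dirty_lock()`, to show that the *GUP user* marks the page dirty at unpin time, when the PTE may be long gone.

```c
/* Toy model (not kernel code): with GUP, the page-table walker does
 * not track the eventual write; the GUP user marks the page dirty when
 * it drops the pin, independent of any PTE state. */
#include <assert.h>
#include <stdbool.h>

struct page { int pincount; bool dirty; };

static void unpin_user_page_dirty_lock(struct page *page, bool make_dirty)
{
    if (make_dirty)
        page->dirty = true;   /* set_page_dirty_lock() in the kernel */
    page->pincount--;
}
```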


Secondary page tables are different from ordinary GUP, and KVM ends up using GUP to some degree to simulate HW access; regarding NUMA hinting, KVM already turned out to be very different from all other GUP users. [1]

And I recall that at some point I raised that we might want to have a dedicated interface for these "mmu-notifier"-based page table synchronization mechanisms.

But KVM ends up setting folio dirty/accessed flags itself, like other GUP users. I do wonder if secondary page tables should be messing with folio flags *at all*, and if there would be ways to do it differently using PTEs.

We make sure to synchronize the secondary page tables to the process page tables using MMU notifiers: when we write-protect/unmap a PTE, we write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely differently.


I once had the following idea, but I am not sure about all implications, just wanted to raise it because it matches the topic here:

Secondary page tables kind-of behave like "HW" access. If there is a write access, we would expect the original PTE to become dirty, not the mapped folio.

1) When KVM wants to map a page into the secondary page table, we
require the PTE to be young (like a HW access). The SPTE can remain
old.

2) When KVM wants to map a page writable into the secondary page table,
we require the PTE to be dirty (like a HW access). The SPTE can
remain old.

3) When core MM clears the PTE dirty/young bit, we notify the secondary
page table to adjust: for example, if the dirty bit gets cleared,
the page cannot be writable in the secondary MMU.

4) GUP-fast cannot set the PTE dirty/young, so we would fall back to slow
   GUP, where we hold the PTL, and simply modify the PTE to have the
   accessed/dirty bit set.

5) Prefetching would similarly be limited to that (only prefetch if PTE
is already dirty etc.).

6) Dirty/accessed bits no longer have to be synced from the secondary
   page table to the process page table, because an SPTE being dirty
   implies that the PTE is dirty.
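Points 1, 2, 4 and 6 above can be sketched together. This is a hypothetical model, not existing kernel code: `gup_fast_ok()`, `gup_slow_touch()` and `secondary_mmu_map()` are invented names for the idea that the secondary MMU only maps a page whose PTE already carries the matching young/dirty state, with slow GUP setting those bits under the PTL as the fallback.

```c
/* Hypothetical sketch (not kernel code) of the scheme above: a mapping
 * into the secondary page table requires the PTE to already be young
 * (and dirty, for writable mappings), so the SPTE can stay old/clean
 * and nothing needs syncing back later. */
#include <assert.h>
#include <stdbool.h>

struct pte { bool present, dirty, young; };

/* GUP-fast cannot modify PTEs; it can only succeed if the bits are
 * already set (points 1 and 2). */
static bool gup_fast_ok(const struct pte *pte, bool write)
{
    return pte->present && pte->young && (!write || pte->dirty);
}

/* Slow path: with the PTL held, set the bits like a HW access would
 * (point 4). */
static void gup_slow_touch(struct pte *pte, bool write)
{
    pte->young = true;
    if (write)
        pte->dirty = true;
}

/* Map into the secondary page table; a dirty SPTE would imply a dirty
 * PTE, so no dirty/accessed sync-back is needed (point 6). */
static bool secondary_mmu_map(struct pte *pte, bool write)
{
    if (!pte->present)
        return false;               /* would need a fault-in first */
    if (!gup_fast_ok(pte, write))
        gup_slow_touch(pte, write);
    return true;
}
```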


One tricky bit, why ordinary GUP modifies the folio and not the PTE, is concurrent HW access. For example, when we want to mark a PTE accessed, it could happen that HW concurrently tries marking the PTE dirty. We must not lose that update, so we have to guarantee an atomic update (maybe avoidable in some cases).

What would be the implications? We'd leave setting folio flags to the MM core. That also implies that if you shut down a VM and zap all anon folios, you wouldn't have to mark any folio dirty: the PTE is dirty, and the MM core can decide to ignore that flag since it will discard the page either way.

Downsides? Likely many I have not yet thought about (TLB flushes, etc.). Just mentioning it because in the context of [1] I was wondering if something that uses MMU notifiers should really be messing with dirty/young flags :)


> I tagged this RFC as it is barely tested, and because I'm not 100% positive
> there isn't some weird edge case I'm missing, which is why I Cc'd David H.
> and Matthew.

We'd be in trouble if something were to detect that all PTEs are clean and therefore clear the folio dirty flag (for example, after writeback). Then, we would write using the SPTE and the folio+PTE would be clean. If we then evicted the "clean" folio that is actually dirty, we would be in trouble.

Well, we would set the SPTE dirty flag, I guess. But I cannot immediately tell whether that one would be synced back to the folio. Would we have a mechanism in place to prevent that?
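The hazard can be made concrete with a tiny model. Everything here is invented for illustration (the `state` struct and the three helpers do not exist in the kernel); it only shows the ordering of events: writeback observes clean PTEs and clears the folio flag, then a guest write through a still-writable SPTE leaves the only dirty state in the SPTE.

```c
/* Toy model (not kernel code) of the race described above: after
 * writeback clears the folio dirty flag based on clean PTEs, a guest
 * write through a still-writable SPTE dirties data that no flag the
 * core MM consults records, unless SPTE dirty state is synced back
 * before eviction. */
#include <assert.h>
#include <stdbool.h>

struct state { bool pte_dirty, folio_dirty, spte_writable, spte_dirty; };

/* Writeback completion: all PTEs observed clean, clear the folio flag. */
static void writeback_finish(struct state *s)
{
    if (!s->pte_dirty)
        s->folio_dirty = false;
}

/* Guest write through the secondary MMU: only the SPTE records it. */
static void guest_write(struct state *s)
{
    if (s->spte_writable)
        s->spte_dirty = true;
}

/* Evicting now would discard data that only the SPTE knows is dirty. */
static bool evict_would_lose_data(const struct state *s)
{
    return !s->folio_dirty && s->spte_dirty;
}
```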


> Note, I'm going to be offline from ~now until April 1st. I rushed this out
> as it could impact David S.'s kvm_follow_pfn series[*], which is imminent.
> E.g. if KVM stops marking pages dirty and accessed everywhere, adding
> SPTE_MMU_PAGE_REFCOUNTED just to sanity check that the refcount is elevated
> seems like a poor tradeoff (medium complexity and annoying to maintain, for
> not much benefit).
> 
> Regarding David S.'s series, I wouldn't be at all opposed to going even
> further and having x86 follow all architectures by marking pages accessed
> _only_ at map time, at which point I think KVM could simply pass in FOLL_TOUCH
> as appropriate, and thus dedup a fair bit of arch code.

FOLL_TOUCH is weird (excluding weird devmap stuff):

1) For PTEs (follow_page_pte), we set the page dirty and accessed, and
do not modify the PTE. For THP (follow_trans_huge_pmd), we set the
PMD young/dirty and don't mess with the folio.

2) FOLL_TOUCH is not implemented for hugetlb.

3) FOLL_TOUCH is not implemented for GUP-fast.

I'd leave that alone :)


[1] https://lore.kernel.org/lkml/20230727212845.135673-1-david@xxxxxxxxxx/T/#u
--
Cheers,

David / dhildenb