Re: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function

From: Sean Christopherson
Date: Mon Feb 05 2024 - 20:25:34 EST


On Tue, Oct 03, 2023, Maxim Levitsky wrote:
> У пн, 2023-09-11 у 11:16 +0900, David Stevens пише:
> > From: David Stevens <stevensd@xxxxxxxxxxxx>
> >
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> >
> > Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>
> > ---
> > include/linux/kvm_host.h | 16 ++++
> > virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> > virt/kvm/kvm_mm.h | 3 +-
> > virt/kvm/pfncache.c | 10 ++-
> > 4 files changed, 123 insertions(+), 77 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index fb6c6109fdca..c2e0ddf14dba 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> >
> > /*
> > * error pfns indicate that the gfn is in slot but faild to
> > @@ -1177,6 +1178,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +struct kvm_follow_pfn {
> > + const struct kvm_memory_slot *slot;
> > + gfn_t gfn;
> > + unsigned int flags;
> It might be useful for the future readers to have a note about which values
> the flags can take. (e.g one of the 'FOLL_* flags).

+1. It doesn't have to (probably shouldn't?) say _which_ FOLL_* flags are supported
(I forget if there was going to be a restriction). Just a comment explaining that
it's used to pass various FOLL_* flags.

> > + bool atomic;
>
> I wish we had FOLL_ATOMIC, because there is a slight usability regression in
> regard to the fact, that now some of the flags are here and in the 'atomic'
> variable.
>
>
> > + /* Try to create a writable mapping even for a read fault */
> > + bool try_map_writable;
> > +
> > + /* Outputs of __kvm_follow_pfn */
> > + hva_t hva;
> > + bool writable;
> > +};
>
>
> Another small usability note. I feel like the name 'follow_pfn' is not the
> best name for this.
>
> I think ultimately it comes from 'follow_pte()' and even that name IMHO is
> incorrect. the 'follow_pte' should be called 'lookup_kernel_pte', because that
> is what it does - it finds a pointer to pte of hva in its process's kernel
> page tables.

Yeah, I 100% agree follow_pte() is a bad name (I suggested kvm_follow_pfn()), but
for me, this falls into the "if you can't beat 'em, join 'em" scenario. It's kinda
like the XKCD comic about 14 standards; coming up with a new name because the
existing one sucks doesn't make the world any better, it's just one more less
than perfect name for developers to remember :-)

> IMHO, the kvm_follow_pfn struct should be called something like
> gfn_to_pfn_params, because it specifies on how to convert gfn to pfn (or
> create the pfn if the page was swapped out).

Again, I don't disagree in a vacuum, but I want the name of the struct to be
tightly coupled to the API, e.g. so that it's super obvious where in KVM's flows
the struct is used, at the expense of making it less obviously how exactly said
flow uses the struct.

> Same note applies to '__kvm_follow_pfn()'
>
> If you really want to keep the original name though, I won't argue over this.
>
> > +
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > +
> > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> > bool *writable);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ee6090ecb1fe..9b33a59c6d65 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2512,8 +2512,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > * true indicates success, otherwise false is returned. It's also the
> > * only part that runs if we can in atomic context.
> > */
> > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > - bool *writable, kvm_pfn_t *pfn)
> > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > {
> > struct page *page[1];
> >
> > @@ -2522,14 +2521,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > * or the caller allows to map a writable pfn for a read fault
> > * request.
> > */
> > - if (!(write_fault || writable))
> > + if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
> > return false;
>
> A small note: the 'foll' variable and the FOLL_* flags have different
> meaning: foll is the pointer to a new 'struct kvm_follow_pfn' while FOLL_ is
> from the folio API, I think.
>
> Maybe we should rename the 'foll' to something, like 'args' or something like
> that?

Hmm, I was going for something similar to "struct kvm_page_fault *fault" (this
was another suggestion of mine). I don't love args, mainly because the usage
isn't tied back to the struct name, e.g. deep in hva_to_pfn_remapped() and friends,
"args" starts to lose context/meaning.

Looking at this with fresh eyes, I still like "foll", though I agree it's far
from ideal. Maybe an acronym? "kfp" isn't used in the kernel, AFAICT. I'd vote
for "foll" over "kfp", but I'm ok with either (or whatever, so long as the name
is tied back to the struct in some way, i.e. not super generic).

> > - /* map read fault as writable if possible */
> > - if (unlikely(!write_fault) && writable) {
> > + if (foll->flags & FOLL_WRITE) {
> > + foll->writable = true;
> > + } else if (foll->try_map_writable) {
> > struct page *wpage;
> >
> > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > - *writable = true;
> > + /* map read fault as writable if possible */
> > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > + foll->writable = true;
> > put_page(page);
> > page = wpage;
>
> Regardless of this patch, I am wondering, what was the reason to first map the
> page in the same way as requested and then try to map it as writable.
>
> Since the vast majority of the guest pages are likely to be writable, isn't
> it better to first opportunistically map them as writable and if that fails,
> then try to map readonly?

KVM actually does do that. hva_to_pfn_fast() tries to map WRITABLE, and then
only falls back to the slow path if that fails.

As for why KVM doesn't "try" to faultin the hva as writable, that would break
CoW and probably KSM as well. I.e. if KVM _asked_ for a writable mapping instead
of opportunistically seeing if the primary MMU created a writable mapping.