Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
From: Marc Zyngier
Date: Wed Mar 26 2025 - 14:02:36 EST
On Wed, 26 Mar 2025 16:10:45 +0000,
Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Mar 26, 2025, Marc Zyngier wrote:
> > On Wed, 26 Mar 2025 14:53:34 +0000,
> > Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Mar 26, 2025, Ankit Agrawal wrote:
> > > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > > > > > the toys...
> > > > > > > >
> > > > > > > > Fixing it is either faulting all access attempts or mapping it
> > > > > > > > cacheable to the S2 (as this series is trying to do)..
> > > > > > >
> > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > > > > > hardware (or rather reject the memslot creation), cacheable S2
> > > > > > > otherwise.
> > > > > >
> > > > > > I have no objection, Ankit are you able to make a failure patch?
> > > > >
> > > > > I'd wait until the KVM maintainers have their say.
> > > > >
> > > >
> > > > Maz, Oliver, any thoughts on this? Can we agree on creating this failure
> > > > patch for memslot creation?
> > >
> > > That's not sufficient. As pointed out multiple times in this thread, any checks
> > > done at memslot creation are best effort "courtesies" provided to userspace to
> > > avoid terminating running VMs when the memory is faulted in.
> > >
> > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is
> > > not.
> > >
> > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB,
> > > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB,
> > > KVM can simply honor the cacheability based on the VMA. Neither of those requires
> >
> > Remind me how this works with stuff such as guest_memfd, which, by
> > definition, doesn't have a userspace mapping?
>
> Definitely not through a memslot flag. The cacheability would be a property of
> the guest_memfd inode, similar to how it's a property of the underlying device
> in this case.
It's *not* a property of the device. It's a property of the mapping.
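To make that concrete: a stage-2 fault handler can sample the attribute
from the stage-1 VMA rather than from anything device-specific. Rough
sketch only, with a made-up helper name, not the actual code from this
series:

#include <linux/mm.h>
#include <asm/memory.h>
#include <asm/pgtable-hwdef.h>

/*
 * Illustration only: derive the memory type from the mapping (the VMA),
 * not from the device backing it. Simplified; it ignores e.g.
 * MT_NORMAL_TAGGED.
 */
static bool vma_is_cacheable(struct vm_area_struct *vma)
{
	u64 attridx = pgprot_val(vma->vm_page_prot) & PTE_ATTRINDX_MASK;

	return attridx == PTE_ATTRINDX(MT_NORMAL);
}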
> I don't entirely see what guest_memfd has to do with this.
You were the one mentioning sampling the cacheability via the VMA. As
far as I understand guest_memfd, there is no VMA to speak of.
> One of the big
> advantages of guest_memfd is that KVM has complete control over the lifecycle of
> the memory. IIUC, the issue with !FWB hosts is that KVM can't guarantee there
> are valid host mappings when memory is unmapped from the guest, and so can't do
> the necessary maintenance. I agree with Jason's earlier statement that that's a
> solvable kernel flaw.
>
> For guest_memfd, KVM already does maintenance operations when memory is reclaimed,
> for both SNP and TDX. I don't think ARM's cacheability stuff would require any
> new functionality in guest_memfd.
I don't know how you reconcile the lack of host mapping and cache
maintenance. The latter cannot take place without the former.
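To spell out why: the CMO helpers work on kernel virtual addresses, so
the page has to be mapped in the host before anything can be cleaned or
invalidated. A minimal sketch, not actual KVM code; kmap_local_page()
is just one way of getting such a mapping:

#include <linux/highmem.h>
#include <asm/cacheflush.h>

static void clean_guest_page_to_poc(struct page *page)
{
	/* A host mapping is needed to get a VA the CMOs can operate on. */
	void *va = kmap_local_page(page);

	/* Clean+invalidate the page to the Point of Coherency, by VA. */
	dcache_clean_inval_poc((unsigned long)va,
			       (unsigned long)va + PAGE_SIZE);

	kunmap_local(va);
}

Without a mapping to hand, there is simply no VA to feed to the
maintenance instructions.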
>
> > > a memslot flag. A KVM capability to enumerate FWB support would be nice though,
> > > e.g. so userspace can assert and bail early without ever hitting an
> > > ioctl error.
> >
> > It's not "nice". It's mandatory. And FWB is definitely *not* something
> > we want to expose as such.
>
> I agree a capability is mandatory if we're adding a memslot flag, but I don't
> think it's mandatory if this is all handled through kernel plumbing.
It is mandatory, full stop. Otherwise, userspace is able to migrate a
VM from an FWB host to a non-FWB one, start the VM, and blow up on the
first page fault. That's not an acceptable outcome.
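Roughly, a VMM would then probe before starting or resuming the VM,
something like the sketch below. The capability name is invented purely
for illustration; only KVM_CHECK_EXTENSION itself exists today:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical capability number, for illustration only. */
#define KVM_CAP_ARM_CACHEABLE_PFNMAP	999

static void require_cacheable_pfnmap(int kvm_fd)
{
	int ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION,
			KVM_CAP_ARM_CACHEABLE_PFNMAP);

	if (ret <= 0) {
		fprintf(stderr,
			"host cannot map this device cacheable at stage-2\n");
		exit(EXIT_FAILURE);
	}
}

That way the failure is a clean refusal at migration time, not a
blown-up guest on its first fault.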
>
> > > If we want to support existing setups that happen to work by dumb luck or careful
> > > configuration, then that should probably be an admin decision to support the
> > > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag.
> >
> > No. That's not how we handle an ABI issue. VM migration, with and
> > without FWB, can happen in both directions, and must have clear
> > semantics. So NAK to a kernel parameter.
> >
> > If I have a VM with a device mapped as *device* on FWB host, I must be
> > able to migrate it to non-FWB host, and back. A device mapped as
> > *cacheable* can only be migrated between FWB-capable hosts.
>
> But I thought the whole problem is that mapping this fancy memory as device is
> unsafe on non-FWB hosts? If it's safe, then why does KVM need to reject anything
> in the first place?
I don't know where you got that idea. This is all about what memory
type is exposed to a guest:
- with FWB, no need for CMOs, so cacheable memory is allowed if the
device supports it (i.e. it actually exposes memory), and device
otherwise.
- without FWB, CMOs are required, and we don't have a host mapping for
these pages. As a fallback, the mapping is device only, as this
doesn't require any CMO by definition.
There is no notion of "safety" here.
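Condensed into a sketch (the two booleans are stand-ins for whatever KVM
actually derives from the CPU and from the mapping; none of the names
below are real):

enum s2_memtype { S2_DEVICE, S2_NORMAL_CACHEABLE };

static enum s2_memtype stage2_memtype(bool has_fwb, bool vma_cacheable)
{
	/* With FWB no CMOs are needed, so honour a cacheable mapping. */
	if (has_fwb && vma_cacheable)
		return S2_NORMAL_CACHEABLE;

	/*
	 * Without FWB, CMOs would be required but there is no host
	 * mapping to perform them on, so fall back to Device, which
	 * needs no CMOs by definition.
	 */
	return S2_DEVICE;
}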
> > Importantly, it is *userspace* that is in charge of deciding how the
> > device is mapped at S2. And the memslot flag is the correct
> > abstraction for that.
>
> I strongly disagree. Whatever owns the underlying physical memory is in charge,
> not userspace. For memory that's backed by a VMA, userspace can influence the
> behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state
> from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd.
I don't buy that. Userspace needs to know the semantics of the memory
it gives to the guest. Or at least discover that the same device
plugged into two different hosts will have different behaviours. Just
letting things rip is not an acceptable outcome.
> I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g.
> to disallow mapping memory as writable even if the VMA is writable. But IMO,
> adding a memslot flag to control cacheability isn't purely subtractive.
I don't see how that solves the problem at hand: given the presence or
absence of FWB, allow userspace to discover as early as possible what
behaviour a piece of memory provided by a device will have, and
control it to handle migration.
M.
--
Without deviation from the norm, progress is not possible.