Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory

From: Jason Gunthorpe
Date: Fri Oct 13 2023 - 09:45:52 EST


On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > > Claiming back the device also seems strange if the guest has been using
> > > > non-cacheable accesses since I think you could get write merging and
> > > > reordering with subsequent device accesses trying to reset the device.
> > >
> > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> >
> > We do have a good story for this part: use Device-nGnRE!
>
> Don't we actually need Device-nGnRnE for this, coupled with a DSB for
> endpoint completion?
>
> Device-nGnRE may be sufficient as a read from that device would ensure
> that the previous write is observable (potentially with a DMB if
> accessing separate device regions) but I don't think we do this now
> either. Even this, isn't it device-specific? I don't know enough about
> PCIe, posted writes, reordering, maybe others can shed some light.
>
> For Normal NC, if the access doesn't have side-effects (or rather the
> endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
> TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
> was pushed sufficiently far as not to affect subsequent writes by other
> CPUs.
>
> For I/O accesses that change some state of the device, I'm not sure the
> TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
> nE + DSB as long as the PCIe device plays along nicely.

Can someone explain this concern a little more simply please?

Let's try something simpler. I have no KVM. My kernel driver
creates a VMA with pgprot_writecombine (NormalNC). Userspace does a
write to the NormalNC mapping and immediately unmaps the VMA.

What is the issue?

And then how does making KVM the thing that creates the NormalNC
change this?

Not knowing the whole details, here is my story about how it should work:

Unmapping the VMAs must already have some NormalNC friendly ordering
barrier across all CPUs or we have a bigger problem. This barrier
definitely must close write combining.

VFIO issues a config space write to reset the PCI function. Config
space writes MUST NOT write combine with anything. This is already
impossible for PCIe since they are different TLP types at the PCIe
level.

By the PCIe rules, a config space write must order strictly after all
other CPUs' accesses. Once the reset non-posted write returns back to
VFIO we know that:

1) There is no reference in any CPU page table to the MMIO PFN
2) No CPU has pending data in any write buffer
3) The interconnect and PCIe fabric have no inflight operations
4) The device is in a clean post-reset state

?
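
To make the question concrete, here is a toy userspace model of the
sequence above. It is only a sketch of the argument, not kernel code:
struct mmio_model and every model_*() function are hypothetical names
invented for illustration, and the model simply assumes the two
properties claimed above (the unmap barrier closes write combining, and
the non-posted config write is ordered behind earlier posted writes).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in the kernel. */
struct mmio_model {
	int cpu_mappings;	/* CPU page tables still referencing the MMIO PFN */
	int buffered_writes;	/* stores still held in CPU write-combining buffers */
	int inflight_writes;	/* posted writes in the interconnect / PCIe fabric */
	bool device_dirty;	/* device state changed since the last reset */
};

/* A NormalNC store: allowed to linger in a write buffer. */
static bool model_nc_write(struct mmio_model *m)
{
	if (m->cpu_mappings == 0)
		return false;	/* no mapping left, the access would fault */
	m->buffered_writes++;
	return true;
}

/*
 * Unmap + TLB invalidate + barrier. ASSUMPTION: the barrier closes write
 * combining, so buffered stores are pushed out into the fabric.
 */
static void model_unmap_all(struct mmio_model *m)
{
	m->cpu_mappings = 0;
	m->inflight_writes += m->buffered_writes;
	m->buffered_writes = 0;
}

/* Conditions 1) to 4) from the list above. */
static bool model_is_clean(const struct mmio_model *m)
{
	return m->cpu_mappings == 0 && m->buffered_writes == 0 &&
	       m->inflight_writes == 0 && !m->device_dirty;
}

/*
 * Non-posted config write that resets the function. ASSUMPTION: it is
 * ordered behind earlier posted writes, so those land first and are then
 * wiped by the reset. Returns true if all four conditions hold afterwards.
 */
static bool model_config_reset(struct mmio_model *m)
{
	if (m->inflight_writes > 0) {
		m->device_dirty = true;	/* earlier writes reach the device */
		m->inflight_writes = 0;
	}
	m->device_dirty = false;	/* reset returns it to a clean state */
	return model_is_clean(m);
}

/* The sequence described above: write, unmap, then reset. */
static void model_teardown_demo(void)
{
	struct mmio_model m = { .cpu_mappings = 1 };

	assert(model_nc_write(&m));
	model_unmap_all(&m);
	assert(model_config_reset(&m));
}
```

In this model, resetting without unmapping first leaves a mapping and a
buffered store behind, so model_config_reset() reports false. If the
real hardware can violate either assumption, that is the case I would
like to understand.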

> knows all the details. The safest is for the VMM to keep it as Device (I
> think vfio-pci goes for the strongest nGnRnE).

We are probably going to allow VFIO to let userspace pick if it should
be pgprot_device or pgprot_writecombine.

The alias issue could be resolved by teaching KVM how to insert a
physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
the PFNs are never mapped in the hypervisor side.

Jason