Re: [RFC PATCH 1/3] content: Add VIRTIO_F_SWIOTLB to negotiate use of SWIOTLB bounce buffers

From: David Woodhouse
Date: Fri Apr 04 2025 - 07:16:06 EST


On Fri, 2025-04-04 at 06:37 -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 11:15:33AM +0100, David Woodhouse wrote:
> > On Fri, 2025-04-04 at 10:27 +0100, David Woodhouse wrote:
> > > On 4 April 2025 09:32:39 BST, "Michael S. Tsirkin" <mst@xxxxxxxxxx>
> > > wrote:
> > > > On Fri, Apr 04, 2025 at 09:16:44AM +0100, David Woodhouse wrote:
> > > > > On Fri, 2025-04-04 at 04:09 -0400, Michael S. Tsirkin wrote:
> > > > > > On Fri, Apr 04, 2025 at 08:50:47AM +0100, David Woodhouse
> > > > > > wrote:
> > > > > > > What's annoying is that this should work out of the box
> > > > > > > *already* with
> > > > > > > virtio-mmio and a `restricted-dma-pool` — for systems which
> > > > > > > aren't
> > > > > > > afflicted by UEFI/ACPI/PCI as their discovery mechanisms.
> > > > > >
> > > > > >
> > > > > > That specifically would be just a driver bugfix then?
> > > > >
> > > > > I actually think it works out of the box and there isn't even a
> > > > > bug to
> > > > > fix. Haven't tested yet.
> > > > >
> > > > > The sad part is that the system does it all automatically *if* it
> > > > > has
> > > > > CONFIG_DMA_RESTRICTED_POOL (e.g. Linux) and the driver never even
> > > > > notices that the dma_ops it's using are the swiotlb ops using the
> > > > > provided buffer.
> > > > >
> > > > > Which is *kind* of nice... except that when on a guest OS which
> > > > > *isn't*
> > > > > Linux with CONFIG_DMA_RESTRICTED_POOL, the guest will just ignore
> > > > > the
> > > > > `restricted-dma-pool` node and try DMA to system memory anyway,
> > > > > which
> > > > > will fail.
> > > >
> > > > I mean, it's easy to misconfigure Linux, this is why we love it ;)
> > > > Why
> > > > is this such a concern?
> > >
> > > Because it's incompatible. In the DT world, perhaps this new *non-
> > > optional* feature/restriction should have come with a new
> > > "compatible" string such as "virtio-mmio-restricted-dma".
> > >
> > > Adding it without backwards compatibility wasn't ideal.
> > >
> > > > > That's why my proposal adds the negotiated VIRTIO_F_SWIOTLB
> > > > > feature, so
> > > > > that the device side can refuse, if the guest *isn't* agreeing to
> > > > > use
> > > > > the bounce buffer in the situations where it must do so.
> > > >
> > > >
> > > > OTOH then setting this feature and if you make the device force it,
> > > > you are breaking guests restricted-dma-pool which worked
> > > > previously, no?
> > >
> > > Yes. So a platform offering virtio-mmio with restricted DMA, if the
> > > driver doesn't accept the offered VIRTIO_F_SWIOTLB, may want to
> > > accept that negotiation anyway, and *hope* that the driver/OS are
> > > going to use the buffer anyway.
> > >
> > > I just didn't want to make that same mistake again when formalising
> > > and documenting this, and especially when attempting to extend it to
> > > PCI.
> >
> > Of course, the beauty of the restricted-dma-pool as supported by DT is
> > that it's a *system* memory buffer, which is actually OK as long as
> > it's reserved address space and not just part of normal system memory
> > that an unsuspecting guest might use for general purposes. So the
> > trusted part of the hypervisor (e.g. pKVM) can *allow* the VMM access
> > to that space.
> >
> > It doesn't *have* to be on-device. That just seemed like the more
> > natural way to do it for PCI.
> >
> > I suppose we *could* allow for the virtio-pci transport to do it the
> > same way as virtio-mmio though. The VIRTIO_PCI_CAP_SWIOTLB capability¹
> > could reference a range of system memory space, just like the
> > `restricted-dma-pool` property does.
> >
> > It's a weird abstraction especially for a physical PCI device to do
> > that because the system memory space is outside its ownership. But in a
> > physical device it could be writable, and you could consider it the
> > responsibility of the system firmware to configure it appropriately, in
> > accordance with the IOMMU and other DMA restrictions of the platform.
> >
> > That does solve it for the CoCo case without addressing the P2P staging
> > case that Christoph mentions, though.
> >
> >
> > ¹ I will rename it, Christoph, if it survives at all. Probably
> > VIRTIO_F_RESTRICTED_DMA and VIRTIO_PCI_CAP_RESTRICTED_DMA but of course
> > it depends on the semantics we conclude it should have.
>
> OK. So basically, all this does, is a promise by driver to only
> DMA into a range of memory?

Basically, yes.

> This part, I get. I wouldn't put it in a capability, just in config
> space then.

Sure... but how? There are some things which are defined to be at fixed
locations in config space, like the vendor/device IDs, COMMAND, STATUS,
BARs, etc..

And for the rest of the optional things which might be in config space
of a given device... isn't that exactly what capabilities are for?

> What I don't get, is what does the *device* want, exactly?

The device wants to know that a driver won't try to use it without
understanding the restriction. Because otherwise the driver will just
give it system addresses for DMA and be sad, without any coherent
error/failure report about why.

(You could ask the same question about what the *device* wants with
VIRTIO_F_ACCESS_PLATFORM, and the answer is much the same).

Or maybe not the *device* per se, but the *system integrator* wants to
know that only operating systems which understand the restriction
described above, will attempt to drive the device in question.

We could achieve that by presenting the device with a completely new
PCI device/vendor ID so that old drivers don't match, or in the DT
model you could make a new "compatible" string for it. I chose to use a
VIRTIO_F_ bit for it instead, which seemed natural and allows the
device model (under the influence of the system integrator) to *choose*
whether a failure to negotiate such bit is fatal or not.

> Here's a vague idea to explain the question:
>
> some embedded devices can have addressing restrictions, such
> as the number of bits of an address.
> In another example, the legacy balloon device only supports addresses
> up to 48 bit.
> These can still be useful if driver does not use the inaccessible
> addresses.
>
>
>
> So far so good? Does this look like a generalization of your idea?

Makes sense.

> Now, a question: DMA API under linux at least, can actually work
> around device limitations using a bounce buffer. It does, however,
> need to know what to work around.
> So, maybe the device needs to expose a range?
> But what is the range in your case?

The main thing in the CoCo case is that the range in question must not
contain any memory which an unenlightened guest might treat as normal
system RAM (because it has to be accessible by the VMM, and that would
make it insecure if it's being used a general-purpose RAM).

So on x86 it might be an e820-reserved region for example.

I don't think we want the guest OS just *assuming* that there's usable
memory in that e820-reserved region, just because some device says that
it's actually capable of DMA to those addresses.

So it would probably want a separate object, like the separate
`restricted-dma-pool` in DT, which explicitly identifies that range as
a DMA bounce-buffer pool. We probably *can* do that even in ACPI with a
PRP0001 device today, using a `restricted-dma-pool` compatible
property.

Then the OS would need to spot this range in the config space, and say
"oh, I *do* have a swiotlb pool this device can reach", and use that.

Which is slightly less pretty than the DT model where the device's node
explicitly references the handle of the `restricted-dma-pool` node, but
I think that's OK. And it does kind of keep "device" capability
separate from "system" properties, in a comprehensible way.

> Maybe the new registers list the range of addresses device can access,
> and driver promises to be within that range by negotiating the
> feature bit?

That should work for the CoCo use case, yes.

Attachment: smime.p7s
Description: S/MIME cryptographic signature