RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

From: Tian, Kevin
Date: Wed Aug 19 2020 - 03:29:59 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Tuesday, August 18, 2020 7:50 PM
>
> On Tue, Aug 18, 2020 at 01:09:01AM +0000, Tian, Kevin wrote:
> > The difference in my reply is not just about the implementation gap
> > of growing a userspace DMA framework to a passthrough framework.
> > My real point is about the different goals that each wants to achieve.
> > Userspace DMA is purely about allowing userspace to directly access
> > the portal and do DMA, but the wq configuration is always under kernel
> > driver's control. On the other hand, passthrough means delegating full
> > control of the wq to the guest and then associated mock-up (live migration,
> > vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> > value of mixing them together when there is already a good candidate
> > to handle passthrough...
>
> In Linux a 'VM' and virtualization has always been a normal system
> process that uses a few extra kernel features. This has been more or
> less the cornerstone of that design since the start.
>
> In that view it doesn't make any sense to say that uAPI from idxd that
> is useful for virtualization somehow doesn't belong as part of the
> standard uAPI.

The point is that we already have a more standard uAPI (VFIO), which
is unified and vendor-agnostic to userspace. Creating an idxd-specific
uAPI to absorb requirements that VFIO already covers is not
compelling, and it causes more trouble for Qemu and other VMMs,
as they would have to deal with every such driver uAPI even though
Qemu itself has no interest in the device details (the real user is
inside the guest).
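
For illustration, the device-agnostic flow a VMM follows with VFIO
looks roughly like the sketch below. The group number and mdev UUID
are placeholders, and error handling is omitted; the point is that the
same sequence works for any vendor's device, idxd included:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  int main(void)
  {
          /* "/dev/vfio/26" and the UUID below are placeholders */
          int container = open("/dev/vfio/vfio", O_RDWR);
          int group = open("/dev/vfio/26", O_RDWR);

          /* attach the group to the container, pick an IOMMU model */
          ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
          ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

          /* for a mediated device the device name is its mdev UUID */
          int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
                             "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001");

          struct vfio_device_info info = { .argsz = sizeof(info) };
          ioctl(device, VFIO_DEVICE_GET_INFO, &info);
          /* ...then VFIO_DEVICE_GET_REGION_INFO + mmap() for MMIO */

          return 0;
  }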

>
> Especially when it is such a small detail like what APIs are used to
> configure the wq.
>
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?
>

I'm not familiar with how container suspend/resume is done today,
but my gut feeling is that it's different from virtualization. For
virtualization, the whole wq is assigned to the guest, so the uAPI
must provide a way to save the wq state (including its configuration)
at suspend, and then restore that state to what the guest expects at
resume. In the container case, which does userspace DMA, the wq is
managed by the host kernel and may be shared between multiple
containers, so the wq state is irrelevant to the container. The only
relevant state is the in-flight work, which needs a draining
interface. In this view the two have a major difference.
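
To make the contrast concrete, here is a hypothetical sketch. None of
the ioctls or structs below exist in the idxd driver; the names are
invented purely to illustrate how little state the container case
needs compared with the passthrough case:

  #include <sys/ioctl.h>

  /* hypothetical: the full wq configuration a guest may have set */
  struct idxd_wq_state {
          unsigned int size;
          unsigned int priority;
          unsigned int pasid_en;  /* vSVA */
          /* ... interrupt routing, enabled state, etc. */
  };

  /* hypothetical ioctls, for illustration only */
  #define IDXD_WQ_DRAIN         _IO('W', 0)
  #define IDXD_WQ_SAVE_STATE    _IOR('W', 1, struct idxd_wq_state)
  #define IDXD_WQ_RESTORE_STATE _IOW('W', 2, struct idxd_wq_state)

  void container_suspend(int wq_fd)
  {
          /* wq config stays under host kernel control; only the
           * in-flight work matters, so a drain barrier is enough */
          ioctl(wq_fd, IDXD_WQ_DRAIN);
  }

  void vm_suspend_resume(int mdev_fd)
  {
          /* the guest owns the wq, so its whole configuration must
           * be captured and reinstated exactly as the guest left it */
          struct idxd_wq_state state;

          ioctl(mdev_fd, IDXD_WQ_SAVE_STATE, &state);
          /* ...suspend / migrate... */
          ioctl(mdev_fd, IDXD_WQ_RESTORE_STATE, &state);
  }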

Thanks
Kevin