RE: Virtualizing MSI-X on IMS via VFIO

From: Tian, Kevin
Date: Wed Jun 23 2021 - 22:41:39 EST


> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Sent: Thursday, June 24, 2021 9:19 AM
>
> Kevin!
>
> On Wed, Jun 23 2021 at 23:37, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> >> > Curious about irte entry when IRQ remapping is enabled. Is it also
> >> > allocated at request_irq()?
> >>
> >> Good question. No, it has to be allocated right away. We stick the
> >> shutdown vector into the IRTE and then request_irq() will update it with
> >> the real one.
> >
> > There are max 64K irte entries per Intel VT-d. Do we consider it as
> > a limited resource in this new model, though it's much more than
> > CPU vectors?
>
> It's surely a limited resource. For me 64k entries seems to be plenty,
> but what do I know. I'm not a virtualization wizard.
>
> > Back to earlier discussion about guest ims support, you explained a layered
> > model where the paravirt interface sits between msi domain and vector
> > domain to get addr/data pair from the host. In this way it could provide
> > a feedback mechanism for both msi and ims devices, thus not specific
> > to ims only. Then considering the transition window where not all guest
> > OSes may support paravirt interface at the same time (or there are
> > multiple paravirt interfaces which takes time for host to support all),
> > would below staging approach still makes sense?
> >
> > 1) Fix the lost interrupt issue in existing MSI virtualization flow;
>
> That _cannot_ be fixed without a hypercall. See my reply to Alex.

The lost interrupt issue is caused by resizing based on a stale
impression of vector exhaustion.

With your explanation, this issue can be partially fixed by having Qemu
allocate all possible irqs when the guest enables msi-x and never resize
them until the guest disables msi-x.

The remaining problem is that there is no feedback to block the guest's
request_irq() in case of vector shortage. This has to be solved via a
paravirt interface, but fixing lost interrupts alone is still a step
forward for guests which don't implement the paravirt interface.
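
As a rough illustration of the difference between resizing and allocating
everything up front, here is a toy model. It is not Qemu or VFIO code: every
class and function name is hypothetical, and it assumes that growing the
allocation means tearing the existing host irqs down and setting them up
again, leaving a window in which a device interrupt has nothing behind it.

```python
# Toy model only: names are hypothetical, not Qemu or VFIO APIs.

class ToyHostIrqs:
    """Tracks which MSI-X entries currently have a host irq behind them."""

    def __init__(self):
        self.backed = set()     # entries with a host irq allocated
        self.delivered = []     # interrupts that reached the guest
        self.lost = []          # interrupts raised while the entry was unbacked

    def raise_irq(self, entry):
        (self.delivered if entry in self.backed else self.lost).append(entry)


def enable_with_resize(host, entry, device_activity=()):
    """Grow the allocation by tearing everything down and setting it up again."""
    wanted = host.backed | {entry}
    host.backed = set()             # teardown: nothing is backed in this window
    for e in device_activity:       # device fires while the resize is in flight
        host.raise_irq(e)
    host.backed = wanted


def enable_all_up_front(host, table_size):
    """Allocate every possible entry once, when the guest enables MSI-X."""
    host.backed = set(range(table_size))


resizing = ToyHostIrqs()
enable_with_resize(resizing, 0)
enable_with_resize(resizing, 1, device_activity=[0])   # entry 0 fires mid-resize

fixed = ToyHostIrqs()
enable_all_up_front(fixed, table_size=4)
fixed.raise_irq(0)              # using entry 1 later needs no resize at all

print("resizing lost:", resizing.lost)
print("up-front lost:", fixed.lost)
```

In this model the resizing path loses the interrupt raised during the
teardown window, while the up-front path has no such window. Neither path
tells the guest anything about a host-side vector shortage.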

>
> > 2) Virtualize MSI-X on IMS, bearing the same request_irq() problem;
>
> That solves what? Maybe your perceived roadmap problem, but certainly
> not any technical problem in the right way. Again: See my reply to Alex.

Not about roadmap. See explanation below.

>
> > 3) Develop a paravirt interface to solve request_irq() problem for
> > both msi and ims devices;
>
> First of all it's not a request_irq() problem: It's a plain resource
> management problem which requires proper interaction between host and
> guest.

sure.

>
> And yes, it _is_ the correct answer to the problem and as I outlined in
> my reply to Alex already it is _not_ rocket science and it won't make a
> significant difference on your timeline because it's straight forward
> and solves the problem properly with the added benefit to solve existing
> problems which should and could have been solved long ago.
>
> I don't care at all about the time you are wasting with half baken
> thoughts about avoiding to do the right thing, but I very much care
> about my time wasted to debunk them.
>

I'm really not thinking from a roadmap angle, and I actually
very much appreciate all of your comments on the right direction.

All my comments are purely based on possible usage scenarios. I will give
more explanation below and hope you can consider it as a thought
exercise to compose the full picture based on your guidance, instead of
seeking half-baked ideas to waste your time. 😊

At any time guest OSes can be categorized into three classes:

a) those which don't implement any paravirt interface for vector allocation;

b) those which implement a paravirt interface that is supported by KVM;

c) those which implement a paravirt interface that is not supported by KVM;

The transition phase from c) to b) is undefined, but it does exist more
or less. For example, a Windows guest will never implement the interface
defined between Linux guest and Linux host. It will have its own Hyper-V
variation, which will likely take time for KVM to emulate and claim support.

The transition from a) to b) or a) to c) is a guest-side choice. It's not
controlled by the host world.

Here I didn't further differentiate whether a guest OS supports ims, since
once a supported paravirt interface is in place both msi and ims can get
the necessary feedback info from the host side.

Then let's look at the host side:

1) kernel versions before we make any of the discussed changes:

This is a known broken world as you explained. irq resizing could
lead to lost interrupts in all three guest classes. The only mitigation
is to document this limitation somewhere.

We'll not enable ims based on this broken framework.

2) kernel versions after we make a clean refactoring:

a) For a guest OS which doesn't implement a paravirt interface:
c) For a guest OS which implements a paravirt interface not
supported by KVM:

You confirmed that recent kernels (4.15 and later) all use
reservation mode to avoid vector exhaustion. So VFIO can
define a new protocol asking its userspace to disable resizing
by allocating all possible irqs when guest msix is enabled. This
is one step forward by fixing the lost interrupt issue and is what
step 1) in my proposal tries to achieve.

But there remains a limitation, as no feedback is provided to
the guest to block it when host vectors are in short supply.
That's the reality we have to bear for such guests. VFIO
returns such error info to userspace and lets it decide how to
react.

It's not elegant but it is an improvement over the status quo, and we
do see value in enabling ims-capable devices/subdevices within
such guests, though the guest will just fall back to using msix.
This is step 2) in my proposal;

b) For a guest OS which implements a paravirt interface supported
by KVM:

This is the right framework that you just described. With such an
interface in place, the guest needs to proactively claim resources
from the host side before it can actually enable a specific msi/ims
entry. Everything is well set with cooperation between host/guest.
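
The difference between the a)/c) case and the b) case can be sketched with
another toy model. It shows only what the guest gets to observe when host
vectors run out. All names are hypothetical and no real KVM, VFIO or guest
interface is implied; the "claim" step stands in for whatever the paravirt
interface ends up looking like.

```python
# Toy model only: all names are hypothetical; no real KVM/VFIO/guest
# interface is shown here.

class ToyHost:
    def __init__(self, free_vectors):
        self.free_vectors = free_vectors
        self.userspace_errors = []   # shortages reported to userspace only

    def claim(self):
        """Try to take one host vector; return True on success."""
        if self.free_vectors == 0:
            return False
        self.free_vectors -= 1
        return True


def guest_enable_entry(host, entry, has_supported_paravirt):
    """Return what the guest itself observes when enabling one msi/ims entry."""
    if has_supported_paravirt:
        # Class b): the guest claims the resource first and sees the result.
        return "ok" if host.claim() else "refused"
    # Classes a) and c): there is no feedback path into the guest. A shortage
    # is only reported to userspace, which decides how to react.
    if not host.claim():
        host.userspace_errors.append(entry)
    return "ok"


host_b = ToyHost(free_vectors=1)
paravirt = [guest_enable_entry(host_b, e, True) for e in range(2)]

host_a = ToyHost(free_vectors=1)
legacy = [guest_enable_entry(host_a, e, False) for e in range(2)]

print("class b) guest sees:    ", paravirt)
print("class a)/c) guest sees: ", legacy, "userspace sees:", host_a.userspace_errors)
```

With one free vector, the class b) guest is refused on its second request,
while the class a)/c) guest believes both entries are enabled and only
userspace learns about the shortage.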

If you agree with the above, then my proposal is not about making
half-baked stuff. It's really about splitting tasks by doing the
conservative stuff first, which works for most guests, and then optimizing
things for new guests. And strictly speaking we don't want to do the paravirt
stuff very late, since it's the much cleaner approach in concept. We do plan
to find some resources to initiate a separate design discussion in parallel
with fixing the lost interrupt issue for a) and c).

Does this rationale sound good to you?

Thanks
Kevin