Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

From: Jason Gunthorpe
Date: Fri Nov 06 2020 - 08:14:23 EST


On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> > The interrupt controller is responsible for creating an addr/data pair
> > for an interrupt message. It sets the message format and ensures it
> > routes to the proper CPU interrupt handler. Everything about the
> > addr/data pair is owned by the platform interrupt controller.
> >
> > Devices do not create interrupts. They only trigger the addr/data pair
> > the platform gives them.
>
> I guess we may just be viewing it from different angles. On the x86
> platform, an MSI/IMS-capable device directly composes interrupt
> messages, with the addr/data pair filled in by the OS.

Yes, all platforms work like that. The addr/data pair is *opaque* to
the device. Only the platform interrupt controller component
understands how to form those values.
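
To make "opaque" concrete, here is a minimal sketch of the only thing
a device driver does with the pair: store it verbatim. The register
layout and names are hypothetical; the msi_msg plumbing is the
standard Linux one:

    #include <linux/irq.h>
    #include <linux/msi.h>
    #include <linux/io.h>

    /* Hypothetical register block for one device interrupt slot */
    struct my_ims_slot {
            void __iomem *addr_lo;
            void __iomem *addr_hi;
            void __iomem *data;
    };

    static void my_ims_write_msg(struct irq_data *d, struct msi_msg *msg)
    {
            struct my_ims_slot *slot = irq_data_get_irq_chip_data(d);

            /*
             * The device never interprets these values; it only
             * replays them later as a DMA write to trigger the
             * interrupt.
             */
            writel(msg->address_lo, slot->addr_lo);
            writel(msg->address_hi, slot->addr_hi);
            writel(msg->data, slot->data);
    }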

> If there is no IOMMU remapping enabled in the middle, the message
> just hits the CPU. Is your description possibly from the software
> side, e.g. describing the hierarchical IRQ domain concept?

I suppose you could say that. Technically the APIC doesn't form any
addr/data pairs, but the configuration of the APIC, IOMMU and other
platform components define what addr/data pairs are acceptable.

The IRQ domain stuff broadly puts the responsibility for forming these
values in the IRQ layer, which abstracts all the platform details. In
Linux we expect the platform to provide an IRQ domain that can specify
working addr/data pairs.
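
As a rough illustration (reusing the my_ims_write_msg() helper from
the sketch above; the chip and domain names are made up), a
device-level IMS chip delegates everything except message storage to
its parent in the hierarchy:

    static struct irq_chip my_ims_irq_chip = {
            .name              = "my-ims",
            /* Masking/acking is handled by the parent domain */
            .irq_mask          = irq_chip_mask_parent,
            .irq_unmask        = irq_chip_unmask_parent,
            .irq_ack           = irq_chip_ack_parent,
            /* The only device-specific part: store the message */
            .irq_write_msi_msg = my_ims_write_msg,
    };

    static struct msi_domain_info my_ims_domain_info = {
            .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
            .chip  = &my_ims_irq_chip,
    };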

> I agree with this point, just as pci-hyperv.c works. In concept, a
> Linux guest driver should be able to use IMS when running on Hyper-V.
> There is no such thing for KVM, but possibly one day we will need
> similar stuff. Before that happens the guest could choose to simply
> disallow devmsi by default in the platform code (inventing a
> hypercall just for 'disable' doesn't make sense) and ignore the IMS
> cap. One small open question is whether this can be done in one
> central place. The detection of running as a guest is done in
> arch-specific code. Do we need to disable devmsi for every arch?
>
> But when talking about virtualization it's not good to make
> assumptions about guest behavior. It's perfectly sane to run a guest
> OS which doesn't implement any PV stuff (and thus doesn't know it is
> running in a VM) but does support IMS. In such a scenario the IMS cap
> allows the hypervisor to educate the guest driver to use MSI instead
> of IMS, as long as the driver follows the device spec. In this regard
> I don't think that the IMS cap will be a short-term thing, although
> Linux may choose not to use it.

The IMS flag belongs in the platform not in the devices.

For instance you could put a "disable IMS" flag in the ACPI tables, in
the config space of the emulated root port, or in any other area that
clearly belongs to the platform.

The OS logic would be (a rough sketch in C follows the list):
 - If no IMS disable information is found, use IMS (bare metal)
 - If the IMS disable flag is found:
   - If a (future) hypercall is available and the OS knows how to
     use it, use IMS
   - If no hypercall is found, or the OS doesn't know how to use
     it, fail IMS
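
Something like this, purely as a sketch; none of these helpers exist
today, and both the flag source and the hypercall are hypothetical:

    #include <linux/types.h>

    /* Hypothetical platform queries, declared only for the sketch */
    static bool platform_ims_disable_flag_present(void);
    static bool ims_hypercall_available(void);

    static bool platform_ims_usable(void)
    {
            /*
             * Hypothetical platform-owned "disable IMS" flag, e.g.
             * an ACPI table bit or an emulated root port config bit.
             */
            if (!platform_ims_disable_flag_present())
                    return true;    /* bare metal: use IMS */

            /*
             * The flag is present, so presumably we are a guest. IMS
             * only works if a (hypothetical, future) hypercall exists
             * to obtain addr/data pairs and this OS can issue it.
             */
            if (ims_hypercall_available())
                    return true;

            return false;           /* no hypercall: fail IMS */
    }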

Our devices can use IMS even in pure no-emulation configurations.
Saying that we need to insert complicated, security-sensitive
emulation just to get IMS in the guest is absolutely crazy.

> Do you mind providing the link? There were lots of discussions
> between you and Thomas. I failed to locate the exact mail when
> searching for the above keywords.

Read through these two threads:

https://lore.kernel.org/linux-hyperv/20200821002949.049867339@xxxxxxxxxxxxx/
https://lore.kernel.org/dmaengine/159534734833.28840.10067945890695808535.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxx/

Jason