Re: [RFC KERNEL PATCH v4 3/3] PCI/sysfs: Add gsi sysfs for pci_dev

From: Chen, Jiqian
Date: Fri Mar 01 2024 - 02:57:19 EST


Hi Bjorn,
Looking forward to getting your more inputs and suggestions.
It seems /sys/bus/acpi/devices/PNP0A03:00/ is not a good place to create gsi sysfs.

On 2024/2/15 16:37, Roger Pau Monné wrote:
> On Mon, Feb 12, 2024 at 01:18:58PM -0600, Bjorn Helgaas wrote:
>> On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
>>> On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
>>>> On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
>>>>> On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
>>>>>> On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
>>>>>>> On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
>>>>>>>> On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
>>>>>>>>> On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
>>>>>>>>>> On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>> On 2024/1/24 00:02, Bjorn Helgaas wrote:
>>>>>>>>>>>> On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>>>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>>>>>>>>>>>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>>>>>>>>>>>>>>> There is a need for some scenarios to use gsi sysfs.
>>>>>>>>>>>>>>> For example, when xen passthrough a device to dumU, it will
>>>>>>>>>>>>>>> use gsi to map pirq, but currently userspace can't get gsi
>>>>>>>>>>>>>>> number.
>>>>>>>>>>>>>>> So, add gsi sysfs for that and for other potential scenarios.
>>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't know enough about Xen to know why it needs the GSI in
>>>>>>>>>>>>>> userspace. Is this passthrough brand new functionality that can't be
>>>>>>>>>>>>>> done today because we don't expose the GSI yet?
>>>>>>>>>>
>>>>>>>>>> I assume this must be new functionality, i.e., this kind of
>>>>>>>>>> passthrough does not work today, right?
>>>>>>>>>>
>>>>>>>>>>>>> has ACPI support and is responsible for detecting and controlling
>>>>>>>>>>>>> the hardware, also it performs privileged operations such as the
>>>>>>>>>>>>> creation of normal (unprivileged) domains DomUs. When we give to a
>>>>>>>>>>>>> DomU direct access to a device, we need also to route the physical
>>>>>>>>>>>>> interrupts to the DomU. In order to do so Xen needs to setup and map
>>>>>>>>>>>>> the interrupts appropriately.
>>>>>>>>>>>>
>>>>>>>>>>>> What kernel interfaces are used for this setup and mapping?
>>>>>>>>>>>
>>>>>>>>>>> For passthrough devices, the setup and mapping of routing physical
>>>>>>>>>>> interrupts to DomU are done on Xen hypervisor side, hypervisor only
>>>>>>>>>>> need userspace to provide the GSI info, see Xen code:
>>>>>>>>>>> xc_physdev_map_pirq require GSI and then will call hypercall to pass
>>>>>>>>>>> GSI into hypervisor and then hypervisor will do the mapping and
>>>>>>>>>>> routing, kernel doesn't do the setup and mapping.
>>>>>>>>>>
>>>>>>>>>> So we have to expose the GSI to userspace not because userspace itself
>>>>>>>>>> uses it, but so userspace can turn around and pass it back into the
>>>>>>>>>> kernel?
>>>>>>>>>
>>>>>>>>> No, the point is to pass it back to Xen, which doesn't know the
>>>>>>>>> mapping between GSIs and PCI devices because it can't execute the ACPI
>>>>>>>>> AML resource methods that provide such information.
>>>>>>>>>
>>>>>>>>> The (Linux) kernel is just a proxy that forwards the hypercalls from
>>>>>>>>> user-space tools into Xen.
>>>>>>>>
>>>>>>>> But I guess Xen knows how to interpret a GSI even though it doesn't
>>>>>>>> have access to AML?
>>>>>>>
>>>>>>> On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
>>>>>>> configure the RTE as requested.
>>>>>>
>>>>>> IIUC, mapping a GSI to an IO-APIC pin requires information from the
>>>>>> MADT. So I guess Xen does use the static ACPI tables, but not the AML
>>>>>> _PRT methods that would connect a GSI with a PCI device?
>>>>>
>>>>> Yes, Xen can parse the static tables, and knows the base GSI of
>>>>> IO-APICs from the MADT.
>>>>>
>>>>>> I guess this means Xen would not be able to deal with _MAT methods,
>>>>>> which also contains MADT entries? I don't know the implications of
>>>>>> this -- maybe it means Xen might not be able to use with hot-added
>>>>>> devices?
>>>>>
>>>>> It's my understanding _MAT will only be present on some very specific
>>>>> devices (IO-APIC or CPU objects). Xen doesn't support hotplug of
>>>>> IO-APICs, but hotplug of CPUs should in principle be supported with
>>>>> cooperation from the control domain OS (albeit it's not something that
>>>>> we tests on our CI). I don't expect however that a CPU object _MAT
>>>>> method will return IO APIC entries.
>>>>>
>>>>>> The tables (including DSDT and SSDTS that contain the AML) are exposed
>>>>>> to userspace via /sys/firmware/acpi/tables/, but of course that
>>>>>> doesn't mean Xen knows how to interpret the AML, and even if it did,
>>>>>> Xen probably wouldn't be able to *evaluate* it since that could
>>>>>> conflict with the host kernel's use of AML.
>>>>>
>>>>> Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
>>>>> in our context).
>>>>>
>>>>> Getting back to our context though, what would be a suitable place for
>>>>> exposing the GSI assigned to each device?
>>>>
>>>> IIUC, the Xen hypervisor:
>>>>
>>>> - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
>>>> something running on the Dom0 kernel) to find the physical base
>>>> address and GSI base, e.g., from I/O APIC, I/O SAPIC.
>>>
>>> No, Xen parses the MADT directly from memory, before stating dom0.
>>> That's a static table so it's fine for Xen to parse it and obtain the
>>> I/O APIC GSI base.
>>
>> It's an interesting split to consume ACPI static tables directly but
>> put the AML interpreter elsewhere.
>
> Well, static tables can be consumed by Xen, because thye don't require
> an AML parser (obviously), and parsing them doesn't have any
> side-effects that would prevent dom0 from being the OSPM (no methods
> or similar are evaluated).
>
>> I doubt the ACPI spec envisioned
>> that, which makes me wonder what other things we could trip over, but
>> that's just a tangent.
>
> Indeed, ACPI is not be best interface for the Xen/dom0 split model.
>
>>>> - Needs the GSI to locate the APIC and pin within the APIC. The
>>>> Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
>>>> learn the PCI device -> GSI mapping.
>>>
>>> Yes, Xen doesn't know the PCI device -> GSI mapping. Dom0 needs to
>>> parse the ACPI methods and signal Xen to configure a GSI with a
>>> given trigger and polarity.
>>>
>>>> - Has direct access to the APIC physical base address to program the
>>>> Redirection Table.
>>>
>>> Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
>>> accessible by dom0.
>>>
>>>> The patch seems a little messy to me because the PCI core has to keep
>>>> track of the GSI even though it doesn't need it itself. And the
>>>> current patch exposes it on all arches, even non-ACPI ones or when
>>>> ACPI is disabled (easily fixable).
>>>>
>>>> We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
>>>> we don't know the GSI unless a Dom0 driver has claimed the device and
>>>> called pci_enable_device() for it, which seems like it might not be
>>>> desirable.
>>>
>>> I think that's always the case, as on dom0 devices to be passed
>>> through are handled by pciback which does enable them.
>>
>> pcistub_init_device() labels the pci_enable_device() as a "HACK"
>> related to determining the IRQ, which makes me think there's not
>> really a requirement for the device to be *enabled* (BAR decoding
>> enabled) by dom0.
>
> No, there's no need for memory decoding to be enabled for getting the
> GSI from the ACPI method I would assume. I'm confused by that
> pci_enable_device() call. Is maybe the purpose to make sure the
> device is powered up so that reading the PCI header Interrupt Line and
> Pin fields returns valid values? No idea whether reading those fields
> requires the device to be in certain (active) power states.
>
>>> I agree it might be best to not tie exposing the node to
>>> pci_enable_device() having been called. Is _PRT only evaluated as
>>> part of acpi_pci_irq_enable()? (or pci_enable_device()).
>>
>> Yes. AFAICT, acpi_pci_irq_enable() is the only path that evaluates
>> _PRT (except for a debugger interface). I don't think it *needs* to
>> be that way, and the fact that we do it per-device like that means we
>> evaluate _PRT many times even though I think the results never change.
>>
>> I could imagine evaluating _PRT once as part of enumerating a PCI host
>> bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
>> comment), but that looks like a fair bit of work to implement. And of
>> course it doesn't really affect the question of how to expose the
>> result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
>> a possible location.
>
> So you suggest exposing the GSI as part of the PCI host bridge? I'm
> afraid I'm not following how we could then map PCI SBDFs from devices
> to their assigned GSI.
>
> Thanks, Roger.

--
Best regards,
Jiqian Chen.