Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

From: Chen, Jiqian
Date: Tue Dec 05 2023 - 04:39:36 EST



On 2023/12/5 17:19, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>
>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>> when thay are enabled.
>>>>>>>>>
>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>> mapped into a domU.
>>>>>>>>>
>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>> unmasked.
>>>>>>>>>
>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>> hvm_dpci_eoi(d, gsi);
>>>>>>>> }
>>>>>>>>
>>>>>>>> - if ( is_hardware_domain(d) && unmasked )
>>>>>>>> + if ( is_hardware_domain(d) )
>>>>>>>> {
>>>>>>>> /*
>>>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>
>>>>>>> There are some issues with this approach.
>>>>>>>
>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>> to assert that the configuration is the intended one. A guest is
>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>
>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>
>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>
>>>>>> That could work.
>>>>>>
>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>
>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>> parameter would be a GSI instead of a previously mapped IRQ). Such
>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>> this syntax regardless of the domain type it's called from.
>>>>
>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>> we don't need to wait until polarity and trigger are set appropriately
>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>
>>> The logic is certainly not straightforward, and it could benefit from
>>> some comments.
>>>
>>> The irq permissions are a bit special, in that they get setup when the
>>> IRQ is mapped.
>>>
>>> The problem however is not so much with IRQ permissions, that we can
>>> indeed sort out internally in Xen. Such check in dom0 has the side
>>> effect of preventing the IRQ from being assigned to a domU without the
>>> hardware source being properly configured AFAICT.
>>
>> Now I understand why you made a comment previously about Xen having to
>> configure trigger and polarity for these interrupts on its own.
>>
>>
>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>> polarity to be configured to work. But the suggestion of doing it a
>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>
>>>> But maybe we can find another location, maybe within
>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>
>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>> already been done.
>>>
>>> But then we would end up in a situation where the
>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>> configured, which I think it's not what we want.
>>>
>>> One option would be to allow mp_register_gsi() to be called multiple
>>> times, and update the IO-APIC pin configuration as long as the pin is
>>> not unmasked. That would propagate each dom0 RTE update to the
>>> underlying IO-APIC. However such approach relies on dom0 configuring
>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>> think it's not a very reliable option.
>>>
>>> Another option would be to modify the toolstack to setup the GSI
>>> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
>>> email, since we only care about PCI device passthrough the legacy INTx
>>> should always be level triggered and low polarity.
>>>
>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>> thinking:
>>>>
>>>>
>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>> index 4e40d3609a..16d56fe851 100644
>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>> return ret;
>>>> }
>>>>
>>>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> - if ( ret )
>>>> - {
>>>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> - gsi, ret);
>>>> - return ret;
>>>> - }
>>>> -
>>>> pcidevs_lock();
>>>> ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>> if ( ret )
>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>> hvm_dpci_eoi(d, gsi);
>>>> }
>>>>
>>>> + if ( is_hardware_domain(d) )
>>>> + {
>>>> + int pirq = gsi, ret;
>>>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> + if ( ret )
>>>> + {
>>>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> + gsi, ret);
>>>> + return ret;
>>>> + }
>>>> + }
>>>> if ( is_hardware_domain(d) && unmasked )
>>>> {
>>>> /*
>>>
>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>> of likely each IO-APIC pin, which is IMO not quite reliable. In there
>>> are two different issues here that need to be fixed for PVH dom0:
>>>
>>> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>> succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>
>> Yes makes sense
>>
>>
>>> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>> the IO-APIC pin itself.
>>>
>>> First one needs to be fixed internally in Xen, second one will require
>>> the toolstack to issue an extra hypercall in order to ensure the
>>> IO-APIC pin is properly configured.
>>
>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>> correctly. Xen configures them all on its own at boot based on Device
>> Tree information. I guess it is not possible to do the same on x86?
>
> No, not exactly. There's some interrupt information in the ACPI MADT,
> but that's just for very specific sources (Interrupt Source Override
> Structures)
>
> Then on AML devices can have resource descriptors that contain
> information about how interrupts are setup. However Xen is not able
> to read any of this information on AML.
>
> Legacy PCI interrupts are (always?) level triggered and low polarity,
> because it's assumed that an interrupt source can be shared between
> multiple devices.
>
> I'm however not able to find any reference to this in the PCI spec,
> hence I'm reluctant to take this for granted in Xen, and default all
> GSIs >= 16 to such mode.
>
> OTOH legacy PCI interrupts are not that used anymore, as almost all
> devices will support MSI(-X) (because PCIe mandates it) and OSes
> should prefer the latter. SR-IOV VF don't even support legacy PCI
> interrupts anymore.
>
>> If
>> not, then I can see why we would need 1 extra toolstack hypercall for
>> that (or to bundle the operation of configuring IO-APIC pins together
>> with an existing toolstack hypercall).
>
> One suitable compromise would be to default unconfigured GSIs >= 16 to
> level-triggered and low-polarity, as I would expect that to work in
> almost all cases. We can always introduce the usage of
> PHYSDEVOP_setup_gsi later if required.
>
> Maybe Jan has more input here, would you agree to defaulting non-ISA
> GSIs to level-triggered, low-polarity in the absence of a specific
> setup provided by dom0?
>
> Thanks, Roger.

No intention to disturb if I am incorrect, just a little input. On dom0 PVH, when it enables devices, it will call acpi_pci_irq_enable, and in that function, its default trigger is level and polarity is low for pci interrupt.

--
Best regards,
Jiqian Chen.