Re: [EXTERNAL] Re: [PATCH] x86/ioapic: Don't return 0 as valid virq
From: Saurabh Singh Sengar
Date: Fri Mar 24 2023 - 12:24:12 EST
On Fri, Mar 24, 2023 at 04:39:07PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 14 2023 at 10:23, Saurabh Singh Sengar wrote:
> >> This should be added to your commit message: what guest VM is that and
> >> why should the kernel support it.
> >
> > Guest VM is a linux VM running as child partition on Hyper-V. Hyper-v Linux
> > documentation is in Documentation/virt/hyperv/.
> >
> > In commit I wanted to mention that any system which is not registering
> > IO-APIC will have this issue. But I am fine to mention specifically
> > about the issue I am facing. As part of your next comment, I have
> > explained the issue in detail if that is good, I can put that as
> > commit message.
> >>
> >> Why doesn't it need an IO-APIC and why does the current code need to be
> >> changed just for your guest VM?
> >
> > For Hyper-V Virtual Machines, few platforms don't have any devices to be
> > hooked to IO-APIC. Although it has Hyper-V based MSI over VMBus which
> > assigns interrupts to PCIe devices. In such platforms IO-APIC is not
> > registered which causes gsi_top value to remain at 0 and not get properly
> > assigned. Moreover, due to the inability to disable CONFIG_X86_IO_APIC
> > flag, the io-apic code still gets compiled. Thus, arch_dynirq_lower_bound
> > function in io_apic.c decides the lower bound of irq numbers based on gsi_top.
> >
> > Later when PCIe-MSI attempts to allocate interrupts, it gets 0 as the first
> > virq number because gsi_top is still 0. 0 being invalid virq is ignored by
> > MSI irq domain and results allocation of the same PCIe MSI twice.
> >
> > CPU0 CPU1
> > 0: 2 0 Hyper-V PCIe MSI 1073741824-edge
> > 1: 69 0 Hyper-V PCIe MSI 1073741824-edge nvme0q0
> >
> > To avoid this issue, if IO-APIC and gsi_top are not initialized, return the
> > hint value passed as 'from' value to arch_dynirq_lower_bound instead of 0.
> > This will also be identical to the behaviour of weak arch_dynirq_lower_bound
> > function defined in kernel/softirq.c.
>
> I find this mightly confusing. Something like this perhaps:
>
> Subject: x86/ioapic: Don't return 0 from arch_dynirq_lower_bound()
>
> arch_dynirq_lower_bound() is invoked by the core interrupt code to
> retrieve the lowest possible Linux interrupt number for dynamically
> allocated interrupts like MSI.
>
> The x86 implementation uses this to exclude the IO/APIC GSI space.
> This works correctly as long as there is an IO/APIC registered, but
> returns 0 if not. This has been observed in VMs where the BIOS does
> not advertise an IO/APIC.
>
> 0 is an invalid interrupt number except for the legacy timer interrupt
> on x86. The return value is unchecked in the core code, so it ends up
> to allocate interrupt number 0 which is subsequently considered to be
> invalid by the caller, e.g. the MSI allocation code.
>
> The function has already a check for 0 in the case that an IO/APIC is
> registered, but ioapic_dynirq_base is 0 in case of device tree setups.
>
> Consolidate this and zero check for both ioapic_dynirq_base and gsi_top,
> which is used in the case that no IO/APIC is registered.
>
> And then make the code to look like the below, which makes it very
> clear what this is about.
>
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2477,17 +2477,22 @@ static int io_apic_get_redir_entries(int
>
> unsigned int arch_dynirq_lower_bound(unsigned int from)
> {
> + unsigned int ret;
> +
> /*
> * dmar_alloc_hwirq() may be called before setup_IO_APIC(), so use
> * gsi_top if ioapic_dynirq_base hasn't been initialized yet.
> */
> - if (!ioapic_initialized)
> - return gsi_top;
> + ret = ioapic_dynirq_base ? : gsi_top;
> +
> /*
> - * For DT enabled machines ioapic_dynirq_base is irrelevant and not
> - * updated. So simply return @from if ioapic_dynirq_base == 0.
> + * For DT enabled machines ioapic_dynirq_base is irrelevant and
> + * always 0. gsi_top can be 0 if there is no IO/APIC registered.
> + *
> + * 0 is an invalid interrupt number for dynamic allocations. Return
> + * @from instead.
> */
> - return ioapic_dynirq_base ? : from;
> + return ret ? : from;
> }
>
> #ifdef CONFIG_X86_32
>
Thanks you for your valuable suggestions. Commit message and code looks
much better now. I will send the V2 with your "Co-Developed-by" tag, I
hope its fine with you.
Regards,
Saurabh