Re: [PATCH V2] x86/apic/msi: Plug non-maskable MSI affinity race
From: Evan Green
Date: Fri Jan 31 2020 - 15:33:21 EST
On Fri, Jan 31, 2020 at 6:27 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:
>
> Evan tracked down a subtle race between the update of the MSI message and
> the device raising an interrupt internally on PCI devices which do not
> support MSI masking. The update of the MSI message is non-atomic and
> consists of either 2 or 3 sequential 32bit wide writes to the PCI config
> space.
>
> - Write address low 32bits
> - Write address high 32bits (If supported by device)
> - Write data
>
> When an interrupt is migrated then both address and data might change, so
> the kernel attempts to mask the MSI interrupt first. But MSI masking is
> optional, so there exist devices which do not provide it. That means that
> if the device raises an interrupt internally between the writes, an MSI
> message is sent which is built from half updated state.
>
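To make the half-updated state concrete, here is a rough userspace model of the write sequence described above. It assumes the plain x86 MSI layout without interrupt remapping (destination APIC ID in bits 19:12 of the low address word, vector in bits 7:0 of the data word). All names (msi_compose, msi_sample_during_update, ...) are made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified MSI message as held in the device's config space. */
struct msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

#define MSI_ADDR_BASE 0xfee00000u

/* Hypothetical helper: build a message for (destination APIC ID, vector). */
static struct msi_msg msi_compose(unsigned int dest_apicid, unsigned int vector)
{
	struct msi_msg m;

	assert(dest_apicid <= 0xff);	/* 8-bit destination in this model */
	assert(vector <= 0xff);
	m.address_lo = MSI_ADDR_BASE | ((uint32_t)dest_apicid << 12);
	m.address_hi = 0;
	m.data = (uint32_t)vector;
	return m;
}

/*
 * Model of the non-atomic update: 'writes_done' of the config space writes
 * (address low, address high, data, in that order) have landed when the
 * device samples its registers to build a message.
 */
static struct msi_msg msi_sample_during_update(struct msi_msg prev,
					       struct msi_msg next,
					       int writes_done)
{
	struct msi_msg seen = prev;

	assert(writes_done >= 0 && writes_done <= 3);
	if (writes_done >= 1)
		seen.address_lo = next.address_lo;
	if (writes_done >= 2)
		seen.address_hi = next.address_hi;
	if (writes_done >= 3)
		seen.data = next.data;
	return seen;
}

static unsigned int msi_dest(struct msi_msg m)
{
	return (m.address_lo >> 12) & 0xff;
}

static unsigned int msi_vector(struct msi_msg m)
{
	return m.data & 0xff;
}
```

Moving from (CPU 0, vector 0x31) to (CPU 2, vector 0x45) and sampling after the first write gives a message for CPU 2 with the old vector 0x31, which is the wrong-vector delivery described in the next paragraph.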
> On x86 this can lead to spurious interrupts on the wrong interrupt
> vector when the affinity setting changes both address and data. As a
> consequence the device interrupt can be lost causing the device to
> become stuck or malfunctioning.
>
> Evan tried to handle that by disabling MSI across an MSI message
> update. That's not feasible because disabling MSI has issues on its own:
>
> If MSI is disabled the PCI device is routing an interrupt to the legacy
> INTx mechanism. The INTx delivery can be disabled, but the disablement is
> not working on all devices.
>
> Some devices lose interrupts when both MSI and INTx delivery are disabled.
>
> Another way to solve this would be to enforce the allocation of the same
> vector on all CPUs in the system for this kind of screwed devices. That
> could be done, but it would bring back the vector space exhaustion problems
> which got solved a few years ago.
>
> Fortunately the high address (if supported by the device) is only relevant
> when X2APIC is enabled which implies interrupt remapping. In the interrupt
> remapping case the affinity setting is happening at the interrupt remapping
> unit and the PCI MSI message is programmed only once when the PCI device is
> initialized.
>
> That makes it possible to solve it with a two step update:
>
> 1) Target the MSI msg to the new vector on the current target CPU
>
> 2) Target the MSI msg to the new vector on the new target CPU
>
> In both cases writing the MSI message is only changing a single 32bit word
> which prevents the issue of inconsistency.
>
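A small sketch of why each step touches only one 32bit word, again assuming the non-remapped x86 layout (destination in the low address word, vector in the data word) and a high address word that stays unchanged. The helper names are hypothetical and this is a model, not the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* Only the two words that can change in the non-remapped case. */
struct msi_msg {
	uint32_t address_lo;
	uint32_t data;
};

#define MSI_ADDR_BASE 0xfee00000u

/* Hypothetical helper: build a message for (destination APIC ID, vector). */
static struct msi_msg msi_compose(unsigned int dest_apicid, unsigned int vector)
{
	struct msi_msg m;

	assert(dest_apicid <= 0xff);
	assert(vector <= 0xff);
	m.address_lo = MSI_ADDR_BASE | ((uint32_t)dest_apicid << 12);
	m.data = (uint32_t)vector;
	return m;
}

/* Number of 32bit words that differ between two messages. */
static int msi_words_changed(struct msi_msg a, struct msi_msg b)
{
	return (a.address_lo != b.address_lo) + (a.data != b.data);
}

struct msi_two_step {
	struct msi_msg step1;	/* new vector, current CPU */
	struct msi_msg step2;	/* new vector, new CPU */
};

/* Plan the two intermediate messages for an affinity change. */
static struct msi_two_step msi_plan_update(unsigned int cur_cpu,
					   unsigned int new_cpu,
					   unsigned int new_vector)
{
	struct msi_two_step plan;

	plan.step1 = msi_compose(cur_cpu, new_vector);
	plan.step2 = msi_compose(new_cpu, new_vector);
	return plan;
}
```

For a move from (CPU 0, vector 0x31) to (CPU 2, vector 0x45), step 1 differs from the old message only in the data word and step 2 differs from step 1 only in the low address word, whereas a direct update would change both.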
> After writing the final destination it is necessary to check whether the
> device issued an interrupt while the intermediate state #1 (new vector,
> current CPU) was in effect.
>
> This is possible because the affinity change is always happening on the
> current target CPU. The code runs with interrupts disabled, so the
> interrupt can be detected by checking the IRR of the local APIC. If the
> vector is pending in the IRR then the interrupt is retriggered on the new
> target CPU by sending an IPI for the associated vector on the target CPU.
>
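For readers following along, the IRR check boils down to a bit test: the local APIC exposes the 256 bit IRR as eight 32bit registers starting at offset 0x200, spaced 0x10 apart. The sketch below models that with a plain array in userspace. struct fake_lapic, fake_apic_read() and the recorded "IPI" are stand-ins I made up; the real code reads the APIC and sends a real IPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APIC_IRR	0x200u	/* offset of the first IRR register */
#define APIC_IRR_REGS	8u	/* 8 x 32 bits cover vectors 0..255 */

/* Stand-in for the local APIC of the CPU running the affinity change. */
struct fake_lapic {
	uint32_t irr[APIC_IRR_REGS];
	unsigned int ipis_sent;
	unsigned int last_ipi_cpu;
	unsigned int last_ipi_vector;
};

/* Hypothetical register read; only IRR offsets are modelled. */
static uint32_t fake_apic_read(const struct fake_lapic *apic, uint32_t reg)
{
	assert(reg >= APIC_IRR && (reg - APIC_IRR) % 0x10 == 0);
	assert((reg - APIC_IRR) / 0x10 < APIC_IRR_REGS);
	return apic->irr[(reg - APIC_IRR) / 0x10];
}

/* True if 'vector' is pending in the local IRR. */
static bool vector_pending_in_irr(const struct fake_lapic *apic,
				  unsigned int vector)
{
	uint32_t irr;

	assert(vector < 256);
	irr = fake_apic_read(apic, APIC_IRR + (vector / 32) * 0x10);
	return (irr & (1u << (vector % 32))) != 0;
}

/*
 * After the final MSI write: if the device fired while the intermediate
 * state was in effect, resend the vector to the new target CPU. The IPI
 * is only recorded here, not actually sent.
 */
static bool retrigger_if_pending(struct fake_lapic *local, unsigned int vector,
				 unsigned int new_cpu)
{
	if (!vector_pending_in_irr(local, vector))
		return false;
	local->ipis_sent++;
	local->last_ipi_cpu = new_cpu;
	local->last_ipi_vector = vector;
	return true;
}
```

Note that the check cannot tell which device set the bit, which is where the spurious interrupt cases listed below come from.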
> This can cause spurious interrupts on both the local and the new target
> CPU.
>
> 1) If the new vector is not in use on the local CPU and the device
> affected by the affinity change raised an interrupt during the
> transitional state (step #1 above) then interrupt entry code will
> ignore that spurious interrupt. The vector is marked so that the
> 'No irq handler for vector' warning is suppressed once.
>
> 2) If the new vector is in use already on the local CPU then the IRR check
> might see a pending interrupt from the device which is using this
> vector. The IPI to the new target CPU will then invoke the handler of
> the device, which got the affinity change, even if that device did not
> issue an interrupt.
>
> 3) If the new vector is in use already on the local CPU and the device
> affected by the affinity change raised an interrupt during the
> transitional state (step #1 above) then the handler of the device which
> uses that vector on the local CPU will be invoked.
>
> #1 is uninteresting and has no unintended side effects. #2 and #3 might
> expose issues in device driver interrupt handlers which are not prepared to
> handle a spurious interrupt correctly. This is not a regression, it's just
> exposing something which was already broken as spurious interrupts can
> happen for a lot of reasons and all driver handlers need to be able to deal
> with them.
>
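As an illustration of what "able to deal with them" usually means, here is a minimal userspace model of a handler that checks the device's own status before doing any work. The device, its status register and the enum are invented for the example; the two return values merely mirror the kernel's IRQ_NONE and IRQ_HANDLED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models IRQ_NONE / IRQ_HANDLED; names are hypothetical. */
enum irq_result {
	IRQ_RESULT_NONE = 0,
	IRQ_RESULT_HANDLED = 1,
};

/* Invented device: one status word, one bit per pending event source. */
struct fake_dev {
	uint32_t status;
	unsigned int events_processed;
};

/*
 * Handler that tolerates spurious invocations: if the device reports no
 * pending cause, do nothing and say so instead of touching device state.
 */
static enum irq_result fake_dev_irq_handler(struct fake_dev *dev)
{
	uint32_t pending;

	assert(dev != NULL);
	pending = dev->status;
	if (!pending)
		return IRQ_RESULT_NONE;		/* spurious, not for us */

	dev->status &= ~pending;		/* acknowledge what we saw */
	dev->events_processed++;
	return IRQ_RESULT_HANDLED;
}
```

A handler written this way is unaffected by cases #2 and #3: an extra invocation with nothing pending simply reports that it had nothing to do.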
> Reported-by: Evan Green <evgreen@xxxxxxxxxxxx>
> Debugged-by: Evan Green <evgreen@xxxxxxxxxxxx> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>

Heh, thanks for the credit. Something weird happened on this line with
your signoff, though.

I've been running this on my system for a few hours with no issues
(normal repro in <1 minute). So,

Tested-by: Evan Green <evgreen@xxxxxxxxxxxx>