Re: [PATCH v5 01/20] EDAC/synopsys: Fix ECC status data and IRQ disable race condition

From: Borislav Petkov
Date: Mon Apr 15 2024 - 14:36:54 EST


On Thu, Feb 22, 2024 at 09:12:46PM +0300, Serge Semin wrote:
> The race condition around the ECCCLR register access happens in the IRQ
> disable method called in the device remove() procedure and in the ECC IRQ
> handler:
> 1. Enable IRQ:
> a. ECCCLR = EN_CE | EN_UE
> 2. Disable IRQ:
> a. ECCCLR = 0
> 3. IRQ handler:
> a. ECCCLR = CLR_CE | CLR_CE_CNT | CLR_UE | CLR_UE_CNT
> b. ECCCLR = 0
> c. ECCCLR = EN_CE | EN_UE
> So if the IRQ disabling procedure is called concurrently with the IRQ
> handler method the IRQ might be actually left enabled due to the
> statement 3c.
>
> The root cause of the problem is that the ECCCLR register (which has
> been called ECCCTL since v3.10a) has intermixed ECC status data clear flags and
> the IRQ enable/disable flags. Thus the IRQ disabling (clear EN flags) and
> handling (write 1 to clear ECC status data) procedures must be serialised
> around the ECCCTL register modification to prevent the race.
>
> So fix the problem described above by adding the spin-lock around the
> ECCCLR modifications and preventing the IRQ-handler from modifying the
> IRQs enable flags (there is no point in disabling the IRQ and then
> re-enabling it again within a single IRQ handler call, see the statements
> 3a/3b and 3c above).
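
To make the interleaving above concrete, here is a small userspace model
of it. It is only a sketch, not driver code: the bit positions and all
function names are made up for illustration, a plain variable stands in
for the register, and it assumes the second pair of flags in step 3a is
meant to be the UE ones.

```c
/* Userspace model of the ECCCLR write sequence; not driver code. */
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this illustration. */
#define CLR_CE		(1u << 0)
#define CLR_UE		(1u << 1)
#define CLR_CE_CNT	(1u << 2)
#define CLR_UE_CNT	(1u << 3)
#define EN_CE		(1u << 4)
#define EN_UE		(1u << 5)

static uint32_t ecc_clr;	/* stands in for the ECCCLR register */

static void enable_irq_model(void)  { ecc_clr = EN_CE | EN_UE; }	/* 1a */
static void disable_irq_model(void) { ecc_clr = 0; }			/* 2a */

/* The handler split into its three writes so they can be interleaved. */
static void handler_3a(void)
{
	ecc_clr = CLR_CE | CLR_CE_CNT | CLR_UE | CLR_UE_CNT;
}
static void handler_3b(void) { ecc_clr = 0; }
static void handler_3c(void) { ecc_clr = EN_CE | EN_UE; }

/* Replays the racy ordering and returns the final register value. */
uint32_t replay_race(void)
{
	enable_irq_model();	/* IRQs on */
	handler_3a();		/* an IRQ fires, the handler starts */
	handler_3b();
	disable_irq_model();	/* remove() runs concurrently */
	assert(ecc_clr == 0);	/* remove() believes IRQs are off now */
	handler_3c();		/* handler finishes and re-enables them */
	return ecc_clr;
}
```

In this model replay_race() returns EN_CE | EN_UE, i.e. the enable bits
are set again although the disable path has already run.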

So I'm looking at the code and at this patch and wondering how we even
ended up in this mess?!

An interrupt handler should not *enable* the interrupt again - that's
just crazy. And I should've seen that in

4bcffe941758 ("EDAC/synopsys: Re-enable the error interrupts on v3 hw")

and stopped it right there. But well, it is what it is...

So I'd like to see the following flow:

* on init, the interrupt is enabled with enable_intr() *after*
  registering the interrupt handler.

* on exit, the interrupt is disabled with disable_intr() and then no
  interrupts are coming in anymore.
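
Roughly, the ordering I mean could be sketched like this. It is a
single-threaded userspace model, not a patch: struct ecc_dev_model and
the model_*() helpers are hypothetical, only the enable_intr() and
disable_intr() names come from the driver, and the enable state is kept
in its own field, so it says nothing about how the handler clears status
on the shared register or about a handler still running on another CPU.

```c
/* Userspace model of the init/exit ordering; not driver code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ecc_dev_model {
	bool irq_enabled;	/* models the EN_CE | EN_UE bits */
	void (*handler)(struct ecc_dev_model *d);
	unsigned int handled;	/* number of events the handler saw */
};

static void enable_intr(struct ecc_dev_model *d)
{
	assert(d->handler);	/* never enable before a handler exists */
	d->irq_enabled = true;
}

static void disable_intr(struct ecc_dev_model *d)
{
	d->irq_enabled = false;
}

static void intr_handler(struct ecc_dev_model *d)
{
	/* Account for the error only; leave the enable state alone. */
	d->handled++;
}

void model_init(struct ecc_dev_model *d)
{
	d->irq_enabled = false;
	d->handled = 0;
	d->handler = intr_handler;	/* register the handler first */
	enable_intr(d);			/* then let IRQs through */
}

void model_exit(struct ecc_dev_model *d)
{
	disable_intr(d);		/* no new IRQs after this point */
	d->handler = NULL;		/* only then drop the handler */
}

/* Delivers one ECC event; returns true if the handler ran. */
bool model_raise(struct ecc_dev_model *d)
{
	if (!d->irq_enabled || !d->handler)
		return false;
	d->handler(d);
	return true;
}
```

The point of the model is only that the enable state changes in exactly
two places, model_init() and model_exit(), and never in the handler.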

And then I don't think you'll need the spinlock and it'll be a sane
design.

Right?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette