Re: [PATCH v2] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt

From: Stephen Boyd
Date: Tue Mar 17 2020 - 17:36:11 EST


Quoting Douglas Anderson (2020-03-17 13:37:06)
> Every once in a while (like once in a few months on a device) people
> have seen warnings on devices using spi-geni-qcom like:
>
> irq ...: nobody cared (try booting with the "irqpoll" option)
>
> ...where the interrupt number listed matches up with "spi_geni" in
> /proc/interrupts.
>
> You can get "nobody cared" if the interrupt handler returns IRQ_NONE.
> Once you get into this state the driver will just stop working.
>
> Auditing the code makes me believe that it's probably not so good
> checking "cur_mcmd" in the interrupt handler not under spinlock.
> Let's move it to be under spinlock. Looking more closely at the code,
> it looks as if there are some other places that could be under
> spinlock, so let's add. It looks as if the original code was assuming
> that by the time that the interrupt handler got called that the write
> to "cur_mcmd" (to make it not CMD_NONE anymore) would make it to the
> processor handling the interrupt. Perhaps with weakly ordered memory
> this sometimes (though very rarely) happened.
>
> Let's also add a warning (with the IRQ status) in the case that we
> ever end up getting an interrupt when we think we shouldn't. This
> would give us a better chance to debug if this patch doesn't help the
> issue. We'll also try our best to clear the interrupt to hopefully
> get us out of this state.
>
> Patch is marked "speculative" because I have no way to reproduce this
> problem, so I'm just hoping this fixes it. Weakly ordered memory
> makes my brain hurt.

It could be that. It could also be the poor design of geni_se_init() and
how it enables many interrupts that this driver doesn't look to handle?
Why do we allow the wrapper to enable all those bits in
M_COMMON_GENI_M_IRQ_EN and S_COMMON_GENI_S_IRQ_EN if nobody is going to
handle them?

>
> Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> ---
>
> Changes in v2:
> - Detect true spurious interrupt.
> - Still return IRQ_NONE for state machine mismatch, but print warn.
>
> drivers/spi/spi-geni-qcom.c | 35 ++++++++++++++++++++++++++++++-----
> 1 file changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index c3972424af71..92e51ccb427f 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
> struct geni_se *se = &mas->se;
> unsigned long time_left;
>
> - reinit_completion(&mas->xfer_done);
> pm_runtime_get_sync(mas->dev);
> if (!(slv->mode & SPI_CS_HIGH))
> set_flag = !set_flag;
>
> + spin_lock_irq(&mas->lock);

Why is this spin_lock_irq() vs. spin_lock_irqsave()? This isn't possible
to be called from somewhere that hasn't changed irq flags?

> + reinit_completion(&mas->xfer_done);
> +
> mas->cur_mcmd = CMD_CS;
> if (set_flag)
> geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
> else
> geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0);
> + spin_unlock_irq(&mas->lock);

This will force on interrupts if they were masked.

>
> time_left = wait_for_completion_timeout(&mas->xfer_done, HZ);
> if (!time_left)
> @@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
> u32 spi_tx_cfg, len;
> struct geni_se *se = &mas->se;
>
> + spin_lock_irq(&mas->lock);
> +
> spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
> if (xfer->bits_per_word != mas->cur_bits_per_word) {
> spi_setup_word_len(mas, mode, xfer->bits_per_word);
> @@ -376,6 +381,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
> */
> if (m_cmd & SPI_TX_ONLY)
> writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
> +
> + spin_unlock_irq(&mas->lock);
> }
>
> static int spi_geni_transfer_one(struct spi_master *spi,
> @@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
> struct geni_se *se = &mas->se;
> u32 m_irq;
> unsigned long flags;
> -
> - if (mas->cur_mcmd == CMD_NONE)
> - return IRQ_NONE;
> + irqreturn_t ret = IRQ_HANDLED;
>
> spin_lock_irqsave(&mas->lock, flags);
> m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

Does this read need to be inside the lock?

>
> + /* Check for spurious interrupt */
> + if (!m_irq) {
> + ret = IRQ_NONE;
> + goto exit;

I ask because it could be simplified by reading the status and then
immediately returning IRQ_NONE if no bits are set without having to do
the lock/unlock dance. And does writing 0 to the irq clear register do
anything?

> + }
> +
> + /*
> + * If we got a real interrupt but software state machine thinks
> + * we were idle then give a warning. We'll do our best to clear
> + * the interrupt but we'll still return IRQ_NONE. If this keeps
> + * happening the system will eventually disable us.
> + */
> + if (mas->cur_mcmd == CMD_NONE) {
> + pr_warn("Unexpected SPI interrupt: %#010x\n", m_irq);
> + ret = IRQ_NONE;
> + goto exit;
> + }
> +
> if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN))
> geni_spi_handle_rx(mas);
>
> @@ -523,9 +546,11 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
> complete(&mas->xfer_done);
> }
>
> +exit:
> writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
> spin_unlock_irqrestore(&mas->lock, flags);
> - return IRQ_HANDLED;
> +
> + return ret;
> }
>
> static int spi_geni_probe(struct platform_device *pdev)