Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr

From: Doug Anderson
Date: Thu Dec 10 2020 - 12:15:31 EST


Hi,

On Wed, Dec 9, 2020 at 7:17 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
>
> Quoting Doug Anderson (2020-12-03 08:40:46)
>
> > I would guess that if "mas->cur_xfer" is NULL then
> > geni_spi_handle_rx() should read all data in the FIFO and throw it
> > away and geni_spi_handle_tx() should set SE_GENI_TX_WATERMARK_REG to
> > 0. NOTE: I _think_ that with the synchronize_irq() I'm suggesting
> > above we'll avoid this case, but it never hurts to be defensive.
> >
> >
> > Does that all make sense? So the summary is that instead of your patch:
>
> Can we get a CPU diagram describing the race and scenario where this
> happens? Something like:
>
> CPU0 CPU1
> ---- ----
> setup_fifo_xfer()
> spin_lock_irq(&mas->lock);
> spin_unlock_irq(&mas->lock);
> mas->cur_xfer = xfer
> ...
> <IRQ>
> geni_spi_isr()
> geni_spi_handle_rx()
> <NULL deref boom explosion!>
>
> But obviously this example diagram is incorrect and some timeout happens
> instead? Sorry, I'm super lazy and don't want to read many paragraphs of
> text. :) I'd rather have a diagram like above that clearly points out
> the steps taken to the NULL pointer deref.

This is my untested belief of what's happening

CPU0 CPU1
---- ----
setup_fifo_xfer()
...
geni_se_setup_m_cmd()
<hardware starts transfer>
<unrelated interrupt storm> spin_unlock_irq()
<continued interrupt storm> <time passes>
<continued interrupt storm> <transfer complets in hardware>
<continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN>
<continued interrupt storm> <time passes>
<continued interrupt storm> handle_fifo_timeout()
<continued interrupt storm> spin_lock_irq()
<continued interrupt storm> mas->cur_xfer = NULL
<continued interrupt storm> geni_se_cancel_m_cmd()
<continued interrupt storm> spin_unlock_irq()
<continued interrupt storm> wait_for_completion_timeout() => timeout
<continued interrupt storm> spin_lock_irq()
<continued interrupt storm> geni_se_abort_m_cmd()
<continued interrupt storm> spin_unlock_irq()
<continued interrupt storm> wait_for_completion_timeout() => timeout
<interrupt storm ends>
geni_spi_isr()
spin_lock()
if (m_irq & M_RX_FIFO_WATERMARK_EN)
geni_spi_handle_rx()
mas->cur_xfer NULL derefrence

With my proposed fix, I believe that would transform into:

CPU0 CPU1
---- ----
setup_fifo_xfer()
...
geni_se_setup_m_cmd()
<hardware starts transfer>
<unrelated interrupt storm> spin_unlock_irq()
<continued interrupt storm> <time passes>
<continued interrupt storm> <transfer complets in hardware>
<continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN>
<continued interrupt storm> <time passes>
<continued interrupt storm> handle_fifo_timeout()
<continued interrupt storm> synchronize_irq()
<continued interrupt storm> <time passes>
<interrupt storm ends>
geni_spi_isr()
...
<synchronize_irq() finishes>
spin_lock_irq()
mas->cur_xfer = NULL
geni_se_cancel_m_cmd()
spin_unlock_irq()
geni_spi_isr()
...
wait_for_completion_timeout() => success

The extra synchronize_irq() I was suggesting at the end of the
function would be an extra bit of paranoia. Maybe a new storm showed
up while we were processing the timeout?


> > 1. Add synchronize_irq() at the start and end of
> > handle_fifo_timeout(). Not under lock.
> >
> > 2. In geni_spi_handle_rx(), check for NULL "mas->cur_xfer". Read all
> > data in the FIFO (don't cap at rx_rem_bytes), but throw it away.
> >
> > 3. In geni_spi_handle_tx(), check for NULL "mas->cur_xfer". Don't
> > write any data. Just write 0 to SE_GENI_TX_WATERMARK_REG.
> >
> > I think #1 is the real fix, but #2 and #3 will avoid crashes in case
> > there's another bug somewhere.
>
> Aren't 2 and 3 papering over some weird problem though where irqs are
> coming in unexpectedly?

I think that's what I said but in different words? #1 is the real fix
but #2 and #3 will keep us from crashing (AKA paper over) if we have
some other (unexpected) bug. We'll already have an error in the log
in this case "Failed to cancel/abort m_cmd" so it doesn't feel
necessary to crash with a NULL dereference...

-Doug