Re: [PATCH v2] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt
From: Doug Anderson
Date: Wed Mar 18 2020 - 16:10:39 EST
Hi,
On Wed, Mar 18, 2020 at 1:04 AM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
>
> > * Most of the cases I saw of the "nobody cared" for geni-spi was on a
> > mostly idle system (presumably still doing periodic SPI transactions
> > to the EC, though). It seems weird that one of these other interrupts
> > would suddenly fire. It seems more likely that we just happened to
> > win a race of some sort.
>
> Sure, I'm mostly interested in understanding what that race is. I think
> it is something like cur_mcmd being stale when it's tested in the irq
> handler because the IO access triggers the irq before mas->cur_mcmd can
> be updated due to out of order execution. Nothing stops
> spi_geni_set_cs() from being reordered like this, especially because
> there isn't any sort of barrier besides the compiler barrier of asm
> volatile inside geni_set_setup_m_cmd():
>
> CPU0 (device write overtakes CPU) CPU1
> ---- ----
> mas->cur_mcmd = CMD_NONE;
> geni_se_setup_m_cmd(...)
> geni_spi_isr()
> <tests cur_mcmd and it's still NONE>
> mas->cur_mcmd = CMD_CS;
This is my assumption about what must have happened, yes. I have no
proof, though.
> > > > + /* Check for spurious interrupt */
> > > > + if (!m_irq) {
> > > > + ret = IRQ_NONE;
> > > > + goto exit;
> > >
> > > I ask because it could be simplified by reading the status and then
> > > immediately returning IRQ_NONE if no bits are set without having to do
> > > the lock/unlock dance. And does writing 0 to the irq clear register do
> > > anything?
> >
> > Sure. I'll move it if you want. It felt nicer to just keep the whole
> > thing under lock so I didn't have to think about whether it mattered.
> > ...and anyone else looking at it didn't need to think about if it
> > mattered, too. It it is very easy to say that it doesn't _hurt_ to
> > have it under lock other than having one extra memory read under lock.
> > ...and presumably the case where the optimization matters is
> > incredibly rare (a spurious interrupt) and we just spent a whole lot
> > of cycles calling the interrupt handler to begin with for this
> > spurious interrupt.
> >
> > I would have a hard time believing that a write of 0 to a "write 1 to
> > clear" register would have any impact. It would be a pretty bad
> > hardware design...
>
> This is a writel to a device so it may take some time for the memory
> barrier to drain any other writes with the full memory barrier in there.
> Not sure if we care about getting super high performance here but that's
> a concern.
Right, but I guess the spurious interrupt case wasn't one I'm
optimizing for. ...and if it's a real problem we should figure out
how to avoid so many spurious interrupts. Presumably all the overhead
that the Linux kernel did to call our interrupt handler in the first
place is much more serious.
> > So I guess in summary, I'm not planning to spin for any of these
> > things unless you really insist or you say I'm wrong about something
> > above or someone else says my opinions are the wrong opinions.
> >
>
> Why do we need such large locking areas?
We probably don't, but it's unlikely to matter at all and the large
locking areas mean we don't have to spend time thinking about it.
It's super clear that the locking is safe.
In general there shouldn't be tons of downside to have large locking
areas because there shouldn't be contention for this lock. As you
said, much of this is synchronized at higher levels, so mostly we're
just worried about synchronizing with our interrupt handler and one
would hope the interrupt handler wasn't firing at times we don't
expect.
...but I guess there is the downside that local interrupts are fully
disabled for this whole large locking area and that does feel like
something that might be worth optimizing for...
In theory we _could_ just have a lock around updating 'cur_mcmd' and
kicking off the transfer. That would make sure that 'cur_mcmd' was
definitely updated by the time the interrupt fired. It feels at least
a little awkward to me, though, because there is more shared state
between the interrupt handler and the functions kicking off the
transfers and we're really just using the lock as a poor man's memory
barrier, but maybe it's OK. I can give a shot at this approach if it
feels viable to you.
> Why can't we remove the enum
> software tracking stuff and look at mas->cur_xfer to figure out if a cs
> change is happening or not? I suppose mas->cur_xfer could be stale, so
> we need to make sure that is modified and checked under a lock, and then
> mas->rx_rem_bytes/tx_rem_bytes also need to be under a lock because
> they're modified inside and outside the irq handler, but otherwise I
> don't see any need to hold the lock over the register reads/writes. Most
> other things are synchronized with the completions or in higher layers
> of the SPI core.
In general it's a big-ish change to code that's largely already tested
and that (IMO) it makes readability slightly worse. It can be worth
it to make big changes for big gain, but I don't see it here. It
might have a slight performance boost, but seems like premature
optimization since there's no evidence that the existing performance
(even after my big locking patch) is a problem.
If we want to go forward with your suggestion, I'd have to spend more
time thinking about the cancel/abort cases, especially, which I'm not
sure how to test super well. Specifically does cancel/abort cause the
"done" bit to be set? If so your new code will finalize the current
transfer, but the old code didn't. I have no idea if this matters.
You also removed the spinlock from handle_fifo_timeout() and I don't
know if (for instance) it was important to clear the TX watermark
register after firing off the abort command but before the interrupt
handler could run. It also doesn't necessarily appear that
reinit_completion() has any safety, so is there some chance that the
interrupt could fire and we'd still have the re-init pending on our
side, then clear "done" back to zero after the interrupt handler
completed us?
...one extra thing to think about with all this is that (I think)
handle_fifo_timeout() will be called if someone kicks off a SPI
transfer and then hits "Ctrl-C" partway. In this case
spi_transfer_wait() will return -EINTR and I think that will change
the chip select and then call "handle_err". I have no idea if that
works well today, but I have a feeling it will be harder to get right
without an extra state variable.
> I do wonder if we need to hold the lock again when we update the rx/tx
> remaining bytes counter but I can't convince myself that the irq can run
> on another CPU in parallel with the CPU that ran the irq first. That
> seems impossible given that the interrupt would have to come again and
> the irq controller would need to send it to a different CPU at the same
> time. We should only need to grab the lock to make sure that cur_xfer
> and remaining bytes is published by the CPU triggering the rx/tx
> transfers and then assume the irq handler code is synchronous with
> itself.
At least needs a comment?
...but even with a comment, what exactly are you gaining here by
making me and all future readers of the code think about this? One of
your main goals here is to eliminate the need to have the spin lock
locked for the whole of the IRQ handler, right? If we're in the
interrupt handler, local interrupts are already off, right? That
means that the only thing you can gain (in terms of speed) by
decreasing the number of lines "under lock" is that you could allow
someone else trying to get the same lock to run sooner. As you've
pointed out, there's already a lot of synchronization at higher
layers, and many of the arguments you make are assuming that the other
code isn't running anyway.
So this overall seems like trying to shorten the amount of time that
the lock is held just for the sheer principle of having spin locks
held for the fewest lines of code, but not for any real gain?
To get into micro-optimizations, too, I will also say that having fine
grained locking in the interrupt handler also will lead to more than
one call to spin_lock() per IRQ. I put traces in and I often saw "rx"
and "cmd_done" fall together and both separately lock in your code.
In general isn't reducing the number of calls to spin_lock() (which I
think has some overhead) more important?
> - if ((m_irq & M_CMD_CANCEL_EN) || (m_irq & M_CMD_ABORT_EN)) {
> - mas->cur_mcmd = CMD_NONE;
> + /* Wake up spi_geni_set_cs() or handle_fifo_timeout() */
> + if (cs_change || (m_irq & M_CMD_CANCEL_EN) || (m_irq & M_CMD_ABORT_EN))
> complete(&mas->xfer_done);
> - }
>
> writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
> - spin_unlock_irqrestore(&mas->lock, flags);
> +
It doesn't feel legit to release the spinlock before this writel().
Once you complete(), can't (in theory) another transfer kick off on
another CPU, and then cause an interrupt to fire? You'd be clearing
someone else's interrupt, perhaps? Maybe this actually happens in
handle_fifo_timeout() which (after your patch) waits for the
completion and then immediate (with no locking) fires off an abort.
As an aisde and unrelated to your change, I never like it when
interrupt handler clear their IRQ bits at the end anyway because it
always feels like a race. The only time you clear at the end is if
you have "latched level" interrupts (interrupts that are latched but
will continually re-assert unless the thing that caused them goes
away). I have no idea of these bits are "latched level", but it feels
unlikely.
NOTE: I didn't do a complete review of your patch because (I think) I
already found a few issues and am still not convinced it's a good use
of time to fully re-architect the control flow of this driver. If you
still think this is something we need to do, I will try to make time
to work through all the corner cases in my head and try my best to
make sure there are no weird memory-ordering issues, but I can't
promise I won't miss anything.
-Doug