Re: fsl_espi errors on v5.7.15
From: Chris Packham
Date: Tue Sep 01 2020 - 19:31:55 EST
On 2/09/20 11:29 am, Chris Packham wrote:
>
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
>> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:59, Chris Packham wrote:
>>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>>>> <snip>
>>>>>>>
>>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the
>>>>>>>>>>>>> T2080RDB
>>>>>>>>>>>>>
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>>
>>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra
>>>>>>>>>>>>> byte in the
>>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a
>>>>>>>>>>>>> READ_FSR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>>
>>>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got
>>>>>>>>>>>>> access to it is
>>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea
>>>>>>>>>>>>> why it
>>>>>>>>>>>>> happens some times and not others.
>>>>>>>>>>>> So I think I've got a reproduction and I think I've
>>>>>>>>>>>> bisected the problem
>>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft
>>>>>>>>>>>> interrupt replay in
>>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too
>>>>>>>>>>>> much scrutiny
>>>>>>>>>>>> to this result. Given the various rabbit holes I've been
>>>>>>>>>>>> down on this
>>>>>>>>>>>> issue already I'd take this information with a good degree
>>>>>>>>>>>> of skepticism.
>>>>>>>>>>>>
>>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>>>> It doesn't have yet the change you're referring to, and the
>>>>>>>>>>> fsl-espi driver
>>>>>>>>>>> is basically the same as in 5.7 (just two small changes in
>>>>>>>>>>> 5.7).
>>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related
>>>>>>>>>> patches
>>>>>>>>>> around this time that could affect book E, so it's good if
>>>>>>>>>> that exact
>>>>>>>>>> patch is confirmed.
>>>>>>>>> My confirmation is basically that I can induce the issue in a
>>>>>>>>> 5.4 kernel
>>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the
>>>>>>>>> issue in
>>>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>>>
>>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I
>>>>>>>>> also
>>>>>>>>> confirmed the bisection result by building at 3282a3da25bd
>>>>>>>>> (which sees
>>>>>>>>> the issue) and the commit just before (which does not).
>>>>>>>> Thanks for testing, that confirms it well.
>>>>>>>>
>>>>>>>> [snip patch]
>>>>>>>>
>>>>>>>>> I still saw the issue with this change applied.
>>>>>>>>> PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>>>> didn't report anything (either with or without the change above).
>>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>>>> else has changed.
>>>>>>>>
>>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>>>> higher interrupt latency?
>>>>>>>>
>>>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>>>> (it's supposed to be a bit better if anything because it
>>>>>>>> doesn't set
>>>>>>>> up the full interrupt frame). But it's possible.
>>>>>>> My working theory is that the SPI_DON indication is all about
>>>>>>> the TX
>>>>>>> direction an now that the interrupts are faster we're hitting an
>>>>>>> error
>>>>>>> because there is still RX activity going on. Heiner disagrees
>>>>>>> with my
>>>>>>> interpretation of the SPI_DON indication and the fact that it
>>>>>>> doesn't
>>>>>>> happen every time does throw doubt on it.
>>>>>>>
>>>>>> It's right that the eSPI spec can be interpreted that SPI_DON
>>>>>> refers to
>>>>>> TX only. However this wouldn't really make sense, because also
>>>>>> for RX
>>>>>> we program the frame length, and therefore want to be notified
>>>>>> once the
>>>>>> full frame was received. Also practical experience shows that
>>>>>> SPI_DON
>>>>>> is set also after RX-only transfers.
>>>>>> Typical SPI NOR use case is that you write read command + start
>>>>>> address,
>>>>>> followed by a longer read. If the TX-only interpretation would be
>>>>>> right,
>>>>>> we'd always end up with SPI_DON not being set.
>>>>>>
>>>>>>> I can't really explain the extra RX byte in the fifo. We know
>>>>>>> how many
>>>>>>> bytes to expect and we pull that many from the fifo so it's not
>>>>>>> as if
>>>>>>> we're missing an interrupt causing us to skip the last byte.
>>>>>>> I've been
>>>>>>> looking for some kind of off-by-one calculation but again if it
>>>>>>> were
>>>>>>> something like that it'd happen all the time.
>>>>>>>
>>>>>> Maybe it helps to know what value this extra byte in the FIFO
>>>>>> has. Is it:
>>>>>> - a duplicate of the last read byte
>>>>>> - or the next byte (at <end address> + 1)
>>>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>>>> The values were up thread a bit but I'll repeat them here
>>>>>
>>>>> fsl_espi ffe110000.spi: tx 70
>>>>> fsl_espi ffe110000.spi: rx 03
>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>
>>>>>
>>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen
>>>>> them with
>>>>> either a READ_SR or a READ_FSR. Never a data read.
>>>>>
>>>> Just remembered something about SPIE_DON:
>>>> Transfers are always full duplex, therefore in case of a read the chip
>>>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>>>> that the last dummy zero was shifted out.
>>>>
>>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1
>>>> byte in.
>>>> So the issue may have a dependency on the length of the transfer.
>>>> However I see no good explanation so far. You can try adding a
>>>> delay of
>>>> a few miroseconds between the following to commands in
>>>> fsl_espi_bufs().
>>>>
>>>> fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>>>
>>>> /* Prevent filling the fifo from getting interrupted */
>>>> spin_lock_irq(&espi->lock);
>>>>
>>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too
>>>> close.
>>> I think this might be heading in the right direction. Playing about
>>> with
>>> a delay does seem to make the two symptoms less likely. Although I have
>>> to set it quite high (i.e. msleep(100)) to completely avoid any
>>> possibility of seeing either message.
>> The patch might replay the interrupt a little bit faster, but it would
>> be a few microseconds at most I think (just from improved code).
>>
>> Would you be able to ftrace the interrupt handler function and see if
>> you
>> can see a difference in number or timing of interrupts? I'm at a bit of
>> a loss.
> This is getting really weird. I was setting up to run with ftrace and
> found I couldn't reproduce it on the tip of Linus's tree (currently
> pointing at e7a522c83b86). But I swear I could last week. Sure enough
> if I checkout 5.9-rc2 (or 5.7.15) I can reproduce the problem again.
*Sigh* my master branch still has the interrupt changes reverted.