Re: fsl_espi errors on v5.7.15

From: Heiner Kallweit
Date: Tue Aug 25 2020 - 03:22:58 EST


On 25.08.2020 05:54, Chris Packham wrote:
>
> On 25/08/20 10:04 am, Chris Packham wrote:
>>
>> On 20/08/20 9:08 am, Chris Packham wrote:
>>>
>>> On 19/08/20 6:15 pm, Heiner Kallweit wrote:
>>>> On 19.08.2020 00:44, Chris Packham wrote:
>>>>> Hi Again,
>>>>>
>>>>> On 17/08/20 9:09 am, Chris Packham wrote:
>>>>>
>>>>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>>>>> based system to linux v5.7.15
>>>>>>>>
>>>>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>>>>
>>>>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>>>>
>>>>>>>> # fw_setenv foo=1
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> ...
>>>>>>>>
>>>>>>> This error reporting doesn't exist yet in 4.4. So you may have an
>>>>>>> issue
>>>>>>> under 4.4 too, it's just not reported.
>>>>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>>>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>>>>> If I run fw_printenv (before getting it into a bad state) it is
>>>>>>>> able to
>>>>>>>> display the content of the boards u-boot environment.
>>>>>>>>
>>>>>>> This might indicate an issue with spi being locked. I've seen
>>>>>>> related
>>>>>>> questions, just use the search engine of your choice and check for
>>>>>>> fw_setenv and locked.
>>>>>> I'm running a version of fw_setenv which includes
>>>>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it
>>>>>> shouldn't
>>>>>> be locking things unnecessarily.
>>>>>>>> If been unsuccessful in producing a setup for bisecting the
>>>>>>>> issue. I do
>>>>>>>> know the issue doesn't occur on the old 4.4.x based kernel but
>>>>>>>> that's
>>>>>>>> probably not much help.
>>>>>>>>
>>>>>>>> Any pointers on what the issue (and/or solution) might be.
>>>>> I finally managed to get our board running with a vanilla kernel. With
>>>>> corenet64_smp_defconfig I occasionally see
>>>>>
>>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>
>>>>> other than the message things seem to be working.
>>>>>
>>>>> With a custom defconfig I see
>>>>>
>>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>> empty!
>>>>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>     ...
>>>>>
>>>>> and access to the spi-nor does not work until the board is reset.
>>>>>
>>>>> I'll try and pick apart the differences between the two defconfigs.
>>>
>>> I now think my earlier testing is invalid. I have seen the problem
>>> with either defconfig if I try hard enough. I had convinced myself
>>> that the problem was CONFIG_PREEMPT but that was before I found
>>> boot-to-boot differences with the same kernel.
>>>
>>> It's possible that I'm chasing multiple issues with the same symptom.
>>>
>>> The error I'm most concerned with is in the sequence
>>> 1. boot with old image
>>> 2. write environment
>>> 3. boot with new image
>>> 4. write environment
>>> 5. write fails and environment is corrupted
>>>
>>> After I recover the system things sometimes seem fine. Until I repeat
>>> the sequence above.
>>>
>>>> Also relevant may be:
>>>> - Which dts are you using?
>>> Custom but based heavily on the t2080rdb.
>>>> - What's the spi-nor type, and at which frequency are you operating it?
>>> The board has several alternate parts for the spi-nor so the dts just
>>> specifies compatible = "jedec,spi-nor" the actual chip detected on
>>> the board I have is "n25q032a (4096 Kbytes)". The dts sets
>>> spi-max-frequency = <10000000> I haven't measured the actual
>>> frequency on the bus.
>>>> - Does the issue still happen if you lower the frequency?
>>> I did play around with the frequency initially but I should probably
>>> give that another go now that I have a better reproduction method.
>>
>> Playing around with the frequency didn't help.
>>
>> One thing that I've found is that the problem appears to be that I end
>> up with extra bytes in the RX FIFO. If I add code to drain the RX FIFO
>> then the system is able to keep accessing the spi-nor (albeit with
>> some noisy logs).
>
> I've been staring at spi-fsl-espi.c for while now and I think I've
> identified a couple of deficiencies that may or may not be related to my
> issue.
>
> First I think the 'Transfer done but SPIE_DON isn't set' message can be
> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
> We also write back to it to clear the current events. We re-read it in
> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
> naturally end up in that situation if we're doing a large read. Consider
> the messages for reading a block of data from a spi-nor chip
>
>  tx = READ_OP + ADDR
>  rx = data
>
> We setup the transfer and pump out the tx_buf. The first interrupt goes
> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
> continues until we've received all the data and we finish with ESPI_SPIE
> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
> isn't set.
>
> The other deficiency is that we only get an interrupt when the amount of
> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
> FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.
>
SPIM_DON will trigger an interrupt once the last characters have been
transferred, and read the remaining characters from the FIFO.

> I think the reason I'm seeing some variability is because of how fast
> (or slow) the interrupts get processed and how fast the spi-nor chip can
> fill the CPUs rx fifo.
>
To rule out timing issues at high bus frequencies I initially asked
for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
or even less, then timing shouldn't be an issue.

Last relevant functional changes have been done almost 4 years ago.
And yours is the first such report I see. So question is what could be so
special with your setup that it seems you're the only one being affected.
The scenarios you describe are standard, therefore much more people
should be affected in case of a driver bug.

You said that kernel config impacts how frequently the issue happens.
Therefore question is what's the diff in kernel config, and how could
the differences be related to SPI.