Re: fsl_espi errors on v5.7.15
From: Chris Packham
Date: Tue Aug 25 2020 - 18:22:30 EST
On 25/08/20 7:22 pm, Heiner Kallweit wrote:
<snip>
> I've been staring at spi-fsl-espi.c for while now and I think I've
>> identified a couple of deficiencies that may or may not be related to my
>> issue.
>>
>> First I think the 'Transfer done but SPIE_DON isn't set' message can be
>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
>> We also write back to it to clear the current events. We re-read it in
>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>> naturally end up in that situation if we're doing a large read. Consider
>> the messages for reading a block of data from a spi-nor chip
>>
>> tx = READ_OP + ADDR
>> rx = data
>>
>> We setup the transfer and pump out the tx_buf. The first interrupt goes
>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>> continues until we've received all the data and we finish with ESPI_SPIE
>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>> isn't set.
>>
>> The other deficiency is that we only get an interrupt when the amount of
>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>> FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.
>>
> SPIM_DON will trigger an interrupt once the last characters have been
> transferred, and read the remaining characters from the FIFO.
The T2080RM that I have says the following about the DON bit
"Last character was transmitted. The last character was transmitted and
a new command can be written for the next frame."
That does at least seem to fit with my assertion that it's all about the
TX direction. But the fact that it doesn't happen all the time throws
some doubt on it.
> I think the reason I'm seeing some variability is because of how fast
>> (or slow) the interrupts get processed and how fast the spi-nor chip can
>> fill the CPUs rx fifo.
>>
> To rule out timing issues at high bus frequencies I initially asked
> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
> or even less, then timing shouldn't be an issue.
Yes I've currently got spi-max-frequency = <1000000>; in my dts. I would
also expect a slower frequency would fit my "DON is for TX" narrative.
> Last relevant functional changes have been done almost 4 years ago.
> And yours is the first such report I see. So question is what could be so
> special with your setup that it seems you're the only one being affected.
> The scenarios you describe are standard, therefore much more people
> should be affected in case of a driver bug.
Agreed. But even on my hardware (which may have a latent issue despite
being in the field for going on 5 years) the issue only triggers under
some fairly specific circumstances.
> You said that kernel config impacts how frequently the issue happens.
> Therefore question is what's the diff in kernel config, and how could
> the differences be related to SPI.
It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
impact but every time I found something that seemed to be having an
impact I've been able to disprove it. I actually think its about how
busy the system is which may or may not affect when we get round to
processing the interrupts.
I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
occur on the T2080RDB.
I've had to add the following to expose the environment as a mtd partition
diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
index ff87e67c70da..fbf95fc1fd68 100644
--- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
@@ -116,6 +116,15 @@ flash@0 {
compatible = "micron,n25q512ax3",
"jedec,spi-nor";
reg = <0>;
spi-max-frequency = <10000000>; /*
input clock */
+
+ partition@u-boot {
+ reg = <0x00000000 0x00100000>;
+ label = "u-boot";
+ };
+ partition@u-boot-env {
+ reg = <0x00100000 0x00010000>;
+ label = "u-boot-env";
+ };
};
};
And I'm using the following script to poke at the environment (warning
if anyone does try this and the bug hits it can render your u-boot
environment invalid).
cat flash/fw_env_test.sh
#!/bin/sh
generate_fw_env_config()
{
cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize name ; do
echo "$dev $size $erasesize $name"
[ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
$erasesize" >/flash/fw_env.config
done
}
cycles=10
[ $# -ge 1 ] && cycles=$1
generate_fw_env_config
fw_printenv -c /flash/fw_env.config
dmesg -c >/dev/null
x=0
while [ $x -lt $cycles ]; do
fw_printenv -c /flash/fw_env.config >/dev/null || break
fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
dmesg -c | grep -q fsl_espi && break;
let x=x+1
done
echo "Ran $x cycles"