On 2017-10-12 16:42, Guenter Roeck wrote:
On 10/12/2017 07:32 AM, Peter Rosin wrote:
On 2017-10-12 13:35, Peter Rosin wrote:
Hi!
I have encountered an "interesting" bug. It silently corrupts data
and is generally nasty...
On an I2C bus, driven by the at91 driver and DMA (an Atmel
sama5d31 chip), I have an 256 byte eeprom (NXP SE97BTP). I'm using
Linux v4.13.
The at24 driver for the eeprom detects that the I2C adapter is
capable of I2C transactions and selects that over SMBus. Reads
are done in 128 byte chunks. However, sometimes there is some
kind of event that disturbs the transactions such that the very
last bit och the very last byte (and the following NACK and STOP)
of such chunks are delayed for a long time (the latest incident
shows 85ms on the scope). That is too long for the eeprom which
is expecting SMBus and times out after about 30 ms. When the
eeprom times out, it just releases the data line so that it is
pulled up high. The I2C driver does not notice this, and when it
finally gets going, it reads a one for the last bit instead of
the expected zero. Since it is the last byte of the read, a NACK
is expected and since the eeprom has timed out the NACK is there.
And the STOP condition also looks normal (expected since it is
generated by the driver itself). So, the driver has not noticed
anything funny. But the data is corrupted.
I can work around this by disabling the SMBus timeout in the eeprom
with:
i2cset -f 0 0x18 0x22 0x8100
But that is done on a different I2C address (the eeprom is on
address 0x50), since the chip is a combined temperature sensor and
eeprom, and the SMBus timeout bit is of course in a temperature
sensor register.
HOWEVER, I fail to see how this is limited to my case with this
eeprom. Any SMBus chip with a timeout will suffer the same fate.
The real bug is that this happens without the driver noticing it.
And why is there a 85ms delay in the middle of the last byte?
Sure, I can see why there might be a delay before finishing up
with a STOP condition or between bytes if there needs to be some
DMA setup at some interval, but after the 7th bit of a byte?
For a lot of transactions on the I2C bus there is no delay before
the last bit. And most of the time there is no delay for the
eeprom reads either; the delay only occurs when it feels like it.
This does not feel good at all.
I added some traces to i2c-at91.c and, AFAIU, it's the call to
at91_twi_read_data_dma_callback that sometimes arrives later than
desired. Once the callback runs, the transfer completes swiftly.
After reading the comments in that driver I suppose the HW holds
on to the last data-bit until it knows whether to ACK or NACK in
the following bit.
But given this, it is questionable if this driver/HW combo can
claim support for SMBus. But then again, I expect many things
suffer from similar scheduling delays (presumably that's what's
going on) so this is probably not special to i2c-at91.c...
Since this is probably a very generic problem and I just happened
to hit it for the eeprom, I wonder if it would be ok to add a
workaround, as below, to the temperature sensor driver part of this
chip? (with suitable comments, defines for the constants etc -
setting the 0x0080 bit in reg 0x22 disables the SMBus timeout)
Cheers,
Peter
diff --git a/drivers/hwmon/jc42.c b/drivers/hwmon/jc42.c
index 1bf22eff0b08..3e72bd8e06d1 100644
--- a/drivers/hwmon/jc42.c
+++ b/drivers/hwmon/jc42.c
@@ -416,6 +416,13 @@ static int jc42_detect(struct i2c_client *client, struct i2c_board_info *info)
if ((cap & 0xff00) || (config & 0xf800))
return -ENODEV;
+ if (manid == NXP_MANID && (devid & SE97_DEVID_MASK) == SE97_DEVID) {
+ int smbus = i2c_smbus_read_word_swapped(client, 0x22);
+ if (smbus < 0)
+ return -ENODEV;
+ i2c_smbus_write_word_swapped(client, 0x22, smbus | 0x0080);
+ }
+
Outch. Not like that; this would affect every board with this chip, not just this one.
That was kind of the intention. Silent corruption is nasty. But ok, let's
do the opt-in thing.
We would need something like a DT property to do that (smbus-timeout-disable is used
in other drivers).
.. and definitely not in the detect function. This would have to be done in probe.
Arrgh, I did test it, but not properly. The patch passed my tests
because I didn't power-cycle the board and the already disabled timeout
stayed in place for the next boot. I just assumed jc42_detect was
called from probe and it was convenient to have the manufacturer etc
already at hand. Silly silly.
I'll send a proper patch tomorrow with these things taken care of...
Cheers,
peda