Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR
From: Andrew Lunn
Date: Tue May 28 2024 - 14:14:37 EST
On Tue, May 28, 2024 at 01:52:56PM -0400, Sean Anderson wrote:
> (forgot to CC Alex)
>
> On 5/28/24 13:50, Sean Anderson wrote:
> > On 5/28/24 13:28, Russell King (Oracle) wrote:
> >> First, note that phylib's policy is if it loses comms with the PHY,
> >> then the link will be forced down. This is out of control of the SFP
> >> or phylink code.
> >>
> >> I've seen bugs with the I2C emulation on some modules resulting in
> >> problems with various I2C controllers.
> >>
> >> Sometimes the problem is due to a bad I2C level shifter. Some I2C
> >> level shifter manufacturers will swear blind that their shifter
> >> doesn't lock up, but strangely, one can prove with an oscilloscope
> >> that it _does_ lock up - and in a way that the only way to recover
> >> was to possibly unplug the module or power cycle the platform.
> >
> > Well, I haven't seen any case where the bus locks up. I've been able to
> > recover just by doing
> >
> > ip link set net0 down
> > ip link set net0 up
> >
> > which suggests that this is just a transient problem.
If you look back over the history, I don't think you will find any
reports of transient problems with real MDIO busses. Hence any error
is considered fatal. Also, when you consider the design of MDIO, it is
actually very hard for an error to be detected. It is basically a
shift register, shifting out 64 bits for a write, or 48 bits for a
read, followed by receiving 16 bits for a read. There is no protocol
to indicate any sort of error. If there is no device at the address,
the pullup means you receive 1s. End of story.
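
As a rough illustration of the framing described above, here is a
minimal sketch in C of the IEEE 802.3 Clause 22 frame layout. The
macro and function names are hypothetical, made up for this example,
and are not taken from the kernel. A write is a 32-bit preamble of 1s
followed by the 32 bits built below (64 in total). For a read the
master drives the preamble plus a 14-bit header, then releases the
line for the 2-bit turnaround (48 bit times) before the PHY drives
16 data bits. Nothing in the frame carries an ACK, NAK or checksum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the Clause 22 field values. */
#define MDIO_C22_ST         0x1u    /* start of frame: 01 */
#define MDIO_C22_OP_WRITE   0x1u    /* opcode: 01 */
#define MDIO_C22_OP_READ    0x2u    /* opcode: 10 */
#define MDIO_C22_TA_WRITE   0x2u    /* turnaround driven by master: 10 */

/* With no device at the address, the pull-up makes a read return all 1s. */
#define MDIO_READ_NO_DEVICE 0xffffu

/* The 32 bits that follow the 32-bit preamble for a write. */
static uint32_t mdio_c22_write_frame(unsigned int phy, unsigned int reg,
				     uint16_t val)
{
	assert(phy < 32 && reg < 32);	/* both addresses are 5 bits wide */
	return (MDIO_C22_ST << 30) | (MDIO_C22_OP_WRITE << 28) |
	       (phy << 23) | (reg << 18) |
	       (MDIO_C22_TA_WRITE << 16) | val;
}

/* The 14 header bits the master drives after the preamble for a read. */
static uint16_t mdio_c22_read_header(unsigned int phy, unsigned int reg)
{
	assert(phy < 32 && reg < 32);
	return (uint16_t)((MDIO_C22_ST << 12) | (MDIO_C22_OP_READ << 10) |
			  (phy << 5) | reg);
}
```

Note that a read from an absent device and a register that really
does hold 0xffff look identical on the wire, which is why the bus
itself cannot report an error.
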
With MDIO over I2C, it is I2C which has problems, not MDIO. Do you
expect transient problems with I2C?
I would also point out that MDIO is not idempotent. Reading an
interrupt status register often clears it. Reading the link status
clears the latched link status. If you need to retry the read of the
interrupt status register, you cannot: the interrupt has been cleared,
you have lost it, and probably your hardware no longer works because
you don't know what interrupt to handle. If you need to re-read the
link status, you have lost the latched version, and you have missed an
up or down event.
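
To make the retry problem concrete, here is a small sketch in C of a
latch-low link status bit. It is a toy model, not phylib code: the
fake_phy struct and both functions are hypothetical, and BMSR_LSTATUS
is defined locally with the value of the standard link status bit in
the basic status register.

```c
#include <stdbool.h>
#include <stdint.h>

#define BMSR_LSTATUS 0x0004u	/* link status bit, latches low */

/* Hypothetical model of a PHY with a latched link status. */
struct fake_phy {
	bool link_now;		/* current state of the link */
	bool link_latched_lo;	/* link dropped since the last read */
};

/* The link changes state; a drop is remembered until the next read. */
static void fake_phy_set_link(struct fake_phy *phy, bool up)
{
	if (!up)
		phy->link_latched_lo = true;
	phy->link_now = up;
}

/* Reading the status register has a side effect: it clears the latch. */
static uint16_t fake_phy_read_bmsr(struct fake_phy *phy)
{
	bool report_up = phy->link_now && !phy->link_latched_lo;

	phy->link_latched_lo = false;
	return report_up ? BMSR_LSTATUS : 0;
}
```

If the link drops and comes back, the first read returns 0 and
records the event. If the reply to that read is lost on the I2C side
and the read is retried, the second call returns BMSR_LSTATUS and the
down event is never seen.
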
> >> My advice would be to investigate the hardware in the first instance.
I agree with Russell. Figure out why I2C is flaky. Since this is an
SFP it may be something as trivial as the contacts need cleaning. Or
the resistors are wrong, or you have a cheap module which is out of
spec.
Andrew