net: never suspend the ethernet PHY on certain boards?

From: Alexander Dahl
Date: Wed Jun 26 2019 - 07:23:35 EST

Hei hei,

tl;dr: is there a way to prevent an ethernet PHY from ever powering down,
preferably with some dt configuration and not with a hack like patching out
suspend?

With the bugfix 0da70f808029476001109b6cb076737bc04cea2e ("net: macb: do not
disable MDIO bus at open/close time", which came with kernel v4.19 and was
backported to v4.18.7) a problem arises for us which had been masked for ages.
It shows up with a special combination of SoC, ethernet PHY and other chips on
the same board, and the Linux drivers for them.

The boards use either an at91sam9g20 or a sama5d27 SoC, both using cadence/macb
as ethernet driver. Both boards have a smsc LAN8720A ethernet phy attached.
The RMII clock is generated by the PHY, which uses a 25 MHz crystal for that.
This clock line is of course fed into the SoC/MAC, but also used (you might
say hijacked) by other chips on the board which depend on that clock being
_always_ on (at least after initial init on boot). The hardware cannot be
changed, we are talking about several hundred boards already sold over the
last years.

The symptom: when calling `ip link set down dev eth0` that clock goes off, and
the other chips (neither SoC nor PHY) depending on that clock freeze.

I could bisect this behaviour change on a vanilla kernel to the commit
mentioned above (actually to the backport commit v4.18.7-4-g716fc5ce90cf,
because I bisected from v4.17.19 to v4.18.20).

What I tracked down so far: before the bugfix, macb_close() cleared the MPE
bit in the MAC Network Control Register, which probably prevents the MAC from
sending MDIO telegrams to the PHY? After the bugfix, that bit is not cleared
anymore (to still allow talking to other PHYs on the same MDIO bus, a case we
don't have). I assume communicating with the PHY is still possible then.

macb_close() also calls phy_stop(), which sets the state of the phy driver
state machine to PHY_HALTED. On the next run of that state machine,
phy_suspend() is called.

The smsc phy driver has no special suspend/resume functions, but uses
genphy_suspend(), which sets BMCR_PDOWN in the MII_BMCR register of that
(standard compliant) PHY. I suspect that after this the PHY powers down and
the clock goes off.
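
To show which bit I mean, here is a minimal user-space sketch of the
read-modify-write I believe genphy_suspend() performs. The struct and the
model_* function names are made up for illustration, only MII_BMCR and
BMCR_PDOWN mirror the values from linux/mii.h.

```c
#include <stdint.h>

#define MII_BMCR   0x00    /* basic mode control register */
#define BMCR_PDOWN 0x0800  /* power-down bit, same value as in linux/mii.h */

/* Hypothetical stand-in for the 32 MDIO registers of one PHY. */
struct model_phy {
    uint16_t regs[32];
};

/* Models the effect of genphy_suspend(): set BMCR_PDOWN, keep other bits. */
static void model_genphy_suspend(struct model_phy *phy)
{
    phy->regs[MII_BMCR] |= BMCR_PDOWN;
}

/* Models the effect of genphy_resume(): clear BMCR_PDOWN again. */
static void model_genphy_resume(struct model_phy *phy)
{
    phy->regs[MII_BMCR] &= (uint16_t)~BMCR_PDOWN;
}
```

Whether the LAN8720A really stops its clock output once that bit is set is
exactly the part I have not verified on the hardware yet.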

I assume before that bugfix, this power down bit could not be set, because the
MDIO interface in the MAC had been disabled, so the PHY stayed on. (However
there's a possible race because in macb_close() the phy_stop() is called
before macb_reset_hw(), right?)
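
My theory as a small user-space model, ignoring the possible race mentioned
above. Everything named model_* is hypothetical, and the gating of MDIO
writes by MPE is my assumption, not something I measured. MACB_MPE is bit 4
of the NCR as far as I read macb.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define MACB_MPE   (1u << 4)  /* management port enable bit in NCR */
#define MII_BMCR   0x00
#define BMCR_PDOWN 0x0800

/* Hypothetical model: one MAC with its NCR and one attached PHY. */
struct model_board {
    uint32_t ncr;       /* MAC Network Control Register */
    uint16_t phy_bmcr;  /* PHY basic mode control register */
};

/* Assumption: an MDIO write only reaches the PHY while MPE is set. */
static bool model_mdio_write(struct model_board *b, int reg, uint16_t val)
{
    if (!(b->ncr & MACB_MPE))
        return false;
    if (reg == MII_BMCR)
        b->phy_bmcr = val;
    return true;
}

/* Assumption: the RMII clock runs as long as the PHY is not powered down. */
static bool model_clock_running(const struct model_board *b)
{
    return !(b->phy_bmcr & BMCR_PDOWN);
}

/* Close path: clear_mpe = true models the old behaviour, false the new one. */
static void model_close(struct model_board *b, bool clear_mpe)
{
    if (clear_mpe)
        b->ncr &= ~MACB_MPE;
    /* The deferred suspend from the phy state machine runs afterwards. */
    model_mdio_write(b, MII_BMCR, b->phy_bmcr | BMCR_PDOWN);
}
```

In this model the clock survives model_close() only when MPE gets cleared
first, which would match what we observe before and after the bugfix.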

So far, these are mostly assumptions. I did not use gdb on the drivers or a
logic analyzer on the MDIO lines. I could do that to prove them, however.

What I could do:

1) Revert that change on my tree, which would mean reverting a generic bugfix
2) Patch smsc phy driver to not suspend anymore
3) Invent some new way to prevent suspend on a configuration basis (dt?)
4) Anything I did not think of yet

I know 1) or 2) are hacks without a chance to make it to mainline. What would
be your suggestions for 3) and 4)?
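
To make 3) a bit more concrete, a user-space sketch of the behaviour I have in
mind. The keep_clock_on flag and all model_* names are hypothetical, I am not
aware of an existing dt property or phylib field that does this.

```c
#include <stdbool.h>
#include <stdint.h>

#define BMCR_PDOWN 0x0800  /* power-down bit, same value as in linux/mii.h */

/* Hypothetical per-PHY state; keep_clock_on would be filled from the dt. */
struct model_phydev {
    uint16_t bmcr;
    bool keep_clock_on;
};

/* Suspend that turns into a no-op on boards which need the clock. */
static int model_phy_suspend(struct model_phydev *phy)
{
    if (phy->keep_clock_on)
        return 0;  /* leave the PHY powered, clock keeps running */
    phy->bmcr |= BMCR_PDOWN;
    return 0;
}
```

Boards without the flag would keep today's behaviour, so nothing changes for
anyone else.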