Re: [PATCH v2] net: dsa: mv88e6xxx: propperly shutdown PPU re-enable timer on destroy

From: Andrew Lunn
Date: Tue Dec 03 2024 - 21:31:39 EST


On Tue, Dec 03, 2024 at 03:43:40PM +0100, David Oberhollenzer wrote:
> The mv88e6xxx has an internal PPU that polls PHY state. If we want to
> access the internal PHYs, we need to disable it. Because enable/disable
> of the PPU is a slow operation, a 10ms timer is used to re-enable it,
> canceled with every access, so bulk operations effectively only disable
> it once and re-enable it some 10ms after the last access.
>
> If a PHY is accessed and then the mv88e6xxx module is removed before
> the 10ms are up, the PPU re-enable ends up accessing a dangling pointer.
>
> This especially affects probing during bootup. The MDIO bus and PHY
> registration may succeed, but registration with the DSA framework
> may fail later on (e.g. because the CPU port depends on another,
> very slow device that isn't done probing yet, returning -EPROBE_DEFER).
> In this case, probe() fails, but the MDIO subsystem may already have
> accessed the MDIO bus or PHYs, arming the timer.
>
> This is fixed as follows:
> - If probe fails after mv88e6xxx_phy_init(), make sure we also call
> mv88e6xxx_phy_destroy() before returning
> - In mv88e6xxx_phy_destroy(), grab the ppu_mutex to make sure the work
> function either has already exited, or (should it run) cannot do
> anything, fails to grab the mutex and returns.

On first reading this, I did not realise the code is using
mutex_trylock(), which made me think it could deadlock. Maybe change
this to "mutex_trylock() fails to get the mutex and returns".

But I'm not actually sure this is needed. There are plenty of other
examples of destroying a work item without taking a mutex.
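
To illustrate why the trylock variant cannot deadlock, here is a
minimal userspace sketch using pthreads rather than the kernel API.
Everything in it (struct fake_chip, ppu_reenable_work(), run_work())
is a hypothetical stand-in, not code from the driver, and
pthread_join() only approximates what cancel_work_sync() does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the chip state; not the driver's struct. */
struct fake_chip {
	pthread_mutex_t ppu_mutex;
	bool ppu_enabled;
};

/* Stand-in for the re-enable work: back off if the mutex is busy. */
static void *ppu_reenable_work(void *arg)
{
	struct fake_chip *chip = arg;

	if (pthread_mutex_trylock(&chip->ppu_mutex) != 0)
		return NULL;	/* mutex held elsewhere: do nothing */

	chip->ppu_enabled = true;
	pthread_mutex_unlock(&chip->ppu_mutex);
	return NULL;
}

/*
 * Run the work once, optionally while the "destroy" path holds the
 * mutex, and report whether the work re-enabled the PPU.
 */
static bool run_work(bool destroy_holds_mutex)
{
	struct fake_chip chip = { .ppu_enabled = false };
	pthread_t worker;

	pthread_mutex_init(&chip.ppu_mutex, NULL);

	if (destroy_holds_mutex)
		pthread_mutex_lock(&chip.ppu_mutex);

	pthread_create(&worker, NULL, ppu_reenable_work, &chip);
	/* Wait for the work to finish, roughly like cancel_work_sync(). */
	pthread_join(worker, NULL);

	if (destroy_holds_mutex)
		pthread_mutex_unlock(&chip.ppu_mutex);

	pthread_mutex_destroy(&chip.ppu_mutex);
	return chip.ppu_enabled;
}
```

With the mutex held, the work returns without touching the state and
the join completes. A blocking lock in the work function would hang at
the join instead.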

> - In addition to destroying the timer, also destroy the work item, in
> case the timer has already fired.
> - Do all of this synchronously, to make sure timer & work item are
> destroyed and none of the callbacks are running.

This is the important part, doing it synchronously. cancel_work_sync()
should be enough.

>  static void mv88e6xxx_phy_ppu_state_destroy(struct mv88e6xxx_chip *chip)
>  {
> +	mutex_lock(&chip->ppu_mutex);
>  	del_timer_sync(&chip->ppu_timer);
> +	cancel_work_sync(&chip->ppu_work);
> +	mutex_unlock(&chip->ppu_mutex);
>  }

/**
 * del_timer_sync - Delete a pending timer and wait for a running callback
 * @timer: The timer to be deleted
 *
 * See timer_delete_sync() for detailed explanation.
 *
 * Do not use in new code. Use timer_delete_sync() instead.


Andrew

---
pw-bot: cr