Re: [PATCH net-next v8 3/7] net: stmmac: refactor FPE verification process
From: Vladimir Oltean
Date: Thu Sep 05 2024 - 09:54:58 EST
On Thu, Sep 05, 2024 at 03:02:24PM +0800, Furong Xu wrote:
> +void stmmac_fpe_apply(struct stmmac_priv *priv)
> +{
> + struct ethtool_mm_state *state = &priv->fpe_cfg.state;
> + struct stmmac_fpe_cfg *fpe_cfg = &priv->fpe_cfg;
> +
> + /* If verification is disabled, configure FPE right away.
> + * Otherwise let the timer code do it.
> + */
> + if (!state->verify_enabled) {
> + stmmac_fpe_configure(priv, priv->ioaddr, fpe_cfg,
> + priv->plat->tx_queues_to_use,
> + priv->plat->rx_queues_to_use,
> + state->tx_enabled,
> + state->pmac_enabled);
> + } else {
> + state->verify_status = ETHTOOL_MM_VERIFY_STATUS_INITIAL;
> + fpe_cfg->verify_retries = STMMAC_FPE_MM_MAX_VERIFY_RETRIES;
> +
> + if (netif_device_present(priv->dev) && netif_running(priv->dev))
> + stmmac_fpe_verify_timer_arm(fpe_cfg);
> + }
> }

In the cover letter, you say:

  2. check netif_running() to guarantee synchronization rules between
     mod_timer() and timer_delete_sync()

[ by the way, it would be nice if you could list the changes in
  individual patches as well ]

but I guess this helps with something other than what you say it helps
with.

netif_running() essentially checks that __dev_open() has been called,
aka "ip link set dev eth0 up". And I don't see the ethtool_ops :: begin()
implemented by the driver any longer, so I think you've done this in
order to accept stmmac_set_mm() calls even before the netdev has been
brought operationally up. Okay.
As for netif_device_present(), I don't know, maybe the intention was to
suppress stmmac_set_mm() calls made after stmmac_suspend(). But
ethnl_ops_begin() has its own netif_device_present() call, so I'm not
sure why it is needed - they should already be suppressed.
But in v7, I was thinking about the concurrency issues here:

static int stmmac_set_mm(struct net_device *ndev, struct ethtool_mm_cfg *cfg,
			 struct netlink_ext_ack *extack)
{
	/* Wait for the verification that's currently in progress to finish */
	del_timer_sync(&fpe_cfg->verify_timer);

		<- Concurrent code can run here:
		   stmmac_fpe_link_state_handle(),
		   called from phylink_resolve()
		   workqueue context, rtnl_lock()
		   not held.

	spin_lock_irqsave(&fpe_cfg->lock, flags);
	stmmac_fpe_apply(priv);
	spin_unlock_irqrestore(&fpe_cfg->lock, flags);
}

static void stmmac_fpe_link_state_handle(struct stmmac_priv *priv, bool is_up)
{
	struct stmmac_fpe_cfg *fpe_cfg = &priv->fpe_cfg;
	unsigned long flags;

	timer_delete_sync(&fpe_cfg->verify_timer);

		<- Concurrent code can run here:
		   stmmac_set_mm()

	spin_lock_irqsave(&fpe_cfg->lock, flags);

	if (is_up && fpe_cfg->state.pmac_enabled) {
		/* VERIFY process requires pmac enabled when NIC comes up */
		stmmac_fpe_configure(priv, priv->ioaddr, fpe_cfg,
				     priv->plat->tx_queues_to_use,
				     priv->plat->rx_queues_to_use,
				     false, true);

		/* New link => maybe new partner => new verification process */
		stmmac_fpe_apply(priv);
	} else {
		/* No link => turn off EFPE */
		stmmac_fpe_configure(priv, priv->ioaddr, fpe_cfg,
				     priv->plat->tx_queues_to_use,
				     priv->plat->rx_queues_to_use,
				     false, false);
	}

	spin_unlock_irqrestore(&fpe_cfg->lock, flags);
}

[ oh btw, you forgot to replace the del_timer_sync() call in
stmmac_set_mm() with timer_delete_sync() ]
Because the timer can be restarted right after the timer_delete_sync()
call, this is a half-baked implementation.
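
To make the window concrete, here is a rough userspace model of one bad
interleaving. It is only a sketch, not driver code: struct fpe_model and
the model_*() helpers are made-up stand-ins, and it assumes verification
is enabled and the netdev is running, so that stmmac_fpe_apply() really
does arm the timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in: only the timer's "pending" bit is modelled. */
struct fpe_model {
	bool timer_pending;
};

/* Models timer_delete_sync(): the timer is not pending on return. */
static void model_timer_delete(struct fpe_model *m)
{
	m->timer_pending = false;
}

/* Models stmmac_fpe_verify_timer_arm(): the timer is pending again. */
static void model_timer_arm(struct fpe_model *m)
{
	m->timer_pending = true;
}

/*
 * Replays one interleaving and reports whether the timer is pending at
 * the moment stmmac_set_mm() is about to take fpe_cfg->lock.
 */
static bool timer_pending_when_set_mm_locks(void)
{
	struct fpe_model m = { .timer_pending = true };

	/* stmmac_set_mm(): timer stopped, fpe_cfg->lock not taken yet */
	model_timer_delete(&m);
	assert(!m.timer_pending);

	/* stmmac_fpe_link_state_handle(is_up == true) runs in the window:
	 * it stops the timer too, then re-arms it via stmmac_fpe_apply()
	 */
	model_timer_delete(&m);
	model_timer_arm(&m);

	/* stmmac_set_mm() resumes here */
	return m.timer_pending;
}
```

In this model timer_pending_when_set_mm_locks() returns true: by the
time stmmac_set_mm() gets the lock, the timer it just stopped is pending
again.
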
I think at the end of the day, we need to ask ourselves: what is the
timer_delete_sync() call even supposed to accomplish? What if the verify
timer is allowed to run concurrently with us changing the settings?
Well, for example, if it runs concurrently with
stmmac_fpe_link_state_handle(is_up==false), it will not learn that the
link is down, it will send an MPACKET_VERIFY, get no response, and fail.
So, not very bad.
And the other way around: stmmac_set_mm() stops the verify timer, but
the link comes up, the timer is armed with the old settings, it does
whatever (succeeds, fails), and only afterwards does stmmac_set_mm()
manage to grab &fpe_cfg->lock, change the settings to the new ones, and
re-trigger the verify timer once again, if needed.
So bottom line, I think timer_delete_sync() is to avoid some useless
work, but otherwise, it is not critical to have it. The choice is
between removing the timer_delete_sync() calls from these 2 functions
altogether, or implementing an actually effective mechanism to stop the
timer for a while.
I _think_ that the simplest way to stop it is to hold one more lock for
the verify_timer when we call timer_delete_sync() and stmmac_fpe_verify_timer_arm(),
lock which _is_ IRQ-safe, unlike &fpe_cfg->lock.

static int stmmac_set_mm(struct net_device *ndev, struct ethtool_mm_cfg *cfg,
			 struct netlink_ext_ack *extack)
{
	spin_lock(&fpe_cfg->verify_timer_lock);
	timer_delete_sync(&fpe_cfg->verify_timer);

	spin_lock_irqsave(&fpe_cfg->lock, flags);
	stmmac_fpe_apply(priv);
	spin_unlock_irqrestore(&fpe_cfg->lock, flags);

	spin_unlock(&fpe_cfg->verify_timer_lock);
}

static void stmmac_fpe_link_state_handle(struct stmmac_priv *priv, bool is_up)
{
	spin_lock(&fpe_cfg->verify_timer_lock);
	timer_delete_sync(&fpe_cfg->verify_timer);

	spin_lock_irqsave(&fpe_cfg->lock, flags);

	if (is_up && fpe_cfg->state.pmac_enabled) {
		/* VERIFY process requires pmac enabled when NIC comes up */
		stmmac_fpe_configure(priv, priv->ioaddr, fpe_cfg,
				     priv->plat->tx_queues_to_use,
				     priv->plat->rx_queues_to_use,
				     false, true);

		/* New link => maybe new partner => new verification process */
		stmmac_fpe_apply(priv);
	} else {
		/* No link => turn off EFPE */
		stmmac_fpe_configure(priv, priv->ioaddr, fpe_cfg,
				     priv->plat->tx_queues_to_use,
				     priv->plat->rx_queues_to_use,
				     false, false);
	}

	spin_unlock_irqrestore(&fpe_cfg->lock, flags);
	spin_unlock(&fpe_cfg->verify_timer_lock);
}

Looking at the __timer_delete_sync() implementation, I don't think
verify_timer_lock needs to be sleepable and hence a mutex (except on
PREEMPT_RT where spinlocks are sleepable no matter what you do).
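
For illustration only, a single-threaded userspace model of what the
extra lock buys. The outer lock is modelled as a bool with trylock
semantics, where "false" means the other path would have to spin. All
names here (struct fpe_model, model_*()) are hypothetical, and the model
assumes that stmmac_fpe_apply() arms the timer.

```c
#include <assert.h>
#include <stdbool.h>

struct fpe_model {
	bool verify_timer_lock_held;	/* models verify_timer_lock */
	bool timer_pending;		/* models the verify timer */
};

/* Models spin_lock() as a trylock: false means "would have to spin". */
static bool model_trylock(struct fpe_model *m)
{
	if (m->verify_timer_lock_held)
		return false;
	m->verify_timer_lock_held = true;
	return true;
}

static void model_unlock(struct fpe_model *m)
{
	assert(m->verify_timer_lock_held);
	m->verify_timer_lock_held = false;
}

/* Link-up path: stop and re-arm the timer, all under the outer lock.
 * Returns false if it could not enter the critical section.
 */
static bool model_link_up(struct fpe_model *m)
{
	if (!model_trylock(m))
		return false;
	m->timer_pending = false;	/* timer_delete_sync() */
	m->timer_pending = true;	/* stmmac_fpe_apply() arms the timer */
	model_unlock(m);
	return true;
}

/* Reports whether the link-up path managed to re-arm the timer inside
 * stmmac_set_mm()'s delete -> apply window.
 */
static bool timer_rearmed_in_set_mm_window(void)
{
	struct fpe_model m = { .timer_pending = true };
	bool locked, entered, rearmed;

	locked = model_trylock(&m);	/* stmmac_set_mm() takes the outer lock */
	assert(locked);
	(void)locked;
	m.timer_pending = false;	/* timer_delete_sync() */

	entered = model_link_up(&m);	/* concurrent path tries to run here */
	rearmed = m.timer_pending;

	m.timer_pending = true;		/* stmmac_fpe_apply(), new settings */
	model_unlock(&m);

	assert(!entered);
	(void)entered;
	return rearmed;
}
```

Here timer_rearmed_in_set_mm_window() returns false: the link-up path
cannot get in between the delete and the apply, so the timer stays
stopped until stmmac_set_mm() itself re-arms it.
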
But I think the implementation would be simpler without
timer_delete_sync() in these 2 functions, and without this
overengineered mechanism.

I would expect a comment in stmmac_release() here:

	if (priv->dma_cap.fpesel)
		timer_delete_sync(&priv->fpe_cfg.verify_timer);

that timer restarts are not possible, because we have rtnl_lock() held
and a concurrent stmmac_set_mm() cannot run now, and the earlier
phylink_stop() has also ensured stmmac_fpe_link_state_handle() cannot
run any longer.
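
Something along these lines, as a sketch of the wording only. The types
and the timer_delete_sync() body below are userspace stand-ins so that
the fragment compiles outside the kernel, and stmmac_release_fpe_timer()
is a hypothetical wrapper around the two lines from stmmac_release(),
not a function in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types (hypothetical, minimal). */
struct timer_list { bool pending; };
struct stmmac_fpe_cfg { struct timer_list verify_timer; };
struct stmmac_priv {
	struct { bool fpesel; } dma_cap;
	struct stmmac_fpe_cfg fpe_cfg;
};

/* Stand-in: returns 1 if the timer was pending, 0 otherwise. */
static int timer_delete_sync(struct timer_list *t)
{
	bool was_pending;

	assert(t != NULL);
	was_pending = t->pending;
	t->pending = false;
	return was_pending;
}

/* Hypothetical wrapper around the snippet from stmmac_release(). */
static void stmmac_release_fpe_timer(struct stmmac_priv *priv)
{
	/* The verify timer cannot be re-armed after this point:
	 * - rtnl_lock() is held, so a concurrent stmmac_set_mm()
	 *   cannot run;
	 * - phylink_stop() was called earlier, so
	 *   stmmac_fpe_link_state_handle() cannot run any longer.
	 */
	if (priv->dma_cap.fpesel)
		timer_delete_sync(&priv->fpe_cfg.verify_timer);
}
```
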
Similarly, I would like to see an explanation in the form of a comment
for why timer restarts are not possible after the same pattern in
stmmac_suspend(). The explanation is different there, I think.