Re: [PATCH net] team: Fix ABBA deadlock caused by race in team_del_slave

From: Eric Dumazet
Date: Wed Jul 03 2024 - 12:30:30 EST


On Wed, Jul 3, 2024 at 6:02 PM Jeongjun Park <aha310510@xxxxxxxxx> wrote:
>
> >
> > On Wed, Jul 03, 2024 at 11:51:59PM +0900, Jeongjun Park wrote:
> > > >        CPU0                    CPU1
> > > >        ----                    ----
> > > >   lock(&rdev->wiphy.mtx);
> > > >                                lock(team->team_lock_key#4);
> > > >                                lock(&rdev->wiphy.mtx);
> > > >   lock(team->team_lock_key#4);
> > >
> > > Deadlock occurs due to the above scenario. Therefore,
> > > modify the code as shown in the patch below to prevent deadlock.
> > >
> > > Regards,
> > > Jeongjun Park.
> >
> > The commit message should contain the patch description only (without
> > salutations, etc.).
> >
> > >
> > > Reported-and-tested-by: syzbot+705c61d60b091ef42c04@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Fixes: 61dc3461b954 ("team: convert overall spinlock to mutex")
> > > Signed-off-by: Jeongjun Park <aha310510@xxxxxxxxx>
> > > ---
> > > drivers/net/team/team_core.c | 14 ++++++++------
> > > 1 file changed, 8 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> > > index ab1935a4aa2c..3ac82df876b0 100644
> > > --- a/drivers/net/team/team_core.c
> > > +++ b/drivers/net/team/team_core.c
> > > @@ -1970,11 +1970,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> > > struct netlink_ext_ack *extack)
> > > {
> > > >  	struct team *team = netdev_priv(dev);
> > > > -	int err;
> > > > +	int err, locked;
> > >
> > > > -	mutex_lock(&team->lock);
> > > > +	locked = mutex_trylock(&team->lock);
> > > >  	err = team_port_add(team, port_dev, extack);
> > > > -	mutex_unlock(&team->lock);
> > > > +	if (locked)
> > > > +		mutex_unlock(&team->lock);
> >
> > > This is not a correct use of the mutex_trylock() API. With this change you
> > > could just as well remove the lock from this code path entirely.
> > > If mutex_trylock() returns false, the mutex could not be taken
> > > (because another thread already holds it), so you must not modify
> > > the resources that the mutex is supposed to protect.
> > > In other words, several threads could end up modifying resources through
> > > team_port_add() at the same time.
> >
> > >
> > > >  	if (!err)
> > > >  		netdev_change_features(dev);
> > > @@ -1985,11 +1986,12 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> > > static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
> > > {
> > > >  	struct team *team = netdev_priv(dev);
> > > > -	int err;
> > > > +	int err, locked;
> > >
> > > > -	mutex_lock(&team->lock);
> > > > +	locked = mutex_trylock(&team->lock);
> > > >  	err = team_port_del(team, port_dev);
> > > > -	mutex_unlock(&team->lock);
> > > > +	if (locked)
> > > > +		mutex_unlock(&team->lock);
> >
> > > The same problem as in team_add_slave().
> >
> > >
> > > >  	if (err)
> > > >  		return err;
> > > --
> > >
> >
> > > The patch does not seem to be a correct way to remove the deadlock.
> > > Most probably the synchronization design needs to be reviewed.
> > > If you really want to use the mutex_trylock() API, please consider making
> > > several attempts to take the mutex, but never modify the protected
> > > resources when the mutex has not been taken successfully.
> >
>
> > Thanks for your comments. I rewrote the patch based on them.
> > This version returns an error when the mutex cannot be taken, so that
> > the protected resources are not modified when a race occurs. I would
> > appreciate your feedback on this version of the patch.
>
> > Thanks,
> > Michal
> >
> >
>
> Regards,
> Jeongjun Park
>
> ---
> drivers/net/team/team_core.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/team/team_core.c b/drivers/net/team/team_core.c
> index ab1935a4aa2c..43d7c73b25aa 100644
> --- a/drivers/net/team/team_core.c
> +++ b/drivers/net/team/team_core.c
> @@ -1972,7 +1972,8 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev,
> >  	struct team *team = netdev_priv(dev);
> >  	int err;
> >
> > -	mutex_lock(&team->lock);
> > +	if (!mutex_trylock(&team->lock))
> > +		return -EBUSY;
> >  	err = team_port_add(team, port_dev, extack);
> >  	mutex_unlock(&team->lock);
>
> @@ -1987,7 +1988,8 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
> >  	struct team *team = netdev_priv(dev);
> >  	int err;
> >
> > -	mutex_lock(&team->lock);
> > +	if (!mutex_trylock(&team->lock))
> > +		return -EBUSY;
> >  	err = team_port_del(team, port_dev);
> >  	mutex_unlock(&team->lock);
>
> --

Failing team_del_slave() is not an option. It would cause various issues.