Re: [PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
From: Cosmin Ratiu
Date: Mon Feb 02 2026 - 09:53:36 EST
On Thu, 2026-01-29 at 15:40 -0800, Jakub Kicinski wrote:
> On Thu, 29 Jan 2026 10:33:40 +0000 Cosmin Ratiu wrote:
> > > This is quite an ugly hack, is there no way to avoid the flush and let
> > > the work discover that what it was supposed to do is no longer needed?
> >
> > Not possible, unfortunately. I stared at it for quite a while. The wq
> > is flushed because the esw is being unconfigured, which removes data
> > structs the work handler uses. Flushing the work is required, otherwise
> > we'll run into worse issues.
>
> And having a refcount on (I presume) struct mlx5_esw_functions
> so that work can hold a ref is not an option?
> Are you planning to revisit this in -next?
Currently, mlx5_eswitch_disable_locked() runs with the devlink lock held
and waits for esw_vfs_changed_event_handler() to finish. The event
handler needs to acquire the same lock so that it can load/unload all
VFs, which touches the entire esw.

I don't currently see how to use reference counting on the esw to avoid
waiting for the handler.
But we can take a deeper look as part of an internal task to improve
this. For now, please accept the V3 fix (about to be sent), which keeps
the current approach, because we couldn't find a better way.
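
To make the lock dependency described above concrete, here is a minimal
userspace sketch. It is not driver code: devlink_lock,
vfs_changed_handler() and disable_and_flush() are hypothetical stand-ins,
a pthread mutex plays the devlink lock, and pthread_join() plays the
flush of esw->wq. The handler uses trylock only so that the sketch
reports the conflict instead of hanging; the real handler takes the lock
unconditionally.

```c
/* Userspace sketch only: pthreads stand in for the devlink lock and esw->wq. */
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the devlink instance lock. */
static pthread_mutex_t devlink_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical stand-in for esw_vfs_changed_event_handler().
 * Records whether the lock could be taken (0) or was busy (EBUSY).
 */
static void *vfs_changed_handler(void *arg)
{
	int *lock_result = arg;

	*lock_result = pthread_mutex_trylock(&devlink_lock);
	if (*lock_result == 0) {
		/* Loading/unloading the VFs would happen here. */
		pthread_mutex_unlock(&devlink_lock);
	}
	return NULL;
}

/*
 * Hypothetical stand-in for mlx5_eswitch_disable_locked(): start the
 * work and wait for it. hold_lock selects whether the caller holds the
 * lock while waiting. Returns the handler's trylock result, or -1 if
 * the worker thread could not be created.
 */
static int disable_and_flush(int hold_lock)
{
	pthread_t work;
	int lock_result = -1;

	if (hold_lock)
		pthread_mutex_lock(&devlink_lock);

	if (pthread_create(&work, NULL, vfs_changed_handler, &lock_result) == 0)
		pthread_join(work, NULL);	/* stands in for flushing esw->wq */

	if (hold_lock)
		pthread_mutex_unlock(&devlink_lock);

	return lock_result;
}
```

disable_and_flush(1) returns EBUSY: the handler cannot get the lock while
the caller holds it and waits. With a blocking lock in the handler, that
is the point where both sides would wait on each other forever.
disable_and_flush(0) returns 0, since nothing holds the lock during the
wait.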
Cosmin.