Re: [PATCH 0/3] Fix for KSZ DSA switch shutdown

From: Vladimir Oltean
Date: Thu Sep 09 2021 - 18:55:07 EST


On Thu, Sep 09, 2021 at 07:07:33PM +0200, Lino Sanfilippo wrote:
> > It does not scale really well to have individual drivers call
> > dsa_tree_shutdown() in their respective .shutdown callback, and in a
> > multi-switch configuration, I am not sure what the results would
> > look like.
> >
> > In premise, each driver ought to be able to call
> > dsa_unregister_switch(), along with all of the driver specific
> > shutdown and eventually, given proper device ordering the DSA tree
> > would get automatically torn down, and then the DSA master's
> > .shutdown() callback would be called.
> >
> > FWIW, the reason why we call .shutdown() in bcmgenet is to turn off
> > DMA and clocks, which matters for kexec (DMA) as well as power
> > savings (S5 mode).
>
> I agree with the scalability. Concerning the multi-switch case I dont
> know about the possible issues (I am quite new to working with DSA).
> So lets wait for Vladimirs solution.

I'm back for now and was able to spend a bit more time and understand
what is happening.

So first things first: why does DSA call dev_hold long-term on the
master, and where from?

Answer: it does so since commit 2f1e8ea726e9 ("net: dsa: link interfaces
with the DSA master to get rid of lockdep warnings"), see this call path:

dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold

Ok, so since DSA holds a reference to the master interface, it is
natural that unregister_netdevice() will not finish, and it will hang
the system.

Question 2: why does bcmgenet need to unregister the net device on
shutdown?

See Florian's answer, it doesn't, strictly speaking, it just needs to
turn off the DMA and some clocks.

Question 3: can we revert commit 2f1e8ea726e9?

Answer: not so easily, we are looking at >10 commits to revert, and find
other solutions to some problems. We have built in the meantime on top
of the fact that there is an upper/lower relationship between DSA user
ports and the DSA master.

Question 4: how do other stacked interfaces deal with this?

Answer: as I said in the commit message of 2f1e8ea726e9, DSA is not
VLAN, DSA has unique challenges of its own, like a tree of struct
devices to manage, with their own lifetime. So what other drivers do is
not really relevant. Anyway, to entertain the question: VLAN watches the
NETDEV_UNREGISTER event emitted on the netdev notifier chain for its
real_dev, and effectively unregisters itself. Now this is exactly why it
is irrelevant, we can watch for NETDEV_UNREGISTER on the DSA master, but
then what? There is nothing sensible to do. Consider that in the master
unbind case (not shutdown), both the NETDEV_UNREGISTER code path will
execute, and the unbind of the DSA switch itself, due to that device
link. But let's say we delete the device link and leave only the
NETDEV_UNREGISTER code path to do something. What?
device_release_driver(ds->dev), most probably. That would effectively
force the DSA unbind path. But surprise: the DSA unbind path takes the
rtnl_mutex from quite a couple of places, and we are already under the
rtnl_lock (held by the netdev notifier chain). So, unless we schedule
the DSA device driver detach, there is an impending deadlock.
Ok, let's entertain even that: detach the DSA driver in a scheduled work
item, with the rtnl_lock not held. First off, we will trigger again the
WARN_ON solved by commit 2f1e8ea726e9 (because the unregistering of the
DSA master has "completed", but it still has an upper interface - us),
and secondly, the unregister_netdev function will have already deleted
stuff belonging to the DSA master, namely its sysfs entries. But DSA
also touches the master's sysfs, namely the "tagging" file. So NULL
pointer dereference on the master's sysfs.
So very simply put, DSA cannot unbind itself from the switch device when
the master net device unregisters. The best case scenario would be for
DSA to unbind _before_ the net device even unregisters. That was the
whole point of my attempt with the device links, to ensure shutdown
_ordering_.

Question 5: can the device core actually be patched to call
device_links_unbind_consumers() from device_shutdown()? This would
actually simplify DSA's options, and make the device links live up to
their documented expectations.

Answer: yes and no, technically it can, but it is an invasive change
which will certainly introduce regressions. See the answer to question 2
for an example. Technically .shutdown exists so that drivers can do
something lightweight to quiesce the hardware, without really caring too
much about data structure integrity (hey, the kernel is going to die
soon anyway). But some drivers, like bcmgenet, do the same thing in
.resume and .shutdown, which blurs the lines quite a lot. If the device
links were to start calling .remove at shutdown time, potentially after
.shutdown was already called, bcmgenet would effectively unregister its
net device twice. Yikes.

Question 6: How about a patch on the device core that is more lightweight?
Wouldn't it be sensible for device_shutdown() to just call ->remove if
the device's bus has no ->shutdown, and the device's driver doesn't have
a ->shutdown either?

Answer: This would sometimes work, the vast majority of DSA switch
drivers, and Ethernet controllers (in this case used as DSA masters) do
not have a .shutdown method implemented. But their bus does: PCI does,
SPI controllers do, most of the time. So it would work for limited
scenarios, but would be ineffective in the general sense.

Question 7: I said that .shutdown, as opposed to .remove, doesn't really
care so much about the integrity of data structures. So how far should
we really go to fix this issue? Should we even bother to unbind the
whole DSA tree, when the sole problem is that we are the DSA master's
upper, and that is keeping a reference on it?

Answer: Well, any solution that does unnecessary data structure teardown
only delays the reboot for nothing. Lino's patch just bluntly calls
dsa_tree_teardown() from the switch .shutdown method, and this leaks
memory, namely dst->ports. But does this really matter? Nope, so let's
extrapolate. In this case, IMO, the simplest possible solution would be
to patch bcmgenet to not unregister the net device. Then treat every
other DSA master driver in the same way as they come, one by one.
Do you need to unregister_netdevice() at shutdown? No. Then don't.
Is it nice? Probably not, but I'm not seeing alternatives.

Also, unless I'm missing something, Lino probably still sees the WARN_ON
in bcmgenet's unregister_netdevice() about eth0 getting unregistered
while having an upper interface. If not, it's by sheer luck that the DSA
switch's ->shutdown gets called before bcmgenet's ->shutdown. But for
this reason, it isn't a great solution either. If the device links can't
guarantee us some sort of shutdown ordering (what we ideally want, as
mentioned, is for the DSA switch driver to get _unbound_ (->remove)
before the DSA master gets unbound or shut down).