Re: [PATCH] Revert "net: linkwatch: add check for netdevice being present to linkwatch_do_dev"

From: Heiner Kallweit
Date: Wed Sep 23 2020 - 15:59:12 EST


On 23.09.2020 20:35, Saeed Mahameed wrote:
> On Wed, 2020-09-23 at 13:49 +0200, Heiner Kallweit wrote:
>> On 18.09.2020 19:58, Saeed Mahameed wrote:
>>> On Tue, 2020-09-01 at 17:02 +0200, Geert Uytterhoeven wrote:
>>>> This reverts commit 124eee3f6955f7aa19b9e6ff5c9b6d37cb3d1e2c.
>>>>
>>>> Inami-san reported that this commit breaks bridge support in a
>>>> Xen
>>>> environment, and that reverting it fixes this.
>>>>
>>>> During system resume, bridge ports are no longer enabled, as that
>>>> relies
>>>> on the receipt of the NETDEV_CHANGE notification. This
>>>> notification
>>>> is
>>>> not sent, as netdev_state_change() is no longer called.
>>>>
>>>> Note that the condition this commit intended to fix never existed
>>>> upstream, as the patch triggering it and referenced in the commit
>>>> was
>>>> never applied upstream. Hence I can confirm s2ram on
>>>> r8a73a4/ape6evm
>>>> and sh73a0/kzm9g works fine before/after this revert.
>>>>
>>>> Reported-by Gaku Inami <gaku.inami.xh@xxxxxxxxxxx>
>>>> Signed-off-by: Geert Uytterhoeven <geert+renesas@xxxxxxxxx>
>>>> ---
>>>> net/core/link_watch.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/core/link_watch.c b/net/core/link_watch.c
>>>> index 75431ca9300fb9c4..c24574493ecf95e6 100644
>>>> --- a/net/core/link_watch.c
>>>> +++ b/net/core/link_watch.c
>>>> @@ -158,7 +158,7 @@ static void linkwatch_do_dev(struct
>>>> net_device
>>>> *dev)
>>>> clear_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state);
>>>>
>>>> rfc2863_policy(dev);
>>>> - if (dev->flags & IFF_UP && netif_device_present(dev)) {
>>>> + if (dev->flags & IFF_UP) {
>>>
>>> So with your issue the devices is both IFF_UP and !present ? how so
>>> ?
>>> I think you should look into that.
>>>
>>> I am ok with removing the "dev present" check from here just
>>> because we
>>> shouldn't be expecting IFF_UP && !present .. such thing must be a
>>> bug
>>> somewhere else.
>>>
>>>> if (netif_carrier_ok(dev))
>>>> dev_activate(dev);
>>>> else
>>
>> In __dev_close_many() we call ndo_stop() whilst IFF_UP is still set.
>> ndo_stop() may detach the device and bring down the PHY, resulting in
>> an
>
> Why would a driver detach the device on ndo_stop() ?
> seems like this is the bug you need to be chasing ..
> which driver is doing this ?
>
Some drivers set the device to PCI D3hot at the end of ndo_stop()
to save power (using e.g. Runtime PM). Marking the device as detached
makes clear to to the net core that the device isn't accessible any
longer.

>> async link change event that calls dev_get_stats(). The latter call
>> may
>> have a problem if the device is detached. In a first place I'd
>> consider
>> such a case a network driver bug (ndo_get_stats/64 should check for
>> device presence if depending on it).
>
> Device drivers should avoid presence check as much as possible
> especially in ndo, this check must be performed by the stack.
>
That's a question I also stumbled across. For the ethtool ops
dev_ethtool() checks whether device is present.
But for ndo that's not always the case, e.g. dev_get_stats()
doesn't check for device presence before calling ndo_get_stats()
or ndo_get_stats64().
To a certain extent I can understand this behavior, because drivers
may just use internal data structures in ndo ops instead of accessing
the device.

>> The additional check in linkwatch_do_dev() was meant to protect from
>> such
>> driver issues.
>