Re: [Intel-wired-lan] [PATCH] igb: Use a sperate mutex insead of rtnl_lock()

From: Kai-Heng Feng
Date: Thu Mar 26 2020 - 13:16:51 EST


Hi Alexander,

> On Mar 27, 2020, at 00:27, Alexander Duyck <alexander.duyck@xxxxxxxxx> wrote:
>
> On Thu, Mar 26, 2020 at 3:39 AM Kai-Heng Feng
> <kai.heng.feng@xxxxxxxxxxxxx <mailto:kai.heng.feng@xxxxxxxxxxxxx>> wrote:
>>
>> Commit 9474933caf21 ("igb: close/suspend race in netif_device_detach")
>> fixed race condition between close and power management ops by using
>> rtnl_lock().
>>
>> This fix is a preparation for next patch, to prevent a dead lock under
>> rtnl_lock() when calling runtime resume routine.
>>
>> However, we can't use device_lock() in igb_close() because when module
>> is getting removed, the lock is already held for igb_remove(), and
>> igb_close() gets called during unregistering the netdev, hence causing a
>> deadlock. So let's introduce a new mutex so we don't cause a deadlock
>> with driver core or netdev core.
>>
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
>
> So this description doesn't make much sense to me. You describe the
> use of the device_lock() in igb_close() but it isn't used there.

Sorry I forgot to add a revision number.
It was used by previous version and Aaron found a regression when device_lock() is used.

> In addition it seems like you are arbitrarily moving code that was
> wrapped in the rtnl_lock out of it. I'm not entirely sure that is safe
> since there are calls within many of these functions that assume the
> rtnl_lock is held and changing that is likely going to introduce more
> issues.

The reason why rtnl lock needs to be removed is because of the following patch:
https://lore.kernel.org/lkml/20200207101005.4454-2-kai.heng.feng@xxxxxxxxxxxxx/

Ethtools helpers already held rtnl_lock, so to prevent a deadlock, my idea is to use another lock to solve what "igb: close/suspend race in netif_device_detach" originally tried to fix.

>
>
>
>> ---
>> drivers/net/ethernet/intel/igb/igb_main.c | 19 +++++++++++++------
>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
>> index b46bff8fe056..dc7ed5dd216b 100644
>> --- a/drivers/net/ethernet/intel/igb/igb_main.c
>> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
>> @@ -288,6 +288,8 @@ static const struct igb_reg_info igb_reg_info_tbl[] = {
>> {}
>> };
>>
>> +static DEFINE_MUTEX(igb_mutex);
>> +
>> /* igb_regdump - register printout routine */
>> static void igb_regdump(struct e1000_hw *hw, struct igb_reg_info *reginfo)
>> {
>> @@ -4026,9 +4028,14 @@ static int __igb_close(struct net_device *netdev, bool suspending)
>>
>> int igb_close(struct net_device *netdev)
>> {
>> + int err = 0;
>> +
>> + mutex_lock(&igb_mutex);
>> if (netif_device_present(netdev) || netdev->dismantle)
>> - return __igb_close(netdev, false);
>> - return 0;
>> + err = __igb_close(netdev, false);
>> + mutex_unlock(&igb_mutex);
>> +
>> + return err;
>> }
>>
>
> Okay, so I am guessing the problem has something to do with the
> addition of the netdev->dismantle test here and the fact that it is
> bypassing the present check for the hotplug remove case?

Please see the rationale above.

>
> So it looks like nobody ever really reviewed commit 888f22931478
> ("igb: Free IRQs when device is hotplugged"). What I would recommend
> is reverting it and instead we fix the remaining pieces that need to
> be addressed in igb so it more closely matches what we have in e1000e
> after commit a7023819404a ("e1000e: Use rtnl_lock to prevent race
> conditions between net and pci/pm"). From what I can tell the only
> pieces that are really missing is to update igb_io_error_detected so
> that in addition to igb_down it will call igb_free_irq, and then in
> addition we should be wrapping most of the code in that function with
> an rtnl_lock since it is detaching a device and making modifications
> to it.

In addition to that, igb_shutdown() indirectly calls igb_close() when netdev unregistering the device.

My "only scratch the surface" approach is because I don't have a reproducer for commit "igb: close/suspend race in netif_device_detach", and I am afraid of breaking it.

Kai-Heng