Re: [PATCH net] r8169: correct the reset timing of RTL8125 for link-change event

From: Heiner Kallweit
Date: Tue Sep 10 2024 - 13:07:37 EST


On 09.09.2024 07:25, En-Wei WU wrote:
> Hi Heiner,
>
> Thank you for the quick response.
>
> On Sat, 7 Sept 2024 at 05:17, Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>>
>> On 06.09.2024 10:35, En-Wei Wu wrote:
>>> The commit 621735f59064 ("r8169: fix rare issue with broken rx after
>>> link-down on RTL8125") set a reset work for RTL8125 in
>>> r8169_phylink_handler() to avoid the MAC from locking up, this
>>> makes the connection broken after unplugging then re-plugging the
>>> Ethernet cable.
>>>
>>> This is because the commit mistakenly put the reset work in the
>>> link-down path rather than the link-up path (The commit message says
>>> it should be put in the link-up path).
>>>
>> That's not what the commit message is saying. It says vendor driver
>> r8125 does it in the link-up path.
>> I moved it intentionally to the link-down path, because traffic may
>> be flowing already after link-up.
>>
>>> Moving the reset work from the link-down path to the link-up path fixes
>>> the issue. Also, remove the unnecessary enum member.
>>>
>> The user who reported the issue at that time confirmed that the original
>> change fixed the issue for him.
>> Can you explain, from the NICs perspective, what exactly the difference
>> is when doing the reset after link-up?
>> Including an explanation how the original change suppresses the link-up
>> interrupt. And why that's not the case when doing the reset after link-up.
>
> The host-plug test under original change does have the link-up
> interrupt and r8169_phylink_handler() called. There is not much clue
> why calling reset in link-down path doesn't work but in link-up does.
>
> After several new tests, I found that with the original change, the
> link won't break if I unplug and then plug the cable within about 3
> seconds. On the other hand, the connections always break if I re-plug
> the cable after a few seconds.
>
Interesting finding. 3 seconds sounds like it's unrelated to runtime pm,
because this has a 10s delay before the chip is transitioned to D3hot.
It makes more the impression that after 3s of link-down the chip (PHY?)
transitions to a mode where it doesn't wake up after re-plugging the cable.

Just a wild guess: It may be some feature like ALDPS (advanced link-down
power saving). Depending on the link partner this may result in not waking
up again, namely if the link partner uses ALDPS too.
What is the link partner in your case? If you put a simple switch in between,
does this help?

In the RTL8211F datasheet I found the following:

Link Down Power Saving Mode.
1: Reflects local device entered Link Down Power Saving Mode,
i.e., cable not plugged in (reflected after 3 sec)
0: With cable plugged in

This is a 1Gbps PHY, but Realtek may use the same ALDPS mechanism with the
integrated PHY of RTL8125. The 3s delay described there perfectly matches
your finding.

> With this new patch (reset in link-up path), both of the tests work
> without any error.
>
>>
>> I simply want to be convinced enough that your change doesn't break
>> behavior for other users.
>>
>>> Fixes: 621735f59064 ("r8169: fix rare issue with broken rx after link-down on RTL8125")
>>> Signed-off-by: En-Wei Wu <en-wei.wu@xxxxxxxxxxxxx>
>>> ---
>>> drivers/net/ethernet/realtek/r8169_main.c | 11 +++++------
>>> 1 file changed, 5 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>> index 3507c2e28110..632e661fc74b 100644
>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>> @@ -590,7 +590,6 @@ struct rtl8169_tc_offsets {
>>> enum rtl_flag {
>>> RTL_FLAG_TASK_ENABLED = 0,
>>> RTL_FLAG_TASK_RESET_PENDING,
>>> - RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE,
>>> RTL_FLAG_TASK_TX_TIMEOUT,
>>> RTL_FLAG_MAX
>>> };
>>> @@ -4698,8 +4697,6 @@ static void rtl_task(struct work_struct *work)
>>> reset:
>>> rtl_reset_work(tp);
>>> netif_wake_queue(tp->dev);
>>> - } else if (test_and_clear_bit(RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE, tp->wk.flags)) {
>>> - rtl_reset_work(tp);
>>> }
>>> out_unlock:
>>> rtnl_unlock();
>>> @@ -4729,11 +4726,13 @@ static void r8169_phylink_handler(struct net_device *ndev)
>>> if (netif_carrier_ok(ndev)) {
>>> rtl_link_chg_patch(tp);
>>> pm_request_resume(d);
>>> - netif_wake_queue(tp->dev);
>>> - } else {
>>> +
>>> /* In few cases rx is broken after link-down otherwise */
>>> if (rtl_is_8125(tp))
>>> - rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE);
>>> + rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>> + else
>>> + netif_wake_queue(tp->dev);
>>
>> This call to netif_wake_queue() isn't needed any longer, it was introduced with
>> the original change only.
>>
>>> + } else {
>>> pm_runtime_idle(d);
>>> }
>>>
>>
>
> CC. Martin Kjær Jørgensen <me@xxxxxxxx>, could you kindly test if
> this new patch works on your environment? Thanks!
>
> En-Wei,
> Best regards.