Re: A weird problem of Realtek r8168 after resume from S3

From: Heiner Kallweit
Date: Mon Dec 17 2018 - 14:08:25 EST


On 17.12.2018 14:25, Chris Chiu wrote:
> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>>
>> On 14.12.2018 04:33, Chris Chiu wrote:
>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@xxxxxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>> We have an Acer laptop which has a problem with ethernet networking after
>>>> resuming from S3. The ethernet controller is the popular Realtek r8168.
>>>> lspci shows the following.
>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>
>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>
> [ 22.362774] r8169 0000:02:00.1 (unnamed net_device) (uninitialized): mac_version = 0x2b
> [ 22.365580] libphy: r8169: probed
> [ 22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83, XID 5c800800, IRQ 38
> [ 22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
>
Thanks for the info.

>>>> The problem is that the ethernet is not accessible after resume. Pinging via
>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>> the interesting part is that when I run tcpdump to monitor the problematic
>>>> ethernet interface, the network comes back to life. But it's dead again
>>>> after I stop tcpdump.
>>>> One more thing: if I ping the problematic machine from another host, it has
>>>> the same effect as running tcpdump. Maybe it's about the register settings
>>>> for the RX path?
>>>>
>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>> to find out whether there's a difference.
>>
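To make the before/after comparison less tedious, something like the
following Python sketch could help. It assumes the dumps were captured as
plain hex (e.g. "ethtool -d eth0 hex on > before.txt") and that each data
line looks like "0x0010: 0f 00 ..."; if your ethtool prints a different
layout, the regular expression needs adjusting. The function names are
made up for illustration and are not part of any existing tool.

```python
import re

# Assumed line layout of a hex register dump: "0x0010:  0f 00 1a ..."
LINE_RE = re.compile(r"^\s*0x([0-9a-fA-F]+):\s*((?:[0-9a-fA-F]{2}\s*)+)$")


def parse_hex_dump(text):
    """Return {byte offset: value} for every data line in a hex dump."""
    regs = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip headers, separators and blank lines
        base = int(m.group(1), 16)
        for i, byte in enumerate(re.findall(r"[0-9a-fA-F]{2}", m.group(2))):
            regs[base + i] = int(byte, 16)
    return regs


def diff_registers(before, after):
    """List (offset, old, new) for every byte that differs between dumps."""
    a, b = parse_hex_dump(before), parse_hex_dump(after)
    return [
        (off, a.get(off), b.get(off))
        for off in sorted(set(a) | set(b))
        if a.get(off) != b.get(off)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python3 regdiff.py before.txt after.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        for off, old, new in diff_registers(f1.read(), f2.read()):
            fmt = lambda v: "--" if v is None else f"{v:02x}"
            print(f"0x{off:04x}: {fmt(old)} -> {fmt(new)}")
```

Some registers (counters, descriptor pointers) are expected to change
between two dumps, so a difference is only a starting point for looking
at the datasheet, not proof of a problem.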
>
> Actually, I just found that I pointed you in the wrong direction. S3 suspend
> does help to reproduce the issue, but it's not necessary. All I need to do is
> ping for around 5 minutes and the network connection fails. I also found one
> interesting thing: disabling the MSI-X interrupt, as in commit
> [d49c88d7677ba737e9d2759a87db0402d5ab2607], fixes this problem, although I
> don't understand the root cause. Anything I can do to help?
>
This is indeed very, very weird. You say switching from MSI-X to MSI fixes
the issue, but pinging the machine from outside also brings the network back.
These two actions affect totally different areas.

The commit you mention was a workaround in the driver for a related issue;
the root cause was an MSI-X-related issue with certain Intel chipsets deep
in the PCI core. After this was fixed we removed the workaround again.
This shouldn't be related to your issue.

For now it's hard to say whether this is:
- a driver issue
- a hardware issue in the RTL8411
- an issue with the chipset on your mainboard

According to your description it doesn't take a special scenario to trigger
the issue, so most likely other users of Acer notebooks with RTL8411 are
affected as well (after a brief check, this should include at least the
Aspire F15, V15, V7). Therefore I wonder why there aren't more reports.

This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
So you could test this revision and the one before.

Ultimately, if the issue really turns out to be caused by a side effect of
using MSI-X, then the question is whether we need to disable MSI-X for
RTL8411 in general or just for RTL8411 with a certain subsystem ID.
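
To collect the data for that decision, a small script along these lines
could be run on affected and unaffected machines. It is only a sketch: it
relies on the standard PCI sysfs attributes (the msi_irqs directory, whose
files contain "msi" or "msix", plus subsystem_vendor and subsystem_device),
and the function names are invented here for illustration.

```python
from pathlib import Path


def irq_modes(dev_dir):
    """Map each allocated IRQ of a PCI device to its mode ("msi"/"msix").

    Returns an empty dict if the device has no msi_irqs directory,
    which is the case when it uses a legacy interrupt.
    """
    msi_dir = Path(dev_dir) / "msi_irqs"
    if not msi_dir.is_dir():
        return {}
    return {
        int(p.name): p.read_text().strip()
        for p in msi_dir.iterdir()
        if p.name.isdigit()
    }


def subsystem_id(dev_dir):
    """Return (subsystem vendor, subsystem device) as hex strings."""
    d = Path(dev_dir)
    return (
        (d / "subsystem_vendor").read_text().strip(),
        (d / "subsystem_device").read_text().strip(),
    )


if __name__ == "__main__":
    import sys

    # Usage: python3 msicheck.py /sys/bus/pci/devices/0000:02:00.1
    dev = sys.argv[1]
    print("subsystem id:", ":".join(subsystem_id(dev)))
    print("irq modes:   ", irq_modes(dev) or "legacy interrupt")
```

If the failing systems all report "msix" and share a subsystem ID that the
working ones don't, that would argue for a quirk limited to this ID.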

>>>> I tried the latest 4.20 rc version but the problem is still there. I also
>>>> tried some hw_reset or init thing in the resume path, but it had no effect.
>>>> Any suggestions for this?
>>>> Thanks
>>>>
>> Did previous kernel versions work? If it's a regression, a bisect would be
>> appreciated, because with the chip versions I've got I can't reproduce the issue.
>>
>>>> Chris
>>>
>>> Gentle ping. Any additional information required?
>>>
>>> Chris
>>>
>> Heiner
>