Re: A weird problem of Realtek r8168 after resume from S3

From: Chris Chiu
Date: Wed Dec 19 2018 - 10:32:48 EST


On Wed, Dec 19, 2018 at 4:28 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>
> On 18.12.2018 14:25, Chris Chiu wrote:
> > On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
> >>
> >> On 17.12.2018 14:25, Chris Chiu wrote:
> >>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
> >>>>
> >>>> On 14.12.2018 04:33, Chris Chiu wrote:
> >>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@xxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>> We have an Acer laptop which has a problem with ethernet networking after
> >>>>>> resuming from S3. The ethernet controller is the popular Realtek r8168.
> >>>>>> lspci shows the following:
> >>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>>>>>
> >>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
> >>>>
> >>> [ 22.362774] r8169 0000:02:00.1 (unnamed net_device) (uninitialized): mac_version = 0x2b
> >>> [ 22.365580] libphy: r8169: probed
> >>> [ 22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83, XID 5c800800, IRQ 38
> >>> [ 22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> >>>
> >> Thanks for the info.
> >>
> >>>>>> The problem is that the ethernet is not accessible after resume. Pinging via
> >>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
> >>>>>> the interesting part is that when I run tcpdump to monitor the problematic
> >>>>>> ethernet interface, networking comes back to life. But it's dead again after
> >>>>>> I stop tcpdump.
> >>>>>> One more thing: if I ping the problematic machine from another machine, it
> >>>>>> has the same effect as tcpdump above. Maybe it's about the register settings
> >>>>>> for the RX path?
> >>>>>>
> >>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
> >>>> to find out whether there's a difference.
> >>>>
> >>>
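The register-dump comparison suggested above could be scripted roughly as follows. This is a minimal sketch, not something posted in the thread: it assumes the two dumps were saved in binary form with `ethtool -d eth0 raw on > before.bin` (and likewise after resume). The function names and file names are hypothetical.

```python
import sys
from typing import List, Tuple


def diff_register_dumps(before: bytes, after: bytes) -> List[Tuple[int, int, int]]:
    """Return (offset, before_byte, after_byte) for every byte that differs."""
    if len(before) != len(after):
        raise ValueError("register dumps differ in length; not comparable")
    return [
        (offset, b, a)
        for offset, (b, a) in enumerate(zip(before, after))
        if b != a
    ]


def format_diff(diffs: List[Tuple[int, int, int]]) -> List[str]:
    """Render each difference as 'offset: old -> new' in hex."""
    return [f"0x{offset:04x}: {b:02x} -> {a:02x}" for offset, b, a in diffs]


# Command-line use (hypothetical file names): script.py before.bin after.bin
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as f_before, open(sys.argv[2], "rb") as f_after:
        for line in format_diff(diff_register_dumps(f_before.read(), f_after.read())):
            print(line)
```

Registers that change on every read (counters, interrupt status) will show up as differences too, so the output is only a starting point for narrowing down which offsets matter.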
> >>> Actually, I just found that I pointed you in the wrong direction. The S3
> >>> suspend does help to reproduce the problem, but it's not necessary. All I
> >>> need to do is ping for around 5 minutes and the network connection fails.
> >>> I also found one interesting thing: disabling the MSI-X interrupt, as in
> >>> commit [d49c88d7677ba737e9d2759a87db0402d5ab2607], fixes this problem,
> >>> although I don't understand the root cause. Anything I can do to help?
> >>>
> >> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
> >> the issue, but also pinging the machine from outside brings back the network.
> >> Both actions affect totally different corners.
> >>
> >> The commit and related issue you mention was a workaround in the driver,
> >> the root cause was a MSI-X-related issue with certain Intel chipsets deep
> >> in the PCI core. After this was fixed we removed the workaround again.
> >> This shouldn't be related to your issue.
> >>
> >> Hard to say for now is whether the issue is:
> >> - a driver issue
> >> - a hardware issue in the RTL8411
> >> - an issue with the chipset on your mainboard
> >>
> >> According to your description it doesn't take a special scenario to trigger
> >> the issue, so most likely also other users of Acer notebooks with RTL8411
> >> should be affected (after briefly checking this should be at least Aspire
> >> F15, V15, V7). Therefore I wonder why there aren't more reports.
> >>
> >> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
> >> So you could test this revision and the one before.
> >>
> >> Eventually, if the issue really should be caused by a side effect of using
> >> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
> >> in general or just for RTL8411 and a certain subsystem id.
> >>
> >
> > I tried the kernel with HEAD at 6c6aa15fdea5 ("r8169: improve interrupt
> > handling") and the problem is still there. When I revert to the previous
> > revision, the problem goes away. So I think it's pretty much a side effect
> > of MSI-X. However, as you mentioned that you didn't hit this problem, I'll
> > ask the vendor to verify whether this problem also happens on other machines
> > with the same chip. Then we can decide whether to disable MSI-X for a
> > specific mac version or just for a certain subsystem id.
> >
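When testing the two revisions, it may help to confirm which interrupt mode the NIC actually ended up in. One way is to look at the capability lines printed by `lspci -vv -s 02:00.1` (reading them usually requires root). The sketch below only parses that text; the sample string is illustrative and not taken from the reporter's machine, and `interrupt_mode` is a hypothetical helper name.

```python
import re


def interrupt_mode(lspci_vv: str) -> str:
    """Classify the active interrupt mode from `lspci -vv` capability lines."""
    # "MSI: Enable" cannot match the MSI-X line, because the colon must
    # directly follow "MSI".
    msix = re.search(r"MSI-X: Enable([+-])", lspci_vv)
    msi = re.search(r"MSI: Enable([+-])", lspci_vv)
    if msix and msix.group(1) == "+":
        return "MSI-X"
    if msi and msi.group(1) == "+":
        return "MSI"
    return "INTx or unknown"


# Illustrative capability lines in the usual lspci -vv layout (assumed values).
SAMPLE = (
    "\tCapabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+\n"
    "\tCapabilities: [b0] MSI-X: Enable+ Count=4 Masked-\n"
)

if __name__ == "__main__":
    print(interrupt_mode(SAMPLE))
```

On the revision before 6c6aa15fdea5 one would expect this to report MSI rather than MSI-X, which would match the observation that the problem goes away there.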
> >>>>>> I tried the latest 4.20 rc version but the problem is still there. I also
> >>>>>> tried some hw_reset or init steps in the resume path, but with no effect.
> >>>>>> Any suggestions for this?
> >>>>>> Thanks
> >>>>>>
> >>>> Did previous kernel versions work? If it's a regression, a bisect would be
> >>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
> >>>>
> >>>>>> Chris
> >>>>>
> >>>>> Gentle ping. Any additional information required?
> >>>>>
> >>>>> Chris
> >>>>>
> >>>> Heiner
> >>>
> >>
> >
>
> As an additional note:
> I found that the rtsx_pci driver doesn't support MSI-X currently.
> The following patch adds MSI-X support (it's compile-tested only
> because I don't have a system with RTL8411).
> Would be interesting to see whether it makes a difference if both
> components on this combo chip use MSI-X.
>
> ---
> drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
> include/linux/rtsx_pci.h | 1 -
> 2 files changed, 16 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
> index da445223f..d1349c248 100644
> --- a/drivers/misc/cardreader/rtsx_pcr.c
> +++ b/drivers/misc/cardreader/rtsx_pcr.c
> @@ -35,10 +35,6 @@
>
> #include "rtsx_pcr.h"
>
> -static bool msi_en = true;
> -module_param(msi_en, bool, S_IRUGO | S_IWUSR);
> -MODULE_PARM_DESC(msi_en, "Enable MSI");
> -
> static DEFINE_IDR(rtsx_pci_idr);
> static DEFINE_SPINLOCK(rtsx_pci_lock);
>
> @@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
>
> static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
> {
> - pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
> - __func__, pcr->msi_en, pcr->pci->irq);
> + int ret;
>
> - if (request_irq(pcr->pci->irq, rtsx_pci_isr,
> - pcr->msi_en ? 0 : IRQF_SHARED,
> - DRV_NAME_RTSX_PCI, pcr)) {
> - dev_err(&(pcr->pci->dev),
> - "rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
> - pcr->pci->irq);
> - return -1;
> - }
> + ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
> + if (ret < 0)
> + goto err;
>
> - pcr->irq = pcr->pci->irq;
> - pci_intx(pcr->pci, !pcr->msi_en);
> + ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
> + DRV_NAME_RTSX_PCI);
> + if (ret)
> + goto err;
>
> return 0;
> +err:
> + pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
> + return ret;
> }
>
> static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
> @@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
> INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
> INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
>
> - pcr->msi_en = msi_en;
> - if (pcr->msi_en) {
> - ret = pci_enable_msi(pcidev);
> - if (ret)
> - pcr->msi_en = false;
> - }
> -
> ret = rtsx_pci_acquire_irq(pcr);
> if (ret < 0)
> - goto disable_msi;
> + goto free_dma;
>
> pci_set_master(pcidev);
> - synchronize_irq(pcr->irq);
>
> ret = rtsx_pci_init_chip(pcr);
> if (ret < 0)
> @@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
> return 0;
>
> disable_irq:
> - free_irq(pcr->irq, (void *)pcr);
> -disable_msi:
> - if (pcr->msi_en)
> - pci_disable_msi(pcr->pci);
> + pci_free_irq(pcr->pci, 0, pcr);
> +free_dma:
> dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
> pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
> unmap:
> @@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
>
> dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
> pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
> - free_irq(pcr->irq, (void *)pcr);
> - if (pcr->msi_en)
> - pci_disable_msi(pcr->pci);
> + pci_free_irq(pcr->pci, 0, pcr);
> iounmap(pcr->remap_addr);
>
> pci_release_regions(pcidev);
> @@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
> rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>
> pci_disable_device(pcidev);
> - free_irq(pcr->irq, (void *)pcr);
> - if (pcr->msi_en)
> - pci_disable_msi(pcr->pci);
> + pci_free_irq(pcr->pci, 0, pcr);
> }
>
> #else /* CONFIG_PM */
> diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
> index e964bbd03..10abfe7f2 100644
> --- a/include/linux/rtsx_pci.h
> +++ b/include/linux/rtsx_pci.h
> @@ -1190,7 +1190,6 @@ struct rtsx_pcr {
> /* pci resources */
> unsigned long addr;
> void __iomem *remap_addr;
> - int irq;
>
> /* host reserved buffer */
> void *rtsx_resv_buf;
> --
> 2.20.0
>

As mentioned in my last email, rtsx_pci seems to make no difference.
I still tried a kernel with this patch applied, and the problem
persists. I also tried the vendor driver and it works without any
problem. I'd rather find the root cause than apply a workaround.
Any better ideas?

Chris