Re: [RFC] [PATCH net-next v4] [PATCH 2/2] r8169: Implement dynamic ASPM mechanism

From: Kai-Heng Feng
Date: Mon Sep 06 2021 - 11:11:13 EST


On Sat, Sep 4, 2021 at 4:00 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>
> On 03.09.2021 17:56, Kai-Heng Feng wrote:
> > On Tue, Aug 31, 2021 at 2:09 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >>
> >> On Sat, Aug 28, 2021 at 01:14:52AM +0800, Kai-Heng Feng wrote:
> >>> r8169 NICs on some platforms have abysmal speed when ASPM is enabled.
> >>> Same issue can be observed with older vendor drivers.
> >>>
> >>> The issue is however solved by the latest vendor driver. There's a new
> >>> mechanism, which disables r8169's internal ASPM when the NIC traffic has
> >>> more than 10 packets, and vice versa. The likely reason is that the
> >>> buffer on the chip is too small for its ASPM exit latency.
> >>
> >> This sounds like good speculation, but of course, it would be better
> >> to have the supporting data.
> >>
> >> You say above that this problem affects r8169 on "some platforms." I
> >> infer that ASPM works fine on other platforms. It would be extremely
> >> interesting to have some data on both classes, e.g., "lspci -vv"
> >> output for the entire system.
> >
> > lspci data collected from working and non-working systems can be found here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=214307
> >
> >>
> >> If r8169 ASPM works well on some systems, we *should* be able to make
> >> it work well on *all* systems, because the device can't tell what
> >> system it's in. All the device can see are the latencies for entry
> >> and exit for link states.
> >
> > That's definitely better if we can make r8169 ASPM work for all platforms.
> >
> >>
> >> IIUC this patch makes the driver wake up every 1000ms. If the NIC has
> >> sent or received more than 10 packets in the last 1000ms, it disables
> >> ASPM; otherwise it enables ASPM.
> >
> > Yes, that's correct.
> >
> >>
> >> I asked these same questions earlier, but nothing changed, so I won't
> >> raise them again if you don't think they're pertinent. Some patch
> >> splitting comments below.
> >
> > Sorry about that. The lspci data is attached.
> >
>
> Thanks for the additional details. I see that both systems have the L1
> sub-states active. Do you also face the issue if L1 is enabled but
> L1.1 and L1.2 are not? Setting the ASPM policy from powersupersave
> to powersave should be sufficient to disable them.
> I have a test system Asus PRIME H310I-PLUS, BIOS 2603 10/21/2019 with
> the same RTL8168h chip version. With L1 active and sub-states inactive
> everything is fine. With the sub-states activated I get a few missed RX
> errors when running iperf3.

Once L1.1 and L1.2 are disabled, TX can reach 710 Mbps and RX can reach
941 Mbps. So yes, it seems to be the same issue.
With dynamic ASPM, TX can reach 750 Mbps while ASPM L1.1 and L1.2 are enabled.
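
For reference, the dynamic ASPM heuristic as summarized earlier in this
thread (wake up every 1000 ms, disable ASPM if more than 10 packets were
sent or received in that window, otherwise enable it) can be sketched in
plain C roughly as below. This is only a userspace illustration of the
decision logic, not the patch itself: struct daspm_state, daspm_poll()
and the macro names are hypothetical, and the real driver would toggle
the NIC's internal ASPM setting instead of returning a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the discussion: poll once per 1000 ms and treat more
 * than 10 packets in that window as "busy". */
#define DASPM_INTERVAL_MS   1000
#define DASPM_PKT_THRESHOLD 10

/* Hypothetical state; these names are illustrative, not from the patch. */
struct daspm_state {
	uint64_t last_pkts;   /* cumulative TX+RX packets at the previous poll */
	bool aspm_enabled;    /* decision made at the previous poll */
};

/* Intended to be called once per DASPM_INTERVAL_MS with the current
 * cumulative TX+RX packet count. Returns true if ASPM should be
 * enabled for the next window, false if it should be disabled. */
bool daspm_poll(struct daspm_state *st, uint64_t total_pkts)
{
	uint64_t delta;

	assert(st != NULL);

	delta = total_pkts - st->last_pkts;
	st->last_pkts = total_pkts;

	/* More than the threshold means the link is busy: keep ASPM off. */
	st->aspm_enabled = (delta <= DASPM_PKT_THRESHOLD);
	return st->aspm_enabled;
}
```

Note that with this kind of scheme the decision always lags traffic by up
to one polling interval, so the first packets of a burst still see ASPM
enabled.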

> One difference between your good and bad logs is the following.
> (My test system shows the same LTR value as your bad system.)
>
> Bad:
> Capabilities: [170 v1] Latency Tolerance Reporting
> Max snoop latency: 3145728ns
> Max no snoop latency: 3145728ns
>
> Good:
> Capabilities: [170 v1] Latency Tolerance Reporting
> Max snoop latency: 1048576ns
> Max no snoop latency: 1048576ns
>
> I have to admit that I'm not familiar with LTR and don't know whether
> this difference could contribute to the differing behavior.

I am also unsure what role LTR plays here, so I tried changing the LTR
value to 1048576ns. It yields the same result: TX and RX remain very
slow.
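
As a side note on the numbers above: both are multiples of 2^20 ns
(3145728 = 3 * 1048576). The PCIe LTR capability stores each max latency
as a 10-bit value plus a 3-bit scale, and the latency is
value * 2^(5 * scale) ns. A minimal decoding sketch follows; the function
name ltr_decode_ns() is made up for illustration, and the raw register
values in the comments are assumptions, since the lspci output only shows
the decoded nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 16-bit LTR Max Snoop / Max No-Snoop Latency register.
 * Bits 9:0 hold the value, bits 12:10 hold the scale; the latency is
 * value * 2^(5 * scale) ns. Scales 6 and 7 are not permitted by the
 * spec, so this sketch returns 0 for them. */
uint64_t ltr_decode_ns(uint16_t raw)
{
	uint64_t value = raw & 0x3ff;
	unsigned int scale = (raw >> 10) & 0x7;

	if (scale > 5)
		return 0;

	return value << (5 * scale);
}

/* Assumed raw encodings that would decode to the values in the logs:
 *   0x1003 (value 3, scale 4) -> 3145728 ns
 *   0x1001 (value 1, scale 4) -> 1048576 ns
 * Other encodings can give the same latency, e.g. value 96 with scale 3. */
```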

Kai-Heng