Re: [PATCH] PCI: r8169: add suspend/resume aspm quirk
From: Bjorn Helgaas
Date: Wed Jul 10 2024 - 17:38:50 EST
On Wed, Jul 10, 2024 at 05:09:08PM +0200, George-Daniel Matei wrote:
> >> Added aspm suspend/resume hooks that run
> >> before and after suspend and resume to change
> >> the ASPM states of the PCI bus in order to allow
> >> the system suspend while trying to prevent card hangs
> >
> > Why is this needed? Is there a r8169 defect we're working around?
> > A BIOS defect? Is there a problem report you can reference here?
>
> We encountered this issue while upgrading from kernel v6.1 to v6.6.
> The system would not suspend with 6.6. We tracked down the problem to
> the NIC of the device, mainly that the following code was removed in
> 6.6:
>
> > else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
> > rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
>
> For the listed devices, ASPM L1 is disabled entirely in 6.6. As for
> the reason, L1 was observed to cause some problems
> (https://bugzilla.kernel.org/show_bug.cgi?id=217814). We use a Raptor
> Lake soc and it won't change residency if the NIC doesn't have L1
> enabled. I saw in 6.1 the following comment:
Can you verify that the problem still exists in a current kernel,
e.g., v6.9?
If this is a regression that's still present in v6.9, we need to
identify the commit that broke it. Maybe it's 90ca51e8c654 ("r8169:
fix ASPM-related issues on a number of systems with NIC version from
RTL8168h")?
> > Chips from RTL8168h partially have issues with L1.2, but seem
> > to work fine with L1 and L1.1.
>
> I was thinking that disabling/enabling L1.1 on the fly before/after
> suspend could help mitigate the risk associated with L1/L1.1 . I know
> that ASPM settings are exposed in sysfs and that this could be done
> from outside the kernel, that was my first approach, but it was
> suggested to me that this kind of workaround would be better suited
> for quirks. I did around 1000 suspend/resume cycles of 16-30 seconds
> each (correcting the resume dev->bus->self being configured twice
> mistake) and did not notice any problems. What do you think, is this a
> good approach ... ?
Whatever the problem is, it definitely should be fixed in the kernel,
and Ilpo is right that it *should* be done in the PCI core ASPM
support (aspm.c) or at least with interfaces it supplies.
Generally speaking, drivers should not need to touch ASPM at all
except to work around hardware defects in their device, but r8169 has
a long history of weird ASPM stuff. I dunno if that stuff is related
to hardware defects in the r8169 devices or if it is workarounds for
past or current defects in aspm.c.
> > This doesn't restore the state as it existed before suspend. Does
> > this rely on other parts of restore to do that?
>
> It operates on the assumption that after driver initialization
> PCI_EXP_LNKCTL_ASPMC is 0 and that there are no states enabled in
> CTL1. I did a lspci -vvv dump on the affected devices before and after
> the quirks ran and saw no difference. This could be improved.
Yep, we can't assume any of that because the PCI core owns ASPM
config, not the driver itself.
> > What's the root cause of the issue?
> > A silicon bug on the host side?
>
> I think it's the ASPM implementation of the soc.
As Heiner pointed out, if it's a SoC defect, it would potentially
affect all devices and a workaround would have to cover them all.
Side note: oops, quoting error below, see note about top-posting here:
https://people.kernel.org/tglx/notes-about-netiquette
> On Tue, Jul 9, 2024 at 12:15 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
> >
> > On 08.07.2024 19:23, Bjorn Helgaas wrote:
> > > [+cc r8169 folks]
> > >
> > > On Mon, Jul 08, 2024 at 03:38:15PM +0000, George-Daniel Matei wrote:
> > >> Added aspm suspend/resume hooks that run
> > >> before and after suspend and resume to change
> > >> the ASPM states of the PCI bus in order to allow
> > >> the system suspend while trying to prevent card hangs
> > >
> > > Why is this needed? Is there a r8169 defect we're working around?
> > > A BIOS defect? Is there a problem report you can reference here?
> ...