Re: [PATCH v2] PCI: vmd: Enable PCI PM's L1 substates of remapped PCIe Root Port and NVMe

From: Bjorn Helgaas
Date: Tue Feb 06 2024 - 18:30:22 EST


On Tue, Feb 06, 2024 at 01:25:29PM -0800, David E. Box wrote:
> On Mon, 2024-02-05 at 15:05 -0800, David E. Box wrote:
> > On Mon, 2024-02-05 at 16:42 -0600, Bjorn Helgaas wrote:
> > > On Mon, Feb 05, 2024 at 11:37:16AM -0800, David E. Box wrote:
> > > > On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> > > > > On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> > > > ...
> > >
> > > > > > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
> > > > > >         pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, ltr_reg);
> > > > > >         pci_info(pdev, "VMD: Default LTR value set by driver\n");
> > > > >
> > > > > You're not changing this part, and I don't understand exactly how LTR
> > > > > works, but it makes me a little bit queasy to read "set the LTR value
> > > > > to the maximum required to allow the deepest power management
> > > > > savings" and then we set the max snoop values to a fixed constant.
> > > > >
> > > > > I don't think the goal is to "allow the deepest power savings"; I
> > > > > think it's to enable L1.2 *when the device has enough buffering to
> > > > > absorb L1.2 entry/exit latencies*.
> > > > >
> > > > > The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> > > > > the platform's maximum supported latency or less," so it seems like
> > > > > that value must be platform-dependent, not fixed.
> > > > >
> > > > > And I assume the "_DSM for Latency Tolerance Reporting" is part of the
> > > > > way to get those platform-dependent values, but Linux doesn't actually
> > > > > use that yet.
> > > >
> > > > This may indeed be the best way, but we need to double-check with our
> > > > BIOS folks.  AFAIK BIOS writes the LTR values directly so there
> > > > hasn't been a need to use this _DSM. But under VMD the ports are
> > > > hidden from BIOS which is why we added it here. I've brought up the
> > > > question internally to find out how Windows handles the DSM and to
> > > > get a recommendation from our firmware leads.
> > >
> > > We want Linux to be able to program LTR itself, don't we?  We
> > > shouldn't have to rely on firmware to do it.  If Linux can't do
> > > it, hot-added devices aren't going to be able to use L1.2,
> > > right?
> >
> > Agreed. We just want to make sure we are not conflicting with what
> > BIOS may be doing.
>
> So the feedback is to run the _DSM and just overwrite any BIOS
> values. Looking up the _DSM, I saw there was an attempt to upstream
> this 4 years ago [1]. I'm not sure why the effort stalled, but we
> can pick up this work again.
>
> https://patchwork.kernel.org/project/linux-pci/patch/20201015080311.7811-1-puranjay12@xxxxxxxxx/

There was a PCI SIG discussion about this a few years ago that never
really seemed to get resolved:
https://members.pcisig.com/wg/PCIe-Protocol/mail/thread/35064

Unfortunately that discussion is not public, but the summary is:

Q: How is the LTR_L1.2_THRESHOLD value determined?

PCIe r5.0, sec 5.5.4, says the same value must be programmed into
both Ports.

A: As noted in sec 5.5.4, the value is determined primarily by
the amount of time it will take to re-establish the common
mode bias on the AC coupling caps, and it is assumed that the
BIOS knows this.
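
For concreteness, mirroring the threshold into both ends of a Link
would look roughly like this.  Only a sketch: the capability ID,
register offset, and field masks are the real ones from
include/uapi/linux/pci_regs.h, but the function itself is
hypothetical and error handling is omitted:

  static void l12_threshold_sync(struct pci_dev *parent,
                                 struct pci_dev *child, u32 threshold)
  {
          /* Sketch only: assumes both Ports have the L1SS capability;
           * "threshold" carries both the Value and Scale fields.
           */
          u32 mask = PCI_L1SS_CTL1_LTR_L12_TH_VALUE |
                     PCI_L1SS_CTL1_LTR_L12_TH_SCALE;
          u16 ppos = pci_find_ext_capability(parent, PCI_EXT_CAP_ID_L1SS);
          u16 cpos = pci_find_ext_capability(child, PCI_EXT_CAP_ID_L1SS);
          u32 ctl1;

          /* Sec 5.5.4: both ends of the Link must get the same value */
          pci_read_config_dword(parent, ppos + PCI_L1SS_CTL1, &ctl1);
          pci_write_config_dword(parent, ppos + PCI_L1SS_CTL1,
                                 (ctl1 & ~mask) | (threshold & mask));

          pci_read_config_dword(child, cpos + PCI_L1SS_CTL1, &ctl1);
          pci_write_config_dword(child, cpos + PCI_L1SS_CTL1,
                                 (ctl1 & ~mask) | (threshold & mask));
  }

None of that is hard to write; the hard part is choosing the
threshold in the first place.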

Q: How are the LTR Max Snoop values determined?

PCI Firmware r3.3, sec 4.6.6, says the LTR _DSM reports the max
values for each Downstream Port embedded in the platform, and the
OS should calculate latencies along the path between each
Downstream Port and any Upstream Port (Switch Upstream Port or
Endpoint).

Of course, Switches not embedded in the platform (e.g., external
Thunderbolt hierarchies) will not have this _DSM, but I assume
they should contribute to this sum?

A: The fundamental problem is that there is no practical way for
software to discover the AC coupling capacitor sizes and
common mode bias circuit impedance.

Software could compute conservative values, but they would
likely be 10x worse than typical, so the L1.2 exit latency
would be significantly longer than actually necessary.

The interoperability issues here were understood when
designing L1 Substates, but no viable solution was found.
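
The "calculate latencies along the path" piece is the easy part;
walking up from the Endpoint is trivial.  A sketch, where
switch_latency_ns() is entirely hypothetical because, per the above,
there's no standard way to discover that number:

  static u32 path_latency_ns(struct pci_dev *ep)
  {
          struct pci_dev *bridge;
          u32 total = 0;

          /* Sum each intervening bridge's contribution on the way up
           * to the Downstream Port embedded in the platform.
           * switch_latency_ns() is hypothetical; software has no
           * standard way to get that number today.
           */
          for (bridge = pci_upstream_bridge(ep); bridge;
               bridge = pci_upstream_bridge(bridge))
                  total += switch_latency_ns(bridge);

          return total;
  }

With a trustworthy switch_latency_ns(), combining this walk with the
_DSM value for the embedded Downstream Port would give us the number
to program; without it, we're back to guessing conservatively.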

So the main reason Puranjay's work got stalled is that I didn't feel
confident enough that we understood how to do this, especially for
external devices.

It would be great if somebody *did* feel confident about interpreting
and implementing all this.

Bjorn