Re: [PATCH] PCI: aardvark: Implement re-issuing config requests on CRS response

From: Bjorn Helgaas
Date: Wed Sep 22 2021 - 12:48:08 EST


On Wed, Sep 22, 2021 at 12:17:36PM +0200, Pali Rohár wrote:
> On Thursday 16 September 2021 09:32:57 Bjorn Helgaas wrote:
> > On Wed, Sep 15, 2021 at 12:55:53PM +0200, Pali Rohár wrote:
> > ...

> > > I'm not sure how "legal" it is if userspace / setpci changes some of
> > > these bits. At least on a hardware with a real Root Port device it
> > > should be fully transparent. As hardware handles this re-issue and
> > > kernel then would see (reissued) response.
> >
> > If setpci changes bits like these, all bets are off. We can't tell
> > what happened, so we can't rely on any configuration Linux did. I
> > think we really should taint the kernel when this happens.
>
> For testing purposes, setpci is still a very good tool.

Yes, absolutely!

> > > Test case: Initialize device, then unbind it from sysfs, reset it (hot
> > > reset or warm reset) and then rescan / reinit it again. Here device is
> > > permitted to send CRS response.
> > >
> > > We know that more PCIe cards are buggy and sometimes firmware on cards
> > > crashes or resets card logic. Which may put card into initialization
> > > state when it is again permitted to send CRS response.
> >
> > Yep. That's a buggy device and normally we would work around it with
> > a quirk. This particular kind of bug would be hard to work around,
> > but a host bridge driver doesn't seem like the right place to do it
> > because we'd have to do it in *every* such driver.
>
> This described firmware crashing & card reset logic I saw in more wifi
> cards. Sometimes even wifi drivers itself detects that card does not
> respond and do some its own internal card reset (e.g. iwldvm on laptop).
> So it very common situation.
>
> But I have not seen that these cards on laptop issue CRS response. Maybe
> because their firmware or PCIe logic bootup too fast (so there is a very
> little window for CRS response) or because CRS response sent to OS did
> not cause any issue.

After a reset, we normally delay 100ms *before* issuing the first
config request to the device, e.g., [1]. I expect that in most cases
the device has completed its initialization in that 100ms and it never
responds with CRS status.

> So no particular workaround is needed for above described scenario.
>
>
> But anyway, in case that in future there would be need for disabling CRS
> feature in kernel (e.g. for doing some workaround for endpoint or
> extended pcie switch) then this re-issuing of config request on CRS
> response in pci-aardvark.c would be needed to have similar behavior like
> real HW hen CRS is disabled.

To be pedantic, there's no such thing as "disabling CRS". CRS is a
required feature with no enable/disable bit. There is only "enabling
or disabling CRS *Software Visibility*".

The config read of Vendor ID after a reset should be done by the PCI
core, not a device driver. If we disable CRS SV, the only outcomes of
that read are:

1) Valid Vendor ID data, or

2) Failed transaction, typically reported as 0xffff data (and, I
expect, an Unsupported Request or similar error logged)

In either case there may have been zero or more retries on PCIe. The
PCI core needs to handle the failure case sensibly even though it has
no idea whether the read has been retried.

> And I like the idea if driver is "feature complete" and prepared also
> for other _valid_ code paths. This is just my opinion.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c?id=v5.14#n4651