Re: [PATCH] PCI: aardvark: Implement re-issuing config requests on CRS response
From: Pali Rohár
Date: Sun Sep 26 2021 - 07:13:30 EST
On Wednesday 22 September 2021 11:48:03 Bjorn Helgaas wrote:
> On Wed, Sep 22, 2021 at 12:17:36PM +0200, Pali Rohár wrote:
> > On Thursday 16 September 2021 09:32:57 Bjorn Helgaas wrote:
> > > On Wed, Sep 15, 2021 at 12:55:53PM +0200, Pali Rohár wrote:
> > > ...
>
> > > > I'm not sure how "legal" it is if userspace / setpci changes some of
> > > > these bits. At least on a hardware with a real Root Port device it
> > > > should be fully transparent. As hardware handles this re-issue and
> > > > kernel then would see (reissued) response.
> > >
> > > If setpci changes bits like these, all bets are off. We can't tell
> > > what happened, so we can't rely on any configuration Linux did. I
> > > think we really should taint the kernel when this happens.
> >
> > For testing purposes, setpci is still a very good tool.
>
> Yes, absolutely!
>
> > > > Test case: Initialize device, then unbind it from sysfs, reset it (hot
> > > > reset or warm reset) and then rescan / reinit it again. Here device is
> > > > permitted to send CRS response.
> > > >
> > > > We know that more PCIe cards are buggy and sometimes firmware on cards
> > > > crashes or resets card logic. Which may put card into initialization
> > > > state when it is again permitted to send CRS response.
> > >
> > > Yep. That's a buggy device and normally we would work around it with
> > > a quirk. This particular kind of bug would be hard to work around,
> > > but a host bridge driver doesn't seem like the right place to do it
> > > because we'd have to do it in *every* such driver.
> >
> > This described firmware crashing & card reset logic I saw in more wifi
> > cards. Sometimes even wifi drivers itself detects that card does not
> > respond and do some its own internal card reset (e.g. iwldvm on laptop).
> > So it very common situation.
> >
> > But I have not seen that these cards on laptop issue CRS response. Maybe
> > because their firmware or PCIe logic bootup too fast (so there is a very
> > little window for CRS response) or because CRS response sent to OS did
> > not cause any issue.
>
> After a reset, we normally delay 100ms *before* issuing the first
> config request to the device, e.g., [1]. I expect that in most cases
> the device has completed its initialization in that 100ms and it never
> responds with CRS status.
>
> > So no particular workaround is needed for above described scenario.
> >
> >
> > But anyway, in case that in future there would be need for disabling CRS
> > feature in kernel (e.g. for doing some workaround for endpoint or
> > extended pcie switch) then this re-issuing of config request on CRS
> > response in pci-aardvark.c would be needed to have similar behavior like
> > real HW hen CRS is disabled.
>
> To be pedantic, there's no such thing as "disabling CRS". CRS is a
> required feature with no enable/disable bit. There is only "enabling
> or disabling CRS *Software Visibility*".
Of course, I mean that CRSSVE bit.
> The config read of Vendor ID after a reset should be done by the PCI
> core, not a device driver.
Of course. But in case of unexpected reset (which PCI code does not
detect), card driver at the same time could issue some config read/write
request.
> If we disable CRS SV, the only outcomes of
> that read are:
>
> 1) Valid Vendor ID data, or
>
> 2) Failed transaction, typically reported as 0xffff data (and, I
> expect, an Unsupported Request or similar error logged)
Yes. And I think it should apply also for any other config register, not
just vendor id.
In case error reporting or AER functionality is not supported then there
would be no error logged. And PCI core / kernel does not have to know
that such thing happened.
Anyway, there is probably a candidate wifi card with "not ready" issue:
https://lore.kernel.org/linux-wireless/CAHp75Vd5iCLELx8s+Zvcj8ufd2bN6CK26soDMkZyC1CwMO2Qeg@xxxxxxxxxxxxxx/
> In either case there may have been zero or more retries on PCIe. The
> PCI core needs to handle the failure case sensibly even though it has
> no idea whether the read has been retried.
>
> > And I like the idea if driver is "feature complete" and prepared also
> > for other _valid_ code paths. This is just my opinion.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c?id=v5.14#n4651