Re: QCA6174 pcie wifi: Add pci quirks

From: Bjorn Helgaas
Date: Tue May 25 2021 - 18:12:24 EST


On Thu, Apr 15, 2021 at 09:53:38PM +0200, Pali Rohár wrote:
> Hello!
>
> On Thursday 15 April 2021 13:01:19 Alex Williamson wrote:
> > [cc +Pali]
> >
> > On Thu, 15 Apr 2021 20:02:23 +0200
> > Ingmar Klein <ingmar_klein@xxxxxx> wrote:
> >
> > > First, thanks to you both, Alex and Bjorn!
> > > I am in no way an expert on this topic, so I have to rely fully on your
> > > feedback concerning this issue.
> > >
> > > If you have any other approach to a solution, in the form of a patch
> > > set, I would be glad to test it. Just let me know what you think might
> > > make sense.
> > > I will wait for your further feedback on the issue. In the meantime I
> > > have my current workaround via the quirk entry.
> > >
> > > By the way, a layman's question:
> > > Do you think that the following topic might also apply to the QCA6174?
> > > https://www.spinics.net/lists/linux-pci/msg106395.html
>
> I have been testing more ath cards and I'm going to send a new version
> of this patch that includes more PCI IDs.

Dropping this patch in favor of Pali's new version.

> > > Or in other words, should a similar approach be tried for the QCA6174,
> > > and if so, would it bring any benefit at all?
> > > I hope you will excuse me if these questions do not make much sense.
> >
> > If you run lspci -vvv on your device, what do LnkCap and LnkSta report
> > under the express capability? I wonder if your device even supports
> > speeds above Gen1; mine does not.
> >
> > I would not expect that patch to be relevant to you based on your
> > report. I understand it to resolve an issue during link retraining to a
> > higher speed on boot, not during a bus reset. Pali can correct if I'm
> > wrong. Thanks,
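
For anyone following along, the LnkCap/LnkSta check asked for above can be scripted. The sketch below is only an illustration: the function names and the sample text are made up (the sample is not output from Ingmar's card), and it assumes the usual `lspci -vvv` layout in which both lines carry a "Speed <n>GT/s" field.

```python
import re

# Matches "LnkCap: ... Speed 2.5GT/s ..." and "LnkSta: Speed 2.5GT/s ...".
# The trailing colon keeps LnkCap2/LnkSta2 lines from matching.
_LINK_RE = re.compile(r"\b(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s")


def link_speeds(lspci_text):
    """Return the first LnkCap and LnkSta speeds (in GT/s) found in the text."""
    speeds = {}
    for line in lspci_text.splitlines():
        match = _LINK_RE.search(line)
        if match and match.group(1) not in speeds:
            speeds[match.group(1)] = float(match.group(2))
    return speeds


def supports_above_gen1(speeds):
    """Gen1 runs at 2.5 GT/s; anything faster in LnkCap means Gen2 or later."""
    return speeds.get("LnkCap", 0.0) > 2.5


# Illustrative sample only, not taken from a real QCA6174.
SAMPLE = """\
\tCapabilities: [70] Express (v2) Endpoint, MSI 00
\t\tLnkCap:\tPort #0, Speed 2.5GT/s, Width x1, ASPM L0s L1
\t\tLnkSta:\tSpeed 2.5GT/s (ok), Width x1 (ok)
\t\tLnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer-
"""

if __name__ == "__main__":
    speeds = link_speeds(SAMPLE)
    print(speeds, "above Gen1:", supports_above_gen1(speeds))
```

Feeding it the real output of `lspci -vvv -s <BDF>` instead of SAMPLE would answer the question for a given card; LnkCap is what the device advertises, LnkSta is what the link actually negotiated.
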
>
> These two issues are related. Both operations (PCIe Hot Reset and
> PCIe Link Retraining) cause a reset of ath chips. It seems that they
> cause a double reset. After a reset these chips read their configuration
> from internal EEPROM/OTP, and if another reset is triggered before the
> chip finishes reading its internal configuration, it stops working. My
> testing showed that ath10k chips completely disappear from the PCIe bus,
> some ath9k chips work fine but start reporting an incorrect PCI ID
> (0xABCD), and some other ath9k chips report the correct PCI ID but do
> not work. I had a discussion with Adrian Chadd, who probably knows
> everything about ath9k, and he confirmed to me that this issue exists
> with ath9k and ath10k chips.
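
The failure modes described above can be told apart, roughly, from the first four bytes of config space. A minimal sketch follows. The function name and state labels are invented for illustration, treating 0xABCD as the device ID field is an assumption (the text above only says "PCI ID"), and the expected device ID is a placeholder supplied by the caller. A device that has dropped off the bus is assumed to read back as all ones, which is the usual result of a failed config read.

```python
import struct

ATHEROS_VENDOR_ID = 0x168C  # PCI vendor ID used by ath9k/ath10k devices


def classify_id(config_bytes, expected_device, expected_vendor=ATHEROS_VENDOR_ID):
    """Classify a device from the first 4 bytes of its PCI config space."""
    # Vendor ID is at offset 0x00, device ID at 0x02, both little-endian.
    vendor, device = struct.unpack_from("<HH", config_bytes, 0)
    if vendor == 0xFFFF and device == 0xFFFF:
        return "gone"           # nothing answers config reads
    if device == 0xABCD:
        return "bogus-id"       # the incorrect ID mentioned above
    if (vendor, device) == (expected_vendor, expected_device):
        return "id-ok"          # ID is right; the card may still not work
    return "unexpected-id"
```

The bytes could come from the device's sysfs `config` file. Note that "id-ok" says nothing about whether the card functions, since some cards report the correct ID and still do not work.
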
>
> He wrote to me that the workaround to bring a card back from this
> "broken" state is to do a PCIe Cold Reset of the card, which means
> turning off the power supply for the particular PCIe slot. This is not
> supported on many low-end boards, so the workaround cannot be applied
> there.
>
> I was able to recover my test cards from this "broken" state with a PCIe
> Warm Reset (= reset via the PERST# pin).
>
> I have tried many other reset methods (PCIe PM reset, Link Down, PCIe
> Hot Reset with a longer interval, ...) but nothing worked. So it seems
> that the only workaround is to do a PCIe Cold Reset or PCIe Warm Reset.
>
> I will send V2 of my patch with details and explanation.
>
> As the kernel does not have an API for doing a PCIe Warm Reset, I think
> this is another argument for why the kernel really needs one.
>
> I do not have any QCA6174 card for testing, but based on the fact that I
> reproduced this issue with more ath9k and ath10k cards and Adrian
> confirmed that the above reset issue exists, I think that it affects all
> AR9xxx and QCAxxxx cards handled by the ath9k and ath10k drivers.
>
> I was told that AMI was patching their BIOSes found in notebooks to
> avoid triggering this issue on the notebooks' ath9k cards.
>
> > Alex
> >
> > > Am 15.04.2021 um 04:36 schrieb Alex Williamson:
> > > > On Wed, 14 Apr 2021 16:03:50 -0500
> > > > Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > >
> > > >> [+cc Alex]
> > > >>
> > > >> On Fri, Apr 09, 2021 at 11:26:33AM +0200, Ingmar Klein wrote:
> > > >>> Edit: Retry, as I did not consider that my mail client would make this
> > > >>> partly HTML.
> > > >>>
> > > >>> Dear maintainers,
> > > >>> I recently encountered an issue on my Proxmox server system, which
> > > >>> includes a Qualcomm QCA6174 M.2 PCIe wifi module.
> > > >>> https://deviwiki.com/wiki/AIRETOS_AFX-QCA6174-NX
> > > >>>
> > > >>> On system boot and subsequent virtual machine start (with the QCA6174
> > > >>> passed through), the VM would just freeze/hang at the point where the
> > > >>> ath10k driver loads.
> > > >>> A quick search of Proxmox-related topics brought me to the following
> > > >>> discussion, which suggested a PCI quirk entry for the QCA6174 in the kernel:
> > > >>> https://forum.proxmox.com/threads/pcie-passthrough-freezes-proxmox.27513/
> > > >>>
> > > >>> I then went ahead, got the Proxmox kernel source (v5.4.106) and applied
> > > >>> the attached patch.
> > > >>> The effect was as hoped: the VM hangs are now gone. The system boots and
> > > >>> runs as intended.
> > > >>>
> > > >>> Judging by the existing quirk entries for Atheros, I would think that
> > > >>> my proposed "fix" could be included in the vanilla kernel.
> > > >>> As far as I can see, there is no entry yet, even in the latest kernel sources.
> > > >> This would need a signed-off-by; see
> > > >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.11#n361
> > > >>
> > > >> This is an old issue, and likely we'll end up just applying this as
> > > >> yet another quirk. But looking at c3e59ee4e766 ("PCI: Mark Atheros
> > > >> AR93xx to avoid bus reset"), where it started, it seems to be
> > > >> connected to 425c1b223dac ("PCI: Add Virtual Channel to save/restore
> > > >> support").
> > > >>
> > > >> I'd like to dig into that a bit more to see if there are any clues.
> > > >> AFAIK Linux itself still doesn't use VC at all, and 425c1b223dac added
> > > >> a fair bit of code. I wonder if we're restoring something out of
> > > >> order or making some simple mistake in the way we restore VC config.
> > > > I don't really have any faith in that bisect report in commit
> > > > c3e59ee4e766. To double check I dug out the card from that commit,
> > > > installed an old Fedora release so I could build kernel v3.13,
> > > > pre-dating 425c1b223dac and tested triggering a bus reset both via
> > > > setpci and by masking PM reset so that sysfs can trigger the bus reset
> > > > path with the kernel save/restore code. Both result in the system
> > > > hanging when the device is accessed either restoring from the kernel
> > > > bus reset or reading from the device after the setpci reset. Thanks,
> > > >
> > > > Alex
> > > >
> > >
> >
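
For reference, the bus reset discussed above is the Secondary Bus Reset bit in the upstream bridge's Bridge Control register (offset 0x3E, bit 6; the constant names below mirror PCI_BRIDGE_CONTROL and PCI_BRIDGE_CTL_BUS_RESET from the kernel's pci_regs.h). The sketch only computes the values that would be written. The helper names are hypothetical, and the exact setpci invocation used in the test above is not given in this thread.

```python
PCI_BRIDGE_CONTROL = 0x3E        # offset of the 16-bit Bridge Control register
PCI_BRIDGE_CTL_BUS_RESET = 0x40  # Secondary Bus Reset bit


def assert_bus_reset(bridge_ctl):
    """Value to write to start a secondary bus reset (other bits preserved)."""
    return (bridge_ctl | PCI_BRIDGE_CTL_BUS_RESET) & 0xFFFF


def deassert_bus_reset(bridge_ctl):
    """Value to write to end the reset (other bits preserved)."""
    return bridge_ctl & ~PCI_BRIDGE_CTL_BUS_RESET & 0xFFFF


def reset_sequence(bridge_ctl):
    """Return the two values written, in order, for one reset pulse."""
    asserted = assert_bus_reset(bridge_ctl)
    return [asserted, deassert_bus_reset(asserted)]
```

A real reset also needs a delay between the two writes and a settling time before the device is touched again, which this sketch leaves out.
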