Re: VPD access Blocked by commit 0d5370d1d85251e5893ab7c90a429464de2e140b

From: Bjorn Helgaas
Date: Tue May 21 2019 - 16:14:34 EST


[fix linux-pci, remove ethan.zhao (bounces)]

From: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Date: Tue, May 21, 2019 at 3:02 PM
To: Himanshu Madhani
Cc: ethan.zhao@xxxxxxxxxx, Andrew Vasquez, Girish Basrur, Giridhar
Malavali, Myron Stowe, <linux-pci@xxxxxxxxxx>, Linux Kernel Mailing
List, Quinn Tran

> [+cc Myron, Quinn, linux-pci, linux-kernel]
>
> From: Himanshu Madhani <hmadhani@xxxxxxxxxxx>
> Date: Fri, May 17, 2019 at 5:21 PM
> To: ethan.zhao@xxxxxxxxxx, bhelgaas@xxxxxxxxxx
> Cc: Andrew Vasquez, Girish Basrur, Giridhar Malavali
>
> > Hi Ethan,
> >
> > Our OEM partners reported to us that VPD access with the latest distros was returning an I/O error for them. They indicated this is an issue only with newer kernels.
> >
> > One of the distro vendors pointed to a patch you posted as the reason for the I/O error when trying to read VPD. The patch appears to block access to VPD by blacklisting the ISP.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d5370d1d85251e5893ab7c90a429464de2e140b
> >
> > I set up a PCIe analyzer in our lab to reproduce and root-cause this, and discovered that after reverting the patch I am able to get VPD data okay with upstream 5.1.0; I used RHEL8.
> >
> > I also used "lspci" and "cat" to dump out VPD data and do not see any issue.
> >
> > # lspci -vvv -s 03:00.0
> > 03:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
> > Subsystem: QLogic Corp. QLE2742 Dual Port 32Gb Fibre Channel to PCIe Adapter
> > Physical Slot: 15
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 67
> > NUMA node: 0
> > Region 0: Memory at fbe05000 (64-bit, prefetchable) [size=4K]
> > Region 2: Memory at fbe02000 (64-bit, prefetchable) [size=8K]
> > Region 4: Memory at fbd00000 (64-bit, prefetchable) [size=1M]
> > Expansion ROM at fb540000 [disabled] [size=256K]
> > Capabilities: [44] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [4c] Express (v2) Endpoint, MSI 00
> > DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <4us, L1 <1us
> > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <2us
> > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
> > AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > AtomicOpsCtl: ReqEn-
> > LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> > Compliance De-emphasis: -6dB
> > LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
> > EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> > Capabilities: [88] Vital Product Data
> > Product Name: QLogic 32Gb 2-port FC to PCIe Gen3 x8 Adapter
> > Read-only fields:
> > [PN] Part number: QLE2742
> > [SN] Serial number: RFD1706R22611
> > [EC] Engineering changes: BK3210408-05 04
> > [V9] Vendor specific: 010189
> > [RV] Reserved: checksum good, 0 byte(s) reserved
> > End
> > Capabilities: [90] MSI-X: Enable+ Count=16 Masked-
> > Vector table: BAR=2 offset=00000000
> > PBA: BAR=2 offset=00001000
> > Capabilities: [9c] Vendor Specific Information: Len=0c <?>
> > Capabilities: [100 v1] Advanced Error Reporting
> > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> > MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> > HeaderLog: 00000000 00000000 00000000 00000000
> > Capabilities: [154 v1] Alternative Routing-ID Interpretation (ARI)
> > ARICap: MFVC- ACS-, Next Function: 1
> > ARICtl: MFVC- ACS-, Function Group: 0
> > Capabilities: [1c0 v1] #19
> > Capabilities: [1f4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
> > Kernel driver in use: qla2xxx
> > Kernel modules: qla2xxx
> >
> > # cat /sys/bus/pci/devices/0000\:03\:00.0/vpd
> > RFD1706R22611ECBK3210408-05 04V9010189RVïx
> >
> > Can you share some more insight into where you encountered the issue? I am in the process of reverting this patch from the upstream kernel, but wanted to reach out and find out whether you still have the setup to provide more context.
>
> 0d5370d1d852 ("PCI: Prevent VPD access for QLogic ISP2722") prevented
> a panic while reading VPD, so we can't simply revert it.
>
> Since you don't see a panic while reading VPD from that device, it's
> possible that a QLogic firmware change fixed the VPD format so Linux
> no longer reads the area that caused the problem. Or possibly your
> system doesn't handle the config read error the same way Ethan's HP
> DL380 does. Unfortunately we don't have an actual PCIe analyzer trace
> from Ethan's system, so we don't know exactly what happened on PCIe.
>
> I suggest that you capture the entire VPD area and hexdump it, e.g.,
> with "xxd", and look at its structure. pci_vpd_size() parses it and
> computes the valid size based on a PCI_VPD_STIN_END tag, and
> pci_vpd_read() should not read past that size.
>
> And you *do* have an analyzer trace. If new QLogic firmware fixed the
> VPD format, the trace should show that Linux read only the valid part
> of VPD, and there should be no errors in the trace. Then it might
> just be a question of tweaking the quirk so it allows VPD reads if the
> firmware is new enough.
>
> But if the trace does show config reads with errors, then it might be
> that your system just tolerates the errors while the DL380 did not.
> Then we'd have to figure out exactly what the error was and how to
> deal with it so things work on both your system and Ethan's.
>
> Bjorn
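
For anyone following the suggestion above to dump VPD and look at its
structure, here is a rough user-space sketch of the kind of tag walk that
pci_vpd_size() is described as doing. It is not the kernel code, only an
illustration under some assumptions: the function name vpd_valid_size()
and the constant names are made up for this example, the tag values
follow the usual PCI VPD resource layout (large tags 0x02, 0x10, 0x11;
small end tag 0x0f), and the sample bytes are synthetic, not taken from
the QLE2742.

```python
# Sketch only: walk a saved VPD dump and report how many bytes are valid,
# i.e. everything up to and including the end tag. Names are hypothetical.

LARGE_FLAG = 0x80        # bit 7 set: large resource data type
LARGE_ID_STRING = 0x02   # identifier string (product name)
LARGE_VPD_R = 0x10       # read-only keyword area
LARGE_VPD_W = 0x11       # read-write keyword area
SMALL_END = 0x0F         # small resource end tag (cf. PCI_VPD_STIN_END)


def vpd_valid_size(data: bytes) -> int:
    """Return the size up to and including the end tag, or 0 if malformed."""
    off = 0
    while off < len(data):
        head = data[off]
        if head & LARGE_FLAG:
            # Large resource: 1 tag byte + 2-byte little-endian length.
            tag = head & 0x7F
            if tag not in (LARGE_ID_STRING, LARGE_VPD_R, LARGE_VPD_W):
                return 0          # unknown tag, do not trust the length
            if off + 3 > len(data):
                return 0          # header truncated
            length = data[off + 1] | (data[off + 2] << 8)
            off += 3 + length
        else:
            # Small resource: tag in bits 6..3, length in bits 2..0.
            tag = (head >> 3) & 0x0F
            off += 1 + (head & 0x07)
            if tag == SMALL_END and off <= len(data):
                return off
            return 0              # this sketch accepts only the end tag
    return 0                      # ran off the dump without an end tag


# Synthetic dump: ID string "abc", a VPD-R area with one PN keyword,
# the end tag (0x78), then padding that should not be counted.
sample = (bytes([0x82, 0x03, 0x00]) + b"abc"
          + bytes([0x90, 0x05, 0x00]) + b"PN\x02XY"
          + bytes([0x78]) + b"\xff" * 4)
```

Here vpd_valid_size(sample) gives 15 out of the 19 bytes in the dump. On
a real system the input could be the bytes read from the sysfs vpd file
shown earlier in the thread; if the result is 0, or much smaller than
what was actually read on the wire, that would be a hint about where the
format goes wrong.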