Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

From: Alex Deucher
Date: Thu Aug 13 2015 - 18:13:17 EST


On Thu, Aug 13, 2015 at 4:15 PM, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> On Thu, Aug 13, 2015 at 3:46 PM, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>> On Mon, Aug 10, 2015 at 9:06 PM, Jiang Liu <jiang.liu@xxxxxxxxxxxxxxx> wrote:
>>> On 2015/8/10 23:00, Alex Deucher wrote:
>>>> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <jiang.liu@xxxxxxxxxxxxxxx> wrote:
>>>>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>>>>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>>>>> With multi-MSI capable SATA controllers, only the first port works,
>>>>> all other ports times out when executing SATA commands. This regression
>>>>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>>>>> MSI interrupts"), but it's not the root cause, it just triggers a bug
>>>>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>>>>> CPU interrupt vectors").
>>>>>
>>>>> With this patch applied, the affected SATA controllers work as expected.
>>>>
>>>> Yes, this fixes the SATA regression:
>>>> Tested-by: Alex Deucher <alexander.deucher@xxxxxxx>
>>>>
>>>> I'm not sure if it's related to this patch or not (I haven't bisected
>>>> it independently yet), but MSIs don't seem to work on GPUs. See the
>>>> line for amdgpu. This is just after loading the driver.
>>> Hi Alex,
>>> This patch only affects multiple-MSI, and it seems that your
>>> gpu only uses one MSI interrupt, so it may not be related to this patch.
>>> And this seems like a sort of interrupt storm.
>>>> 52: 16579895 16579562 16580988 16583443 IR-PCI-MSI
>>>> 524288-edge amdgpu
>>>
>>> Does it make any change by disable interrupt remapping?
>>
>> Nope. Still going crazy:
>> 46: 4769660 4769130 4775899 4784657 PCI-MSI
>> 524288-edge amdgpu
>>
>>
>>> Does it make any change by disable MSI?
>>
>> If I set pci=nomsi, the sata controllers time out. If I disable MSIs
>> just for the gpu, I don't get any interrupts:
>> 25: 0 0 0 0 IR-IO-APIC
>> 0-fasteoi amdgpu
>>
>
> Strangely, it only seems to affect certain boards. E.g., this card works fine:
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Bonaire XT [Radeon HD 7790/8770 / R9 260 OEM] (prog-if 00
> [VGA controller])
> Subsystem: Diamond Multimedia Systems Device 2329
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort+ <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 52
> Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
> Region 2: Memory at d0000000 (64-bit, prefetchable) [size=8M]
> Region 4: I/O ports at e000 [size=256]
> Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
> Expansion ROM at ff640000 [disabled] [size=128K]
> Capabilities: [48] Vendor Specific Information: Len=08 <?>
> Capabilities: [50] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
> PME(D0-,D1+,D2+,D3hot+,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
> <4us, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit
> Latency L0s <64ns, L1 <1us
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-,
> OBFF Not Supported
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete+, EqualizationPhase1+
> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000fee00000 Data: 0000
> Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
> Capabilities: [150 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [270 v1] #19
> Capabilities: [2b0 v1] Address Translation Service (ATS)
> ATSCap: Invalidate Queue Depth: 00
> ATSCtl: Enable+, Smallest Translation Unit: 00
> Capabilities: [2c0 v1] #13
> Capabilities: [2d0 v1] #1b
> Kernel driver in use: amdgpu
>
> This one does not:
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Device 6939 (prog-if 00 [VGA controller])
> Subsystem: Gigabyte Technology Co., Ltd Device 229d
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort+ <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 52
> Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
> Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
> Region 4: I/O ports at e000 [size=256]
> Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
> Expansion ROM at ff640000 [disabled] [size=128K]
> Capabilities: [48] Vendor Specific Information: Len=08 <?>
> Capabilities: [50] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
> PME(D0-,D1+,D2+,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
> <4us, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit
> Latency L0s <64ns, L1 <1us
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-,
> OBFF Not Supported
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete+, EqualizationPhase1+
> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000fee00000 Data: 0000
> Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
> Capabilities: [150 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [200 v1] #15
> Capabilities: [270 v1] #19
> Capabilities: [2b0 v1] Address Translation Service (ATS)
> ATSCap: Invalidate Queue Depth: 00
> ATSCtl: Enable+, Smallest Translation Unit: 00
> Capabilities: [2c0 v1] #13
> Capabilities: [2d0 v1] #1b
> Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 1
> ARICtl: MFVC- ACS-, Function Group: 0
> Kernel driver in use: amdgpu
>
> Any ideas? I'll see if I can find the time to bisect this.

I attempted to bisect this, however the regression happened prior to
my driver being merged upstream:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=099bfbfc7fbbe22356c02f0caf709ac32e1126ea
So I can't easily bisect it further without backporting the driver to
each commit before that. This may take a while...

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/