Re: [PATCH v1] PCI: Fixup the RTIT_BAR of Intel TH on Denverton

From: Bjorn Helgaas
Date: Thu Oct 26 2017 - 12:51:42 EST


On Thu, Oct 26, 2017 at 05:28:17PM +0300, Alexander Shishkin wrote:
> On some integrations of the Intel(R) Trace Hub (for a reference and
> overview see Documentation/trace/intel_th.txt) the reported size of
> one of its resources (RTIT_BAR) doesn't match its actual size, which
> leads to overlaps with other devices' resources.
>
> On a Denverton platform it overlaps with XHCI MMIO space, which results
> in the xhci driver bailing out after seeing its registers as 0xffffffff,
> and perceived disappearance of all USB devices:
>
> > intel_th_pci 0000:00:1f.7: enabling device (0004 -> 0006)
> > xhci_hcd 0000:00:15.0: xHCI host controller not responding, assume dead
> > xhci_hcd 0000:00:15.0: xHC not responding in xhci_irq, assume controller is dead
> > xhci_hcd 0000:00:15.0: HC died; cleaning up
> > usb 1-1: USB disconnect, device number 2
> ...
>
> For this reason, we need to resize the RTIT_BAR on Denverton to its
> actual size, which in this case is 4MB.

How do you figure out the actual size? Per sec 11.1 of the spec you
reference, RTIT_BAR (BAR2) tells us the address of the RTMR, which is
a *variable size* MMIO space.

RTIT_LBAR (sec 13.8.11) suggests that the BAR size should be between
256 and 4096 bytes. How do you get to 4MB? Is that a constant size
regardless of system configuration?

> Signed-off-by: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
> Link: https://software.intel.com/sites/default/files/managed/d3/3c/intel-th-developer-manual.pdf
> Fixes: 5118ccd34780 ("intel_th: pci: Add Denverton SOC support")

Is there an erratum for this? If the PCI core gets the wrong number
when it sizes the BAR, it *sounds* like a hardware or possibly a BIOS
defect.

You reference 5118ccd34780 ("intel_th: pci: Add Denverton SOC
support"), but if the PCI core can't size the BAR correctly, we're
susceptible to this problem any time this device is in the system,
regardless of whether the intl_th driver claims it.

So I think referencing 5118ccd34780 is a red herring and will lead
people to wrongly conclude that "I don't need this quirk because my
kernel doesn't include 5118ccd34780."

> Cc: stable@xxxxxxxxxxxxxxx
> ---
> drivers/pci/quirks.c | 16 ++++++++++++++++

It looks like this could go in arch/x86/pci/fixup.c? (There are a lot
of things in drivers/pci/quirks.c that look like they are actually
x86-specific.)

> 1 file changed, 16 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index a4d33619a7bb..d321ea6427b8 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4799,3 +4799,19 @@ static void quirk_no_ats(struct pci_dev *pdev)
> /* AMD Stoney platform GPU */
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> #endif /* CONFIG_PCI_ATS */
> +
> +static void quirk_intel_th_dnv(struct pci_dev *dev)
> +{
> + struct resource *r = &dev->resource[4];
> +
> + /*
> + * Denverton reports 2k of RTIT_BAR (intel_th resource 4), which
> + * appears to be 4 MB in reality.
> + */
> + if (r->end == r->start + 0x7ff) {
> + r->start = 0;
> + r->end = 0x3fffff;
> + r->flags |= IORESOURCE_UNSET;

Did you check that this results in the device being disabled? I'm
concerned that if BIOS leaves the device enabled and consuming 4MB of
space (but reporting only 2KB), we may have a conflict. I'm not sure
that updating the resource and marking it IORESOURCE_UNSET will
actually result in fixing the conflict. E.g., maybe we can't actually
assign 4MB so we don't touch the device itself, so it remains enabled.

> + }
> +}
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x19e1, quirk_intel_th_dnv);
> --
> 2.14.2
>