Re: Global lock for PCI configuration accesses

From: Thierry Reding
Date: Thu Sep 10 2015 - 09:00:27 EST


On Wed, Sep 09, 2015 at 12:27:38PM -0500, Bjorn Helgaas wrote:
> Hi Thierry,
>
> On Wed, Sep 9, 2015 at 10:11 AM, Thierry Reding
> <thierry.reding@xxxxxxxxx> wrote:
> > Hi,
> >
> > There's currently an issue with PCI configuration space accesses on
> > Tegra. The PCI host controller driver's ->map_bus() implementation
> > remaps I/O memory on-demand to avoid potentially wasting 256 MiB of
> > virtual address space. This is done because the mapping isn't
> > ECAM-compatible: the extended register number is encoded in the
> > uppermost 4 bits of the offset. This means that if we want to
> > address the configuration space for a single bus we already need to
> > map 256 MiB of memory, even if only 1 MiB is really used.
> >
> > tegra_pcie_bus_alloc() is therefore used to stitch together a 1 MiB
> > block of virtual addresses per bus, made up of sixteen 64 KiB
> > chunks, so that only what's really needed is mapped.
> >
> > That function gets called the first time a PCI configuration access
> > is performed on a bus. The code calls functions that may sleep, and
> > that causes problems because the PCI configuration space accessors
> > are called with the global pci_lock held. This works in practice
> > but it blows up when lockdep is enabled.
> >
> > I remember coding up a fix for this using the ARM/PCI ->add_bus()
> > callbacks at one point and then forgetting about it. When I wanted
> > to revive that patch a little while ago I noticed that ->add_bus()
> > is now gone.
>
> Removed by 6cf00af0ae15 ("ARM/PCI: Remove unused pcibios_add_bus() and
> pcibios_remove_bus()"), I think. That only removed the ARM
> implementation; the hook itself is still called, but on every arch
> except x86 and ia64, we use the default no-op implementation. You
> could add it back, I guess. It was removed because the MSI-related
> stuff that used to be in the ARM version is now done in a more generic
> way (see 49dcc01a9ff2 ("ARM/PCI: Save MSI controller in
> pci_sys_data")).
>
> > What I'm asking myself now is how to fix this. I suppose it'd be
> > possible to bring back ->add_bus(), though I suspect there were good
> > reasons to remove it (portability?).
>
> > Another possible fix would be
> > to get rid of the spinlock protecting these accesses. It seems to me
> > like it's not really necessary in the majority of cases. For drivers
> > that do a simple readl() or writel() on some memory-mapped I/O the
> > lock doesn't protect anything.
>
> I've wondered about removing pci_lock, too. It seems like it could be
> removed in principle, but it would be a lot of work to audit
> everything. Probably more work than you want to do just to fix Tegra
> :)

Thinking more about this, I'm not sure removing the lock would improve
the situation much. It seems like everything assumes that accesses to
PCI configuration space can be done in interrupt context, so what we do
on Tegra wouldn't be valid anyway.

I'm not sure if that's a reasonable assumption though. If it isn't then
removing the lock (or pushing it down into drivers as necessary) might
be the right thing to do.
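
To illustrate what I mean by pushing the lock down, the accessors in a
host controller driver would end up looking something like this (only a
sketch; pci_generic_config_read() is the existing generic helper, the
struct and field names are made up):

#include <linux/pci.h>
#include <linux/spinlock.h>

struct tegra_pcie {
        raw_spinlock_t conf_lock;       /* would replace the global pci_lock */
        /* ... */
};

static int tegra_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
                                  int where, int size, u32 *value)
{
        struct tegra_pcie *pcie = bus->sysdata;
        unsigned long flags;
        int ret;

        /*
         * Sleeping is still not allowed here because configuration
         * space may be accessed from interrupt context, hence the
         * irqsave variant.
         */
        raw_spin_lock_irqsave(&pcie->conf_lock, flags);
        ret = pci_generic_config_read(bus, devfn, where, size, value);
        raw_spin_unlock_irqrestore(&pcie->conf_lock, flags);

        return ret;
}

Note that this doesn't buy Tegra anything by itself: the on-demand
mapping done by ->map_bus() would still run under a spinlock.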

> > Then again, there are a lot of pci_ops implementations in the tree,
> > and simply removing the global lock seems like it'd have a good chance
> > of breaking things for somebody.
> >
> > So short of auditing all pci_ops implementations and pushing the lock
> > down into drivers, does anyone have any good ideas on how to fix this?
>
> The 32-bit version of pci_mmcfg_read() uses fixmap to map the page it
> needs on-demand. Could you do something similar, i.e., allocate the
> virtual space (which I assume is the part that might sleep), then
> redirect the virt-to-phys mapping while holding the lock?

I hadn't come across fixmap before, but unfortunately I don't think it
would work in this case. In particular we'd need to map 256 MiB in order
to address the configuration space of a single device. That said, if we
did the mapping directly in the configuration space accessor we could
map a single 64 KiB chunk at a time, depending on which register is
being accessed. That's still quite a lot of code that would have to run
for every single configuration space access.
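
For the record, the per-access variant would look roughly like the
32-bit pci_mmcfg_read(). Something along these lines, where
FIX_TEGRA_PCIE_CFG and tegra_pcie_conf_physaddr() are hypothetical, and
fixmap being page-granular means we'd map the 4 KiB page containing the
register rather than a whole 64 KiB chunk:

#include <linux/io.h>
#include <asm/fixmap.h>

/*
 * Relies on pci_lock (or a driver-private lock) for serialization,
 * since all CPUs share the single fixmap slot.
 */
static u32 tegra_pcie_conf_read(struct tegra_pcie *pcie, unsigned int busnr,
                                unsigned int devfn, int where)
{
        phys_addr_t phys = tegra_pcie_conf_physaddr(pcie, busnr, devfn,
                                                    where);
        void __iomem *addr;
        u32 value;

        /* map the page containing the register, read, unmap again */
        set_fixmap_io(FIX_TEGRA_PCIE_CFG, phys & PAGE_MASK);
        addr = (void __iomem *)fix_to_virt(FIX_TEGRA_PCIE_CFG);
        value = readl(addr + (phys & ~PAGE_MASK));
        clear_fixmap(FIX_TEGRA_PCIE_CFG);

        return value;
}

On top of the extra code, every access would pay for two PTE updates
and the associated TLB maintenance.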

Moreover, this is really only a problem during enumeration. Once
enumeration completes, the configuration space mappings for each bus
will have been set up and subsequent accesses no longer hit the slow
path.

Given that, and the assumption that PCI configuration space might be
accessed from interrupt context, it seems like a more suitable option
would be to add back some way of calling into the driver every time a
new bus gets created. I guess we could even reuse ->map_bus(), but it
might be better to add an ->add_bus() callback to struct pci_ops to
more clearly separate the slow path from the fast path.
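
Concretely, I imagine something like this (only a sketch, the exact
signature would be up for discussion):

struct pci_ops {
        /*
         * New, optional callback, invoked from process context whenever
         * a new bus is created and before any configuration space
         * access on that bus. Unlike the other callbacks it would be
         * allowed to sleep, so drivers can do expensive setup here.
         */
        int (*add_bus)(struct pci_bus *bus);

        void __iomem *(*map_bus)(struct pci_bus *bus, unsigned int devfn,
                                 int where);
        int (*read)(struct pci_bus *bus, unsigned int devfn, int where,
                    int size, u32 *val);
        int (*write)(struct pci_bus *bus, unsigned int devfn, int where,
                     int size, u32 val);
};

The core would call it from pci_add_new_bus() or thereabouts, and Tegra
could then move the tegra_pcie_bus_alloc() call into it, leaving
->map_bus() as a simple non-sleeping lookup.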

Any objections to adding ->add_bus() to struct pci_ops? That does have
the benefit of being more portable than adding back the pcibios hooks
that were previously removed.

Thierry
