Re: [PATCH] PCI: Fix devres regression in pci_intx()
From: Philipp Stanner
Date: Wed Sep 04 2024 - 09:29:48 EST
On Wed, 2024-09-04 at 06:57 -0600, Alex Williamson wrote:
> On Wed, 04 Sep 2024 09:06:37 +0200
> Philipp Stanner <pstanner@xxxxxxxxxx> wrote:
>
> > On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> > > On Thu, 25 Jul 2024 14:07:30 +0200
> > > Philipp Stanner <pstanner@xxxxxxxxxx> wrote:
> > >
> > > > pci_intx() is a function that becomes managed if
> > > > pcim_enable_device()
> > > > has been called in advance. Commit 25216afc9db5 ("PCI: Add
> > > > managed
> > > > pcim_intx()") changed this behavior so that pci_intx() always
> > > > leads
> > > > to
> > > > creation of a separate device resource for itself, whereas
> > > > earlier,
> > > > a
> > > > shared resource was used for all PCI devres operations.
> > > >
> > > > Unfortunately, pci_intx() seems to be used in some drivers'
> > > > remove()
> > > > paths; in the managed case this causes a device resource to be
> > > > created
> > > > on driver detach.
> > > >
> > > > Fix the regression by only redirecting pci_intx() to its
> > > > managed
> > > > twin
> > > > pcim_intx() if the pci_command changes.
> > > >
> > > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> > >
> > > I'm seeing another issue from this, which is maybe a more general
> > > problem with managed mode. In my case I'm using vfio-pci to
> > > assign
> > > an
> > > ahci controller to a VM.
> >
> > "In my case" doesn't mean OOT, does it? I can't fully follow.
>
> "OOT" Out Of Tree? No, "In my case" is simply introducing the
> scenario
> in which I see the issue. vfio-pci is an in-tree driver used to
> attach
> devices to userspace drivers, such as QEMU. The ahci driver is
> loaded
> during system boot, setting the is_managed flag. The ahci driver is
> then unbound from the device and the vfio-pci driver is bound. The
> vfio-pci driver provides a uAPI for userspace drivers to operate a
> device in an an IOMMU protected context.
Alright, thx for the clarification.
>
> > > ahci_init_one() calls pcim_enable_device()
> > > which sets is_managed = true. I notice that nothing ever sets
> > > is_managed to false. Therefore now when I call pci_intx() from
> > > vfio-
> > > pci
> > > under spinlock, I get a lockdep warning
> >
> > I suppose you see the lockdep warning because the new pcim_intx()
> > can
> > now allocate, whereas before 25216afc9db5 it was
> > pcim_enable_device()
> > which allocated *everything* related to PCI devres.
> >
> > > as I no go through pcim_intx()
> > > code after 25216afc9db5
> >
> > You alwas went through pcim_intx()'s logic. The issue seems to be
> > that
> > the allocation step was moved.
>
> Unintentionally, yes, I believe so. vfio-pci is not a managed,
> devres
> driver and therefore had no expectation of using the managed code
> path.
Yes, I agree this needs to be fixed through the solution you proposed.
>
> > > since the previous driver was managed.
> >
> > what do you mean by "previous driver"?
>
> As noted, the ahci driver is first bound to the device at boot,
> unbound, and the vfio-pci driver bound to the device. The ahci
> driver
> is the previous driver.
>
> > > It seems
> > > like we should be setting is_managed to false is the driver
> > > release
> > > path, right?
> >
> > So the issue seems to be that the same struct pci_dev can be used
> > by
> > different drivers, is that correct?
>
> Yes, and more generically, the driver release should undo everything
> that has been configured by the driver probe.
>
> > If so, I think that can be addressed trough having
> > pcim_disable_device() set is_managed to false as you suggest.
>
> If that's sufficient and drivers only call pcim_disable_device() in
> their release function.
pcim_disable_device() is not intended to be used directly by drivers.
It's basically the devres callback for pcim_enable_device() and is
called in everyone's release path automatically by devres.
(I agree that the naming is not superbe)
> I also note that f748a07a0b64 ("PCI: Remove
> legacy pcim_release()") claims that:
>
> Thanks to preceding cleanup steps, pcim_release() is now not needed
> anymore and can be replaced by pcim_disable_device(), which is the
> exact counterpart to pcim_enable_device().
>
> However, that's not accurate as pcim_enable_device() adds a devm
> action, unconditionally calls pci_enable_device() and sets is_managed
> to true.
It's not accurate in regards with is_managed.
The rest is fine IMO. The devres callback shall be added, and the
unconditional call to pci_enable_device() is also desired.
> If we assume pcim_pin_device() is a valid concept, don't we
> still need to remove the devm action as well?
No. As pcim_disable_device() is the very devres callback, it does not
need to remove itself. Devres calls it once and then it's gone.
However, pcim_pin_device() IMO is not a valid concept. Who wants such
behavior IMO shall use pci_enable_device() and pci_disable_device()
manually.
I proposed removing it here:
https://lore.kernel.org/all/20240822073815.12365-2-pstanner@xxxxxxxxxx/
(Note that previously it could not be removed because
pcim_enable_device() also allocated all the stuff needed by
pci_request_region() etc.)
>
> > Another solution can could at least consider would be to use a
> > GFP_ATOMIC for allocation in get_or_create_intx_devres().
>
> If we look at what pci_intx() does without devres, it's simply
> reading
> and setting or clearing a bit in config space. I can attest that a
> driver author would have no expectation that such a function
> allocates
> memory
A driver author would not have any expectation of a function implicitly
doing anything with devres. That's the reason why I did all this work
in the first place, to, ultimately, remove this hybrid behavior from
all pci_ functions.
So that devres functions are clearly marked with pcim_
That is, however, not that easy because everyone who uses
pcim_enable_device() in combination with pci_intx() first has to be
ported to a pci_intx()-version that has nothing to do with devres.
> and there are scenarios where we want to call this with
> interrupts disabled, such as within an interrupt context. So, TBH,
> it
> might make sense to consider whether an allocation in this path is
> appropriate at all, but I'm obviously no expert in devres.
The entire hybrid nature from pci_intx() should be removed.
I'm working on that.
The hybrid nature was always there, but the allocation was not. It
would be removed later together with the hybrid devres usage.
>
> > I suppose your solution is the better one, though.
>
> I see you've posted a patch, I'll test it as soon as I'm able.
If it works from what I understand that should solve those issues for
now until we can phase out pci_intx() usage that relies on the device
resource.
---
btw. I just looked into the old code and found that this one also
already had a similar half-bug with is_managed. It never sets it to
false again, but pci_intx() runs into:
static struct pci_devres *find_pci_dr(struct pci_dev *pdev)
{
if (pci_is_managed(pdev))
return devres_find(&pdev->dev, pcim_release, NULL,
NULL);
return NULL;
}
So in your case pci_is_managed() would have also been true and the only
reason no problem occurred is that devres_find() doesn't find the
device resource of the previous driver anymore, so pci_intx() won't use
that memory.
---
Thanks for debugging,
P.
> Thanks,
>
> Alex
>