Re: pci: kernel crash in bus_find_device

From: Guenter Roeck
Date: Thu May 22 2014 - 13:57:19 EST


On Thu, May 22, 2014 at 09:19:40AM -0700, Francesco Ruggeri wrote:
> Aborting a search does not sound like a correct solution.
> How does a higher level user (eg for_each_pci_dev) know that a search
> was aborted and decide whether it should try again, assuming it would
> be ok repeating the action on the devices visited the first time?
>
Agreed, it is less than desirable.

I would consider this to be a secondary problem, though, the immediate
problem being the crash. One possible solution might be to have the various
functions return error codes (ERR_PTR), but that would be quite invasive as
well. I really think we need input from Greg and, if the solution touches
the PCI subsystem, from Bjorn Helgaas to find an acceptable solution
to that problem.
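
To make the ERR_PTR idea a bit more concrete, here is a rough userspace
sketch, not kernel code. The ERR_PTR/PTR_ERR/IS_ERR helpers are re-created
locally along the lines of <linux/err.h>; struct demo_dev and demo_find()
are made-up names for illustration only and do not correspond to anything
in the driver core. The point is just that a caller could tell "no match"
(NULL) apart from "search aborted" (an encoded -EAGAIN) and decide whether
to retry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace re-creation of the helpers found in the kernel's <linux/err.h>. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error codes live in the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a device sitting on a bus list. */
struct demo_dev {
	int id;
	int dead;	/* nonzero: node was deleted while still linked */
};

/*
 * Hypothetical lookup. Returns the matching entry, NULL if the walk
 * completed without a match, or ERR_PTR(-EAGAIN) if the walk ran into
 * a deleted node and had to stop early.
 */
static struct demo_dev *demo_find(struct demo_dev *devs, size_t n, int id)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (devs[i].dead)
			return ERR_PTR(-EAGAIN);	/* search aborted */
		if (devs[i].id == id)
			return &devs[i];
	}
	return NULL;					/* clean "not found" */
}
```

A caller would check IS_ERR() on the result before testing for NULL. Every
existing user that only checks for NULL would have to be audited, which is
why I call this approach invasive.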

Guenter

> Francesco
>
>
> On Thu, May 22, 2014 at 12:22 AM, Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> > On 05/22/2014 12:14 AM, Greg Kroah-Hartman wrote:
> >>
> >> On Wed, May 21, 2014 at 03:59:58PM -0700, Guenter Roeck wrote:
> >>>
> >>> On Wed, May 21, 2014 at 01:04:04PM -0700, Francesco Ruggeri wrote:
> >>>>
> >>>> I have been using an x86 platform.
> >>>> When I started working on it I got early crashes until I added the
> >>>> check for p not NULL in
> >>>>
> >>>> +void bus_release_device(struct device *dev)
> >>>> +{
> >>>> +	struct device_private *p = dev->p;
> >>>> +
> >>>> +	if (p && klist_node_attached(&p->knode_bus))
> >>>> +		klist_put_last(&p->knode_bus);
> >>>> +}
> >>>> +
> >>>>
> >>>> Maybe on powerpc *p is overwritten between device_del and device_release?
> >>>>
> >>>> Or maybe some of the BUG_ONs in the patch? The ones on knode_dead are
> >>>> treated as WARN_ONs in the current klist code.
> >>>> The one in BUG_ON(!klist_dec_and_del(n)); is new, and in my tests I
> >>>> ran into it without the second patch (but only when I ran my module
> >>>> and tests).
> >>>>
> >>> Hi Francesco,
> >>>
> >>> I replaced the BUG_ON with WARN_ON; still crashes.
> >>>
> >>> Anyway, the problem seems to be known. I found two related exchanges.
> >>>
> >>> [1] describes pretty much the same problem. I don't see if/where it was
> >>> ever fixed, though.
> >>>
> >>> [2] is a patch to fix the problem. It did not apply cleanly to 3.14,
> >>> so I had to make some adjustments in klist_iter_init_node. Resulting
> >>> patch is below. With this patch, the problem is gone. It is not perfect,
> >>> as it aborts the loop if it encounters a deleted kobject, but it is
> >>> better than nothing. Unfortunately, the patch never made it upstream;
> >>> no idea why.
> >>> Copying the author and Greg to get additional feedback.
> >>>
> >>> Guenter
> >>>
> >>> [1] https://lkml.org/lkml/2008/10/26/79
> >>> [2] https://lkml.org/lkml/2012/4/16/218
> >>
> >>
> >> 2 years ago? I have no idea what was up with that, sorry...
> >>
> >
> > Ok, but do you have comments on the patch itself in its current version?
> >
> > Guenter
> >
>