Re: [PATCH 1/8] pci: Make sriov work with hotplug removal

From: Bjorn Helgaas
Date: Mon Oct 17 2011 - 18:12:57 EST


On Mon, Oct 17, 2011 at 12:08 PM, Yinghai Lu <yinghai.lu@xxxxxxxxxx> wrote:
> On 10/17/2011 10:16 AM, Bjorn Helgaas wrote:
>>
>> On Sat, Oct 15, 2011 at 7:31 PM, Yinghai Lu<yinghai.lu@xxxxxxxxxx>  wrote:
>>>
>>> When hot remove pci express module, got
>>>
>>> [ 5918.610127] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 1
>>> [ 5918.615779] pciehp 0000:80:02.2:pcie04: Attention button interrupt
>>> received
>>> [ 5918.622730] pciehp 0000:80:02.2:pcie04: Button pressed on Slot(3)
>>> [ 5918.629002] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status:
>>> SLOTCTRL a8 value read 1f9
>>> [ 5918.637416] pciehp 0000:80:02.2:pcie04: PCI slot #3 - powering off due
>>> to button press.
>>> [ 5918.647125] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 10
>>> [ 5918.653039] pciehp 0000:80:02.2:pcie04: pciehp_green_led_blink:
>>> SLOTCTRL a8 write cmd 200
>>> [ 5918.661229] pciehp 0000:80:02.2:pcie04: pciehp_set_attention_status:
>>> SLOTCTRL a8 write cmd c0
>>> [ 5924.667627] pciehp 0000:80:02.2:pcie04: Disabling
>>> domain:bus:device=0000:b0:00
>>> [ 5924.674909] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status:
>>> SLOTCTRL a8 value read 2f9
>>> [ 5924.683262] pciehp 0000:80:02.2:pcie04: pciehp_unconfigure_device:
>>> domain:bus:dev = 0000:b0:00
>>> [ 5924.693976] libfcoe_device_notification: NETDEV_UNREGISTER eth6
>>> [ 5924.764979] libfcoe_device_notification: NETDEV_UNREGISTER eth14
>>> [ 5924.873539] libfcoe_device_notification: NETDEV_UNREGISTER eth15
>>> [ 5924.995209] libfcoe_device_notification: NETDEV_UNREGISTER eth16
>>> [ 5926.114407] sxge 0000:b2:00.0: PCI INT A disabled
>>> [ 5926.119342] BUG: unable to handle kernel NULL pointer dereference at
>>> (null)
>>> [ 5926.127189] IP: [<ffffffff81353a3b>] pci_stop_bus_device+0x33/0x83
>>> [ 5926.133377] PGD 0
>>> [ 5926.135402] Oops: 0000 [#1] SMP
>>> [ 5926.138659] CPU 2
>>> [ 5926.140499] Modules linked in:
>>> ...
>>> [ 5926.143754]
>>> [ 5926.275823] Call Trace:
>>> [ 5926.278267]  [<ffffffff81353a38>] pci_stop_bus_device+0x30/0x83
>>> [ 5926.284180]  [<ffffffff81353af4>] pci_remove_bus_device+0x1a/0xba
>>> [ 5926.290264]  [<ffffffff81366311>]
>>> pciehp_unconfigure_device+0x110/0x17b
>>> [ 5926.296866]  [<ffffffff81365dd9>] ? pciehp_disable_slot+0x188/0x188
>>> [ 5926.303123]  [<ffffffff81365d6f>] pciehp_disable_slot+0x11e/0x188
>>> [ 5926.309206]  [<ffffffff81365e68>] pciehp_power_thread+0x8f/0xe0
>>> ...
>>>
>>> Root cause: when doing pci_stop_bus_device() with phys fn, it will stop
>>> virt fn and
>>> remove the fn, so
>>>        list_for_each_safe(l, n,&bus->devices)
>>> will have problem to refer freed n that is pointed to vf entry.
>>>
>>> Solution is just call pci_stop_bus_device() with phys fn only. and before
>>> that need to
>>> save phys fn aside and avoid to use bus->devices to loop.
>>>
>>> Signed-off-by: Yinghai Lu<yinghai@xxxxxxxxxx>
>>>
>>> ---
>>>  drivers/pci/remove.c |   33 +++++++++++++++++++++++++++++++++
>>>  1 file changed, 33 insertions(+)
>>>
>>> Index: linux-2.6/drivers/pci/remove.c
>>> ===================================================================
>>> --- linux-2.6.orig/drivers/pci/remove.c
>>> +++ linux-2.6/drivers/pci/remove.c
>>> @@ -120,10 +120,43 @@ void pci_remove_behind_bridge(struct pci
>>>                        pci_remove_bus_device(pci_dev_b(l));
>>>  }
>>>
>>> +struct dev_list {
>>> +       struct pci_dev *dev;
>>> +       struct list_head list;
>>> +};
>>> +
>>>  static void pci_stop_bus_devices(struct pci_bus *bus)
>>>  {
>>>        struct list_head *l, *n;
>>> +       struct dev_list *dl, *dn;
>>> +       LIST_HEAD(physfn_list);
>>> +
>>> +       /* Save phys_fn aside at first */
>>> +       list_for_each(l,&bus->devices) {
>>> +               struct pci_dev *dev = pci_dev_b(l);
>>> +
>>> +               if (!dev->is_virtfn) {
>>> +                       dl = kmalloc(sizeof(*dl), GFP_KERNEL);
>>> +                       if (!dl)
>>> +                               continue;
>>> +                       dl->dev = dev;
>>> +                       list_add_tail(&dl->list,&physfn_list);
>>> +               }
>>> +       }
>>> +
>>> +       /*
>>> +        * stop bus device for phys_fn at first
>>> +        *  it will stop and remove vf in driver remove action
>>> +        */
>>> +       list_for_each_entry_safe(dl, dn,&physfn_list, list) {
>>> +               struct pci_dev *dev = dl->dev;
>>> +
>>> +               pci_stop_bus_device(dev);
>>> +
>>> +               kfree(dl);
>>> +       }
>>>
>>> +       /* Do it again for left over in case */
>>>        list_for_each_safe(l, n,&bus->devices) {
>>>                struct pci_dev *dev = pci_dev_b(l);
>>>                pci_stop_bus_device(dev);
>>
>> Apparently it's a problem to stop a VF before its PF?  Why is that?
>> Sounds like there's an important dependency here, but I can't figure
>> it out.  I would have naively expected that you would *want* to stop a
>> VF first, since it depends on the PF.
>
> when you stop the VF, that VF will not be removed.
>
> when you stop the PF, the driver will be unloaded, it will *STOP* the VF at
> first and *REMOVE* the VF there. So the bus->devices will be touch at that
> time.
> for example when you have three VFs. those VFs will be removed from
> bus->devices, and n in the list_for_each_safe will have invalid pointer to
> freed entry.
>
> So stop the VF at first will *NOT* help.

No need to shout. I wasn't suggesting that stopping VFs first was the
right solution. It's just that stopping PFs first violates the
general design pattern of "discover top-down and remove bottom-up," so
it's not obvious why this is correct.

Let me see if I understand this. Here's the path where I think the problem is:

pciehp_unconfigure_device
pci_remove_bus_device
pci_stop_bus_device
pci_stop_bus_devices(subordinate)
list_for_each(...) <-- A
pci_stop_bus_device(PF)
pci_stop_dev
device_unregister
...
driver->remove, e.g., igb_remove
pci_disable_sriov
sriov_disable
virtfn_remove
pci_remove_bus_device
pci_destroy_dev
<remove from bus_list> <-- B

The hotplug removal starts with pci_remove_bus_device(), which calls
pci_stop_bus_device() on the PF, which ultimately calls the
driver->remove() method, which (since it's the PF) calls
pci_disable_sriov(), which destroys all the VFs and re-enters
pci_remove_bus_device() for each of them.

We're manipulating the same list at A and B, so when we get back to A
after removing all the VFs, the list has been changed out from under
us. Your patch avoids the problem by first putting the PFs on a
separate list and walking that instead, so the list at A is now
different from the list at B. Right?

Maybe this is the best we can do, but it still doesn't seem ideal, and
it's certainly not obvious when reading the code. It doesn't seem
right for the driver ->remove() method to be calling
pci_destroy_dev(). Won't the core data structures be corrupted if a
defective driver doesn't call pci_disable_sriov()? Seems like we
could end up with a device that's been physically removed, but still
has pci_dev structs hanging around.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/