Re: Thunderbolt driver hotplug not working correctly

From: Andreas Noever
Date: Fri Aug 15 2014 - 17:14:38 EST


(Apparently I hit "reply" instead of "reply all" some time back, sorry
for that. Re-adding CCs.)

On Fri, Aug 15, 2014 at 7:35 PM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
> On Fri, Aug 15, 2014 at 04:03:08PM +0200, Andreas Noever wrote:
>> On Fri, Aug 15, 2014 at 2:48 PM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
>> > On Fri, Aug 15, 2014 at 5:41 AM, Andreas Noever
>> > <andreas.noever@xxxxxxxxx> wrote:
>> >> On Fri, Aug 15, 2014 at 1:24 PM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
>> >>> On Wed, Aug 13, 2014 at 4:05 PM, Andreas Noever
>> >>> <andreas.noever@xxxxxxxxx> wrote:
>> >>>> Hello Steven,
>> >>>>
>> >>>> I think that there are two problems:
>> >>>> - The Kernel does not notice that the device is gone.
>> >>>> - The first hotplug operation, after removing a coldplugged device fails.
>> >>>>
>> >>>> For the first one, could you check whether the pciehp (sub)-driver is loaded?
>> >>>> (dmesg | grep pciehp should show something, the config option is
>> >>>> CONFIG_HOTPLUG_PCI_PCIE).
>> >>>>
>> >>>> I was able to reproduce the second problem on my machine. Could you test whether
>> >>>> this patch fixes the problem?
>> >>>>
>> >>>
>> >>> With the patch I see that PCI bridge 09:00.0 survives the hotplug
>> >>> events, but the bridge at 0a:00.0 and the Ethernet controller don't
>> >>> survive.
>> >>
>> >> Is CONFIG_HOTPLUG_PCI_PCIE set? Any output from pciehp?
>> >
>> > CONFIG_HOTPLUG_PCI_PCIE=y
>> >
>> > Aug 15 04:17:55 twoflower kernel: pci_hotplug: PCI Hot Plug PCI Core
>> > version: 0.5
>> > Aug 15 04:17:55 twoflower kernel: pciehp: Using ACPI for slot detection.
>> > Aug 15 04:17:55 twoflower kernel: pciehp 0000:07:00.0:pcie24: Slot #0
>> > AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> > Aug 15 04:17:55 twoflower kernel: pciehp 0000:07:00.0:pcie24: service
>> > driver pciehp loaded
>> > Aug 15 04:17:55 twoflower kernel: pciehp: PCI Express Hot Plug
>> > Controller Driver version: 0.4
>> >
>> > And that's all I get from pciehp.
>>
>> 07:00 is not one of the downstream ports. The driver should bind to
>> 07:03-06. (On my system :00 does not even have the hotplug cap set).
>>
>> Does pciehp.pciehp_force=1 help?
>
> That looks more sensible.
>
> Aug 15 10:20:18 twoflower kernel: Command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
> Aug 15 10:20:18 twoflower kernel: Kernel command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
> Aug 15 10:20:18 twoflower kernel: pciehp: Using ACPI for slot detection.
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Bypassing BIOS check for pciehp use on 0000:00:1c.0
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:00:1c.0:pcie04: service driver pciehp loaded
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:00.0
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Device 0000:08:00.0 already exists at 0000:08:00, cannot hot-add
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Cannot add device at 0000:08:00
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: service driver pciehp loaded
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:03.0
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Slot #3 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Device 0000:09:00.0 already exists at 0000:09:00, cannot hot-add
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot add device at 0000:09:00
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: service driver pciehp loaded
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:04.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:04.0
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:04.0:pcie24: Slot #4 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:04.0:pcie24: service driver pciehp loaded
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:05.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:05.0
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:05.0:pcie24: Slot #5 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:05.0:pcie24: service driver pciehp loaded
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:06.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:06.0
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:06.0:pcie24: Slot #6 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:06.0:pcie24: service driver pciehp loaded
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Bypassing BIOS check for pciehp use on 0000:0a:00.0
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Slot #9 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl- LLActRep+
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Timeout on hotplug command 0x00000000 (issued 0 msec ago)
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Device 0000:0b:00.0 already exists at 0000:0b:00, cannot hot-add
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Cannot add device at 0000:0b:00
> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: service driver pciehp loaded
> Aug 15 10:20:18 twoflower kernel: pciehp: PCI Express Hot Plug Controller Driver version: 0.4
>
> Though the "cannot hot-add" lines are worrying. The above is a boot with
> the Ethernet dongle attached at boot.

Yes, this is strange. Either the pciehp driver is getting spurious
hotplug events or the thunderbolt driver tries to hotplug the already
configured device. Can you send me the full dmesg and lspci -vvnn
output for this scenario? Please also pass pciehp.pciehp_debug=1 to
the kernel.

> And here's a hotplug attempt (which at least successfully *removes* the device
> from the tg3 driver's perspective, but hot-adding the device still fails):
>
> Aug 15 10:24:03 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Down event
> Aug 15 10:24:03 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot remove display device 0000:09:00.0
> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Up event
> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Device 0000:09:00.0 already exists at 0000:09:00, cannot hot-add
> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot add device at 0000:09:00
> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Down event
> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot remove display device 0000:09:00.0
> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Card not present on Slot(3-1)
> Aug 15 10:24:06 twoflower kernel: pciehp 0000:0a:00.0:pcie24: unloading service driver pciehp
> Aug 15 10:24:06 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Timeout on hotplug command 0x00001038 (issued 232550 msec ago)
> Aug 15 10:24:27 twoflower kernel: pciehp 0000:07:03.0:pcie24: Card present on Slot(3-1)
> Aug 15 10:24:27 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Up event
> Aug 15 10:24:27 twoflower kernel: pciehp 0000:07:03.0:pcie24: Link Up event ignored on slot(3-1): already powering on
> Aug 15 10:24:27 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Bypassing BIOS check for pciehp use on 0000:0a:00.0
> Aug 15 10:24:27 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Slot #9 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl- LLActRep+
> Aug 15 10:24:27 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Timeout on hotplug command 0x00000000 (issued 0 msec ago)
> Aug 15 10:24:47 twoflower kernel: pciehp 0000:07:03.0:pcie24: Card not present on Slot(3-1)
> Aug 15 10:24:47 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Down event
> Aug 15 10:24:47 twoflower kernel: pciehp 0000:07:03.0:pcie24: Link Down event ignored on slot(3-1): already powering off


"Cannot remove display device 0000:09:00.0"... The message comes from
http://lxr.free-electrons.com/source/drivers/pci/hotplug/pciehp_pci.c#L112

The pciehp driver tries to read from the removed device, which returns
all ones (0xffff), and thus it thinks that the VGA flag is set. I have
no idea why presence is still true here (it is read a few lines
earlier, before the config read). This is of course a little bit racy.

> Without the dongle attached at boot, the thunderbolt driver (and rest of the
> kernel, for that matter) still stays silent when hotplugging it:
>
> Aug 15 10:26:24 twoflower kernel: Command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
> Aug 15 10:26:24 twoflower kernel: Kernel command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
> Aug 15 10:26:24 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Bypassing BIOS check for pciehp use on 0000:00:1c.0
> Aug 15 10:26:24 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
> Aug 15 10:26:24 twoflower kernel: pciehp 0000:00:1c.0:pcie04: service driver pciehp loaded
> Aug 15 10:26:24 twoflower kernel: pciehp: PCI Express Hot Plug Controller Driver version: 0.4
>
> Looking in lspci, it appears a bunch of devices (all of 06:00.0 and up) are
> missing, which explains the thunderbolt driver's silence. Does Apple's firmware
> only announce that the thunderbolt bus exists when a device is attached at
> boot?

Yes, you can try passing acpi_osi=Darwin. If that makes 06:00 etc.
appear then I would also be interested in dmesg and lspci -vvnn.

Thanks,
Andreas

>>
>> >>>>
>> >>>> ---
>> >>>> drivers/thunderbolt/path.c | 21 ++++++++++++++++++++-
>> >>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> >>>>
>> >>>> diff --git a/drivers/thunderbolt/path.c b/drivers/thunderbolt/path.c
>> >>>> index 8fcf8a7..9562cd0 100644
>> >>>> --- a/drivers/thunderbolt/path.c
>> >>>> +++ b/drivers/thunderbolt/path.c
>> >>>> @@ -150,7 +150,26 @@ int tb_path_activate(struct tb_path *path)
>> >>>>
>> >>>> /* Activate hops. */
>> >>>> for (i = path->path_length - 1; i >= 0; i--) {
>> >>>> - struct tb_regs_hop hop;
>> >>>> + struct tb_regs_hop hop = { 0 };
>> >>>> +
>> >>>> + /*
>> >>>> + * We do (currently) not tear down paths setup by the firmeware.
>> >>>> + * If a firmware device is unplugged and plugged in again then
>> >>>> + * it can happen that we reuse some of the hops from the (now
>> >>>> + * defunct) firmeware path. This causes the hotplug operation to
>> >>>> + * fail (the pci device does not show up). Clearing the hop
>> >>>> + * before overwriting it fixes the problem.
>> >>>> + *
>> >>>> + * Should be removed once we discover and tear down firmeware
>> >>>> + * paths.
>> >>>> + */
>> >>>> + res = tb_port_write(path->hops[i].in_port, &hop, TB_CFG_HOPS,
>> >>>> + 2 * path->hops[i].in_hop_index, 2);
>> >>>> + if (res) {
>> >>>> + __tb_path_deactivate_hops(path, i);
>> >>>> + __tb_path_deallocate_nfc(path, 0);
>> >>>> + goto err;
>> >>>> + }
>> >>>>
>> >>>> /* dword 0 */
>> >>>> hop.next_hop = path->hops[i].next_hop_index;
>> >>>> --
>> >>>> 2.0.4
>> >>>>