Re: Thunderbolt driver hotplug not working correctly

From: Andreas Noever
Date: Tue Aug 26 2014 - 11:58:36 EST


On Fri, Aug 15, 2014 at 11:14 PM, Andreas Noever
<andreas.noever@xxxxxxxxx> wrote:
> (Apparently I hit "reply" instead of "reply all" some time back, sorry
> for that. Re-adding CCs.)
>
> On Fri, Aug 15, 2014 at 7:35 PM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
>> On Fri, Aug 15, 2014 at 04:03:08PM +0200, Andreas Noever wrote:
>>> On Fri, Aug 15, 2014 at 2:48 PM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
>>> > On Fri, Aug 15, 2014 at 5:41 AM, Andreas Noever
>>> > <andreas.noever@xxxxxxxxx> wrote:
>>> >> On Fri, Aug 15, 2014 at 1:24 PM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
>>> >>> On Wed, Aug 13, 2014 at 4:05 PM, Andreas Noever
>>> >>> <andreas.noever@xxxxxxxxx> wrote:
>>> >>>> Hello Steven,
>>> >>>>
>>> >>>> I think that there are two problems:
>>> >>>> - The kernel does not notice that the device is gone.
>>> >>>> - The first hotplug operation after removing a coldplugged device fails.
>>> >>>>
>>> >>>> For the first one, could you check whether the pciehp (sub)-driver is loaded?
>>> >>>> (dmesg | grep pciehp should show something, the config option is
>>> >>>> CONFIG_HOTPLUG_PCI_PCIE).
>>> >>>>
>>> >>>> I was able to reproduce the second problem on my machine. Could you test whether
>>> >>>> this patch fixes the problem?
>>> >>>>
>>> >>>
>>> >>> With the patch I see that PCI bridge 09:00.0 survives the hotplug
>>> >>> events, but the bridge at 0a:00.0 and the Ethernet controller don't
>>> >>> survive.
>>> >>
>>> >> Is CONFIG_HOTPLUG_PCI_PCIE set? Any output from pciehp?
>>> >
>>> > CONFIG_HOTPLUG_PCI_PCIE=y
>>> >
>>> > Aug 15 04:17:55 twoflower kernel: pci_hotplug: PCI Hot Plug PCI Core version: 0.5
>>> > Aug 15 04:17:55 twoflower kernel: pciehp: Using ACPI for slot detection.
>>> > Aug 15 04:17:55 twoflower kernel: pciehp 0000:07:00.0:pcie24: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>>> > Aug 15 04:17:55 twoflower kernel: pciehp 0000:07:00.0:pcie24: service driver pciehp loaded
>>> > Aug 15 04:17:55 twoflower kernel: pciehp: PCI Express Hot Plug Controller Driver version: 0.4
>>> >
>>> > And that's all I get from pciehp.
>>>
>>> 07:00 is not one of the downstream ports. The driver should bind to
>>> 07:03-06. (On my system :00 does not even have the hotplug cap set).
>>>
>>> Does pciehp.pciehp_force=1 help?
>>
>> That looks more sensible.
>>
>> Aug 15 10:20:18 twoflower kernel: Command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
>> Aug 15 10:20:18 twoflower kernel: Kernel command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
>> Aug 15 10:20:18 twoflower kernel: pciehp: Using ACPI for slot detection.
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Bypassing BIOS check for pciehp use on 0000:00:1c.0
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:00:1c.0:pcie04: service driver pciehp loaded
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:00.0
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Device 0000:08:00.0 already exists at 0000:08:00, cannot hot-add
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: Cannot add device at 0000:08:00
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:00.0:pcie24: service driver pciehp loaded
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:03.0
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Slot #3 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Device 0000:09:00.0 already exists at 0000:09:00, cannot hot-add
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot add device at 0000:09:00
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:03.0:pcie24: service driver pciehp loaded
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:04.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:04.0
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:04.0:pcie24: Slot #4 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:04.0:pcie24: service driver pciehp loaded
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:05.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:05.0
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:05.0:pcie24: Slot #5 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:05.0:pcie24: service driver pciehp loaded
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:06.0:pcie24: Bypassing BIOS check for pciehp use on 0000:07:06.0
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:06.0:pcie24: Slot #6 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:07:06.0:pcie24: service driver pciehp loaded
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Bypassing BIOS check for pciehp use on 0000:0a:00.0
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Slot #9 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl- LLActRep+
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Timeout on hotplug command 0x00000000 (issued 0 msec ago)
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Device 0000:0b:00.0 already exists at 0000:0b:00, cannot hot-add
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Cannot add device at 0000:0b:00
>> Aug 15 10:20:18 twoflower kernel: pciehp 0000:0a:00.0:pcie24: service driver pciehp loaded
>> Aug 15 10:20:18 twoflower kernel: pciehp: PCI Express Hot Plug Controller Driver version: 0.4
>>
>> Though the "cannot hot-add" lines are worrying. The above is a boot with
>> the Ethernet dongle attached at boot.
>
> Yes, this is strange. Either the hp driver is getting spurious hotplug
> events or the thunderbolt driver tries to hotplug the already
> configured device. Can you send me the full dmesg and lspci -vvnn
> output for this scenario? Please also pass pciehp.pciehp_debug=1 to
> the kernel.
>
>> And here's a hotplug attempt (which at least successfully *removes* the device
>> from the tg3 driver's perspective, but hot-adding the device still fails):
>>
>> Aug 15 10:24:03 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Down event
>> Aug 15 10:24:03 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot remove display device 0000:09:00.0
>> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Up event
>> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Device 0000:09:00.0 already exists at 0000:09:00, cannot hot-add
>> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot add device at 0000:09:00
>> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Down event
>> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Cannot remove display device 0000:09:00.0
>> Aug 15 10:24:04 twoflower kernel: pciehp 0000:07:03.0:pcie24: Card not present on Slot(3-1)
>> Aug 15 10:24:06 twoflower kernel: pciehp 0000:0a:00.0:pcie24: unloading service driver pciehp
>> Aug 15 10:24:06 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Timeout on hotplug command 0x00001038 (issued 232550 msec ago)
>> Aug 15 10:24:27 twoflower kernel: pciehp 0000:07:03.0:pcie24: Card present on Slot(3-1)
>> Aug 15 10:24:27 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Up event
>> Aug 15 10:24:27 twoflower kernel: pciehp 0000:07:03.0:pcie24: Link Up event ignored on slot(3-1): already powering on
>> Aug 15 10:24:27 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Bypassing BIOS check for pciehp use on 0000:0a:00.0
>> Aug 15 10:24:27 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Slot #9 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl- LLActRep+
>> Aug 15 10:24:27 twoflower kernel: pciehp 0000:0a:00.0:pcie24: Timeout on hotplug command 0x00000000 (issued 0 msec ago)
>> Aug 15 10:24:47 twoflower kernel: pciehp 0000:07:03.0:pcie24: Card not present on Slot(3-1)
>> Aug 15 10:24:47 twoflower kernel: pciehp 0000:07:03.0:pcie24: slot(3-1): Link Down event
>> Aug 15 10:24:47 twoflower kernel: pciehp 0000:07:03.0:pcie24: Link Down event ignored on slot(3-1): already powering off
>
>
> "Cannot remove display device 0000:09:00.0"... The message comes from
> http://lxr.free-electrons.com/source/drivers/pci/hotplug/pciehp_pci.c#L112
>
> The pciehp driver tries to read from the removed device (which returns
> 0xffff) and thus it thinks that the VGA flag is set. I have no idea
> why presence is true here (it is read a few lines earlier). This is of
> course a little bit racy.
>
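To illustrate the "Cannot remove display device" point above: a config
read from a device that has already left the bus comes back with all
bits set, so any single-bit test on the result succeeds, including the
VGA enable bit in the bridge control register. The following is only a
minimal user-space sketch of that effect, not the pciehp code. The
helper names are hypothetical, the read is modelled as 16 bits wide to
match the 0xffff mentioned above (the driver may read a different
width), and BRIDGE_CTL_VGA mirrors the kernel's PCI_BRIDGE_CTL_VGA
value (0x08).

```c
#include <stdbool.h>
#include <stdint.h>

/* VGA Enable bit of the PCI Bridge Control register
 * (same value as PCI_BRIDGE_CTL_VGA in the kernel headers). */
#define BRIDGE_CTL_VGA 0x08

/* Hypothetical helper: models a config read of the bridge control
 * register. A read from a device that is no longer on the bus
 * completes with all bits set. */
static uint16_t read_bridge_control(bool device_present, uint16_t real_value)
{
	return device_present ? real_value : 0xffff;
}

/* Naive check: reports a VGA bridge whenever the bit is set, so it is
 * fooled by the all-ones value of a removed device. */
static bool looks_like_vga_bridge(uint16_t bridge_control)
{
	return (bridge_control & BRIDGE_CTL_VGA) != 0;
}

/* More careful variant (an assumption, not what the driver does):
 * treat an all-ones value as "device gone" before testing the bit. */
static bool looks_like_vga_bridge_checked(uint16_t bridge_control)
{
	if (bridge_control == 0xffff)
		return false;
	return (bridge_control & BRIDGE_CTL_VGA) != 0;
}
```

With a removed device, looks_like_vga_bridge() returns true even though
the real register never had the bit set, which would match the message
in the log. The checked variant only shows one way to tell the two
cases apart. It does nothing about the race on the presence flag.
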
>> Without the dongle attached at boot, the thunderbolt driver (and rest of the
>> kernel, for that matter) still stays silent when hotplugging it:
>>
>> Aug 15 10:26:24 twoflower kernel: Command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
>> Aug 15 10:26:24 twoflower kernel: Kernel command line: BOOT_IMAGE=/vmlinuz-3.16.0-ec2-11383-gc9d2642-dirty root=UUID=6146fd5a-e8b0-449f-8ba4-36676f089aae rw earlyprintk=verbose loglevel=5 libata.force=noncq rootflags=data=writeback intel_pstate=disable i915.lvds_channel_mode=2 pciehp.pciehp_force=1
>> Aug 15 10:26:24 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Bypassing BIOS check for pciehp use on 0000:00:1c.0
>> Aug 15 10:26:24 twoflower kernel: pciehp 0000:00:1c.0:pcie04: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
>> Aug 15 10:26:24 twoflower kernel: pciehp 0000:00:1c.0:pcie04: service driver pciehp loaded
>> Aug 15 10:26:24 twoflower kernel: pciehp: PCI Express Hot Plug Controller Driver version: 0.4
>>
>> Looking in lspci, it appears a bunch of devices (all of 06:00.0 and up) are
>> missing, which explains the thunderbolt driver's silence. Does Apple's firmware
>> only announce that the thunderbolt bus exists when a device is attached at
>> boot?
>
> Yes, you can try passing acpi_osi=Darwin. If that makes 06:00 etc.
> appear then I would also be interested in dmesg and lspci -vvnn.

If you have time, can you also run a test with the ACPI patches
applied? These are the last four patches from
https://github.com/anoever/thunderbolt/tree/acpi_rebased

Try applying those and booting without a TB device attached and
without acpi/pciehp parameters. Check that the TB controller is
present (06:00.0 and below) and that pciehp gets loaded for 07:03-06.
Then plug in a TB device.


> Thanks,
> Andreas
>
>>>
>>> >>>>
>>> >>>> ---
>>> >>>> drivers/thunderbolt/path.c | 21 ++++++++++++++++++++-
>>> >>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>> >>>>
>>> >>>> diff --git a/drivers/thunderbolt/path.c b/drivers/thunderbolt/path.c
>>> >>>> index 8fcf8a7..9562cd0 100644
>>> >>>> --- a/drivers/thunderbolt/path.c
>>> >>>> +++ b/drivers/thunderbolt/path.c
>>> >>>> @@ -150,7 +150,26 @@ int tb_path_activate(struct tb_path *path)
>>> >>>>
>>> >>>> /* Activate hops. */
>>> >>>> for (i = path->path_length - 1; i >= 0; i--) {
>>> >>>> - struct tb_regs_hop hop;
>>> >>>> + struct tb_regs_hop hop = { 0 };
>>> >>>> +
>>> >>>> + /*
>>> >>>> + * We do (currently) not tear down paths set up by the firmware.
>>> >>>> + * If a firmware device is unplugged and plugged in again then
>>> >>>> + * it can happen that we reuse some of the hops from the (now
>>> >>>> + * defunct) firmware path. This causes the hotplug operation to
>>> >>>> + * fail (the pci device does not show up). Clearing the hop
>>> >>>> + * before overwriting it fixes the problem.
>>> >>>> + *
>>> >>>> + * Should be removed once we discover and tear down firmware
>>> >>>> + * paths.
>>> >>>> + */
>>> >>>> + res = tb_port_write(path->hops[i].in_port, &hop, TB_CFG_HOPS,
>>> >>>> + 2 * path->hops[i].in_hop_index, 2);
>>> >>>> + if (res) {
>>> >>>> + __tb_path_deactivate_hops(path, i);
>>> >>>> + __tb_path_deallocate_nfc(path, 0);
>>> >>>> + goto err;
>>> >>>> + }
>>> >>>>
>>> >>>> /* dword 0 */
>>> >>>> hop.next_hop = path->hops[i].next_hop_index;
>>> >>>> --
>>> >>>> 2.0.4
>>> >>>>
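
For readers who want to see the addressing used by the quoted patch in
isolation: each hop entry occupies two dwords, so hop index i lives at
dword offset 2 * i, and the patch writes a zeroed entry there before
the real one. Below is a rough user-space sketch of that
clear-then-write order against a simulated hop table. struct sim_port,
sim_port_write() and sim_program_hop() are hypothetical stand-ins, not
the driver's API, and MAX_HOPS is an arbitrary size. A memory
simulation cannot show why the hardware needs the clear. It only shows
the offsets and the order of the two writes.

```c
#include <stdint.h>
#include <string.h>

#define HOP_DWORDS 2	/* each hop entry spans two dwords, as in the patch */
#define MAX_HOPS   8	/* arbitrary table size for this simulation */

/* Simulated hop config space of one port (stand-in for TB_CFG_HOPS). */
struct sim_port {
	uint32_t hops[MAX_HOPS * HOP_DWORDS];
};

/* Hypothetical stand-in for tb_port_write(): copy `length` dwords to
 * dword `offset`. Returns 0 on success, -1 if the range is invalid. */
static int sim_port_write(struct sim_port *port, const uint32_t *buf,
			  uint32_t offset, uint32_t length)
{
	if (offset > MAX_HOPS * HOP_DWORDS ||
	    length > MAX_HOPS * HOP_DWORDS - offset)
		return -1;
	memcpy(&port->hops[offset], buf, length * sizeof(*buf));
	return 0;
}

/* Clear the hop entry first, then program the new contents. */
static int sim_program_hop(struct sim_port *port, uint32_t hop_index,
			   const uint32_t entry[HOP_DWORDS])
{
	static const uint32_t zero[HOP_DWORDS] = { 0 };
	int res;

	res = sim_port_write(port, zero, HOP_DWORDS * hop_index, HOP_DWORDS);
	if (res)
		return res;
	return sim_port_write(port, entry, HOP_DWORDS * hop_index, HOP_DWORDS);
}
```

Programming hop index 3 touches only dwords 6 and 7 of the table.
Neighbouring entries are left alone, and an out-of-range index is
rejected before anything is written.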