USB-C DisplayPort display failing to stay active with Intel Barlow Ridge USB4 controller, power-management related issue?

From: Aaron Rainbolt
Date: Wed Oct 09 2024 - 23:01:36 EST


We're experiencing a Linux kernel bug affecting multiple Clevo X370SNx1
laptops (specifically the X370SNW1 variant). The bug appears to be
present in kernels 6.5 and later, and gets significantly worse with
kernel 6.11.2 (the latest stable release at the time of writing). It is
unclear whether all of the issues we encountered are the same bug, but
the primary problem appears to be a consequence of the power management
code involving Intel Barlow Ridge controllers and DisplayPort. The
issue occurs with the in-kernel Nouveau driver and also with the
proprietary NVIDIA drivers.

When a DisplayPort monitor is attached to these laptops via a USB-C
connection, the monitor is recognized by the system and comes on for
approximately 15 seconds. It then blanks out and is automatically
disconnected from the system as if it had been unplugged. It will
remain that way indefinitely until unplugged and replugged, or until
something "jiggles" (for lack of a better term) the thunderbolt driver.
When either of these things occurs, the display re-attaches and comes
back on for 15 seconds, then blanks out and detaches again. Various
things can "jiggle" the thunderbolt driver, including but not limited
to:

* Running `lspci -k` (this one came as a particular surprise)
* Removing and re-inserting the thunderbolt driver (`sudo modprobe -r
  thunderbolt; sleep 1; sudo modprobe thunderbolt`)
* Running `nvidia-detector` while proprietary NVIDIA drivers are loaded

It is possible to mitigate this issue by simply running
`sudo modprobe -r thunderbolt` or `sudo rmmod thunderbolt` and then
leaving the driver unloaded. USB-C displays become stable after this -
they are recognized when attached and remain recognized and functional
indefinitely as one would expect.

We believe this is related to the Intel Barlow Ridge USB4 controller
because:

* Removing the thunderbolt driver restores normal display operation.
* This issue did *not* occur on Clevo X370SNx machines, which are
  identical to the X370SNx1 except that the Maple Ridge TBT controller
  on the board has been replaced with a Barlow Ridge USB4 controller.
* This problem does not occur on the affected models with the 6.1
  kernel. It occurs with the 6.5 kernel and with all newer kernels we
  have tried.

Furthermore, from inspecting the Thunderbolt driver code, we believe
this is related to the power management features of the driver, because:

* There is only one 15-second timeout defined in the driver source
  code, that being TB_AUTOSUSPEND_DELAY in drivers/thunderbolt/tb.h.
* On earlier kernels (Ubuntu’s variant of 6.8 at least), displays are
  stable even when the thunderbolt driver is loaded if we:
  * Remove the thunderbolt driver
  * Attach a USB-C dock
  * Attach displays to the dock (we used two 4K HDMI monitors)
  * Reload the thunderbolt driver

During our investigation, we discovered commit
a75e0684efe567ae5f6a8e91a8360c4c1773cf3a (patch on mailing list at
https://lore.kernel.org/linux-usb/20240213114318.3023150-1-mika.westerberg@xxxxxxxxxxxxxxx/)
which appears to be a fix for this exact problem. It adds a quirk for
Intel Barlow Ridge controllers that detects when a DisplayPort device
has been plugged directly into the USB4 port (thus using "redrive"
mode) and instructs the power management subsystem not to power the
chip down during this time. Unfortunately, this quirk seems to be
silently ignored: we built a custom kernel with `printk` lines added to
the `tb_enter_redrive` and `tb_exit_redrive` functions to announce when
they were called, and nothing in the dmesg log indicated that either
function had ever been called.
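
To make the intended mechanism concrete, here is a minimal user-space
sketch of how a held runtime PM reference would keep the controller
from autosuspending. It is a simplified model, not the kernel's runtime
PM implementation: `mock_dev`, `mock_pm_get`, `mock_pm_put`, and
`mock_may_autosuspend` are hypothetical names, and the only value taken
from the driver is the 15-second delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the 15-second autosuspend delay; the name is hypothetical. */
#define MOCK_AUTOSUSPEND_DELAY_MS 15000u

/* Hypothetical stand-in for a device under runtime PM. */
struct mock_dev {
	int usage_count;	/* outstanding runtime PM references */
	unsigned int idle_ms;	/* time since the last activity */
};

/* Take a reference, as the redrive quirk is meant to do on entry. */
void mock_pm_get(struct mock_dev *d)
{
	assert(d != NULL);
	d->usage_count++;
}

/* Drop a reference, as the redrive quirk is meant to do on exit. */
void mock_pm_put(struct mock_dev *d)
{
	assert(d != NULL && d->usage_count > 0);
	d->usage_count--;
}

/* The device may power down only when idle long enough AND unreferenced. */
bool mock_may_autosuspend(const struct mock_dev *d)
{
	assert(d != NULL);
	return d->usage_count == 0 && d->idle_ms >= MOCK_AUTOSUSPEND_DELAY_MS;
}
```

In this model, a device that has been idle for 15 seconds is allowed to
suspend unless something called `mock_pm_get` first. That matches what
we see: if the enter-redrive path never runs, nothing holds a reference
and the controller powers down once the delay expires.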

This bug is easily reproducible using the stock kernels in Kubuntu
22.04, Kubuntu 24.04, Kali Linux 2024.2, and Fedora Workstation
Rawhide. Similar behavior is observed across all of these distributions.

We built the 6.11.2 kernel from source and tested it on Kubuntu 24.04,
but while the kernel built, installed, and functioned properly in most
respects, it actually made the problem with USB-C displays worse. As
long as the thunderbolt driver was loaded, no displays were detected
when plugged in (not even briefly), and when the thunderbolt driver was
unloaded, displays were recognized and functional only if a single
display was attached. Attaching a second display resulted in the first
external display becoming detached and the second display not coming
on. Unplugging the second display resulted in the first display
reattaching. This machine supports up to three external displays, and
this has proven to be achievable and stable with earlier kernels. No
useful error messages were logged in dmesg when these problems
occurred.

Our testing has been limited to the Clevo X370SNW1 model; however, we
expect that the X370SNV1 model will exhibit the same issues, as it uses
very similar internal components on the system board.

This is basically the extent of our knowledge at this point. We
attempted various patches on Ubuntu's 6.8 kernel to resolve the issue,
all without success:

* We attempted reverting fd4d58d1fef9ae9b0ee235eaad73d2e0a6a73025
  (thunderbolt: Enable CL2 low power state), which had no effect.
* We noticed that one of the Barlow Ridge bridge controllers listed by
  `lspci -k` did not appear to have its device ID in
  drivers/thunderbolt/nhi.h, and that there was a corresponding quirk
  in drivers/thunderbolt/quirks.c that looked like it might be vaguely
  related to the issue (specifically quirk_usb3_maximum_bandwidth). We
  tried adding that device to the appropriate files so that the quirk
  would apply to it as well. This had no visible effect on the kernel's
  operation and did not resolve the issue.
* After narrowing it down to `quirk_block_rpm_in_redrive`, we attempted
  adding a new `thunderbolt.kf_force_redrive` kernel parameter in
  drivers/thunderbolt/tb.c that forced the code in `tb_enter_redrive`
  and `tb_exit_redrive` to be executed even if the device didn't have
  the appropriate quirk bit set, in the hope that this would make the
  quirk take effect and resolve the issue. Instead, `tb_enter_redrive`
  was never called at all, while `tb_exit_redrive` was. As a result,
  USB-C displays were no longer recognized even briefly while the
  thunderbolt driver was loaded.
* Looking at PCI vendor IDs, we noticed that the PCI vendor ID used to
  recognize all Intel controllers in drivers/thunderbolt/quirks.c was
  0x8087, whereas the Barlow Ridge controller in our device reported a
  vendor ID of 0x8086. On the off chance that this was a typo of epic
  proportions, we tried changing all of the occurrences of 0x8087 in
  the tb_quirks[] array to PCI_VENDOR_ID_INTEL (which is defined as
  0x8086 in include/linux/pci_ids.h). This had no visible effect on the
  kernel's behavior and did not resolve the issue. (Presumably there's
  something going on with the IDs there that we're not aware of.)
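
One possible explanation for the ID mismatch, offered as an assumption
rather than something we have confirmed: the quirk table may be matched
against the vendor and device IDs that the USB4 router reports about
itself, not against the PCI IDs shown by `lspci`. The user-space sketch
below models that kind of table lookup. Every name in it is
hypothetical, the device ID is a placeholder, and the flag bit is an
assumed value; only 0x8087 and 0x8086 come from the observations above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_QUIRK_KEEP_POWER (1u << 2)	/* assumed bit, for illustration */
#define MOCK_DEVICE_ID 0x1234		/* placeholder, not a real device ID */

/* Hypothetical stand-in for the IDs a router reports about itself. */
struct mock_router {
	uint16_t hw_vendor_id;
	uint16_t hw_device_id;
	unsigned int quirks;
};

struct mock_quirk {
	uint16_t hw_vendor_id;	/* 0 matches any vendor */
	uint16_t hw_device_id;	/* 0 matches any device */
	void (*hook)(struct mock_router *r);
};

void mock_block_rpm_in_redrive(struct mock_router *r)
{
	r->quirks |= MOCK_QUIRK_KEEP_POWER;
}

const struct mock_quirk mock_quirks[] = {
	{ 0x8087, MOCK_DEVICE_ID, mock_block_rpm_in_redrive },
};

/* Run the hook of every entry whose non-zero fields all match. */
void mock_check_quirks(struct mock_router *r)
{
	size_t i;

	assert(r != NULL);
	for (i = 0; i < sizeof(mock_quirks) / sizeof(mock_quirks[0]); i++) {
		const struct mock_quirk *q = &mock_quirks[i];

		if (q->hw_vendor_id && q->hw_vendor_id != r->hw_vendor_id)
			continue;
		if (q->hw_device_id && q->hw_device_id != r->hw_device_id)
			continue;
		q->hook(r);
	}
}
```

Under this model, a router that reports 0x8087 gets the quirk bit and
one that reports 0x8086 does not. If the real table works this way,
rewriting the entries to 0x8086 would stop them from matching anything,
which would be consistent with our change having no visible effect.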

As for what's wrong, my speculation is that it is likely a combination
of two things:

* Some data in the `tb_quirks` array in drivers/thunderbolt/quirks.c is
  incorrect, leading to the Barlow Ridge controllers not being
  recognized as needing the DisplayPort redrive mode quirk.
* The code in `tb_dp_resource_unavailable` (drivers/thunderbolt/tb.c)
  that controls whether or not to run `tb_enter_redrive` is faulty in
  some way and is not calling `tb_enter_redrive` in all scenarios where
  it is necessary. To be clear, the exact code I'm talking about is
  this chunk from the aforementioned function:

    tunnel = tb_find_tunnel(tb, TB_TUNNEL_DP, in, out);
    if (tunnel)
            tb_deactivate_and_free_tunnel(tunnel);
    else
            tb_enter_redrive(port);

Finally, this is probably a result of me misreading the driver code
somehow, but I was surprised by the following conditional at the top
of `tb_enter_redrive`:

    if (!(sw->quirks & QUIRK_KEEP_POWER_IN_DP_REDRIVE))
            return;

To me this reads as "if the DP redrive quirk bit is set, return and do
nothing. Otherwise, if the bit is not set, run the quirk function."
This is the opposite of what I would expect - shouldn't the code run if
the bit is set, not if it is clear? Or does the bit being unset mean
that the quirk is active? (I do not believe that this is the root cause
of the issue because even when I forced this function to run any time
it was invoked, it wasn't being invoked at all.)
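
The guard itself can be checked in isolation. The sketch below is a
user-space mock, not the driver code: `mock_switch` and
`mock_enter_redrive` are hypothetical names and the flag value is
assumed. It keeps only the shape of the conditional and reports whether
the body ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed flag value; only the shape of the test matters here. */
#define MOCK_KEEP_POWER_IN_DP_REDRIVE (1u << 2)

/* Hypothetical stand-in for the switch structure. */
struct mock_switch {
	unsigned int quirks;
	bool redrive;
};

/* Returns true if the body ran, false if the guard returned early. */
bool mock_enter_redrive(struct mock_switch *sw)
{
	assert(sw != NULL);

	/*
	 * (quirks & FLAG) is non-zero when the bit is set, so the
	 * negation is true only when the bit is clear.
	 */
	if (!(sw->quirks & MOCK_KEEP_POWER_IN_DP_REDRIVE))
		return false;	/* bit clear: nothing to do */

	sw->redrive = true;	/* bit set: the quirk body runs */
	return true;
}
```

With the bit clear the function returns early, and with the bit set the
body runs. So the conditional follows the usual early-return pattern,
where the quirk code executes only for devices that have the bit set.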

This issue has only been definitively reproduced on already-EOL kernels
due to the (potentially related) problem encountered with 6.11.2.
However, based on a code comparison, all of the apparently relevant
code (that which deals with the DP quirk) appears to be identical
between Ubuntu's variant of the 6.8 kernel and the tip of the mainline
master branch. Therefore, I believe this issue very likely impacts the
latest mainline kernel.