Re: [PATCH 0/1] thunderbolt: Fix blank external display after HRR on USB4 v2
From: Chia-Lin Kao (AceLan)
Date: Wed May 27 2026 - 23:44:11 EST

Hi Mika,

Sorry for the late reply; I was away for two weeks in early May.

On Thu, Apr 30, 2026 at 12:03:11PM +0200, Mika Westerberg wrote:
> Hi,
>
> On Thu, Apr 30, 2026 at 03:31:42PM +0800, Chia-Lin Kao (AceLan) wrote:
> > Hi,
> >
> > On Dell XPS 14 (Panther Lake) with a WD22TB4 Thunderbolt dock and BenQ
> > PD2725U external display, the display goes permanently blank on ~50% of
> > boots. The only way to recover is a full reboot — re-plugging the
> > monitor or dock does not help.
> >
> > The root cause is a race between the USB4 v2 Host Router Reset (HRR)
> > and the graphics driver initialization:
> >
> > 1. nhi_probe() performs HRR at ~t=1s, destroying BIOS-established
> > DP tunnels.
> > 2. The Thunderbolt driver re-discovers the dock via hotplug at ~t=4s
> > and attempts to re-create the DP tunnel.
> > 3. DPRX negotiation fails because the graphics driver (xe) is not yet
> > ready — the 12-second timeout expires at ~t=18s.
> > 4. tb_dp_tunnel_active() permanently removes the DP IN adapter from
> > available resources on the first failure, so the display never
> > recovers.
> >
> > The fix adds a retry mechanism: on DPRX negotiation failure, the driver
> > retries up to 3 times with a 5-second delay, giving the graphics driver
> > time to come up.
> >
> > Tested with 13 boot cycles on the affected machine:
> > - 6 boots hit the HRR + DPRX race: all recovered via retry, display
> > came online after 3 retry attempts (~58s).
> > - 5 clean boots (no HRR): DP tunnel established immediately.
> > - 2 boots with HRR where DPRX succeeded on first try.
> > - 0 teardowns: the retry mechanism was never exhausted.
> >
> > Full dmesg log - https://people.canonical.com/~acelan/bugs/dp-retry-on-hrr/
>
> I'm looking at that but the first thing that stands out is this:
>
> [ 1.051684] thunderbolt: loading out-of-tree module taints kernel.
>
> Which tells me that this has some potential modifications outside of the
> mainline.
>
> Second thing is that it's missing "thunderbolt.dyndbg=+p" that could show
> what is going on there. I suggest adding that pretty much always.
>
> Yes, this can happen and the 12 s idea was that it accounts for the
> possible time that it takes to boot up (as well as the polling the i915
> does if it is runtime suspended). I would say that whatever is delaying the
> boot time should be investigated first because that's not really good user
> experience.
>
> Aside from that if you add "thunderbolt.dprx_timeout=-1" does it work? If
> really needed we can increase that a bit but I'm not too enthusiastic
> adding code for retrying this because we do have this timeout that we can
> adjust as needed (we can make the default higher).

Thank you for reviewing and for the helpful suggestions.

I have an update on this issue: we've since discovered that a BIOS update
(from 1.2.1/1.3.1 to 1.5.1) on this Dell XPS 14 (Panther Lake) appears to
have resolved the blank display problem.

Looking at what changed: with the old BIOS, the firmware pre-established
PCIe tunnels through the dock during early boot; the dock's xHCI
(07:00.0) and the OWC NVMe (18:00.0) were already enumerated by the BIOS
before the kernel started. When nhi_probe() performed HRR at ~t=1s, it
destroyed those BIOS-established tunnels, killing xHCI mid-probe
("HC died; cleaning up") and causing the NVMe probe to fail with -EIO.
The subsequent DP tunnel re-creation then hit the DPRX timeout because
the graphics driver wasn't ready yet.
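
As a rough sanity check on the timing, the numbers quoted earlier in this
thread (first timeout expiring at ~t=18s, a 12-second DPRX timeout, a
5-second retry delay, up to 3 retries) can be put on a timeline. This is
only a back-of-the-envelope sketch in Python: it assumes every failed
attempt waits out the full timeout, which may not match what the driver
actually does, and the function name is made up for illustration.

```python
# Rough timeline check using the numbers quoted earlier in this thread.
# Assumption: every failed attempt waits out the full DPRX timeout and each
# retry starts after a fixed delay; the real driver timing may differ.

FIRST_TIMEOUT_EXPIRES_S = 18   # "~t=18s" from the original report
DPRX_TIMEOUT_S = 12            # 12-second DPRX timeout
RETRY_DELAY_S = 5              # delay before each retry in the patch


def retry_start_times(retries):
    """Return the approximate start time (seconds since boot) of each retry."""
    starts = []
    t = FIRST_TIMEOUT_EXPIRES_S
    for _ in range(retries):
        t += RETRY_DELAY_S          # wait before retrying
        starts.append(t)
        t += DPRX_TIMEOUT_S         # assume this attempt also times out
    return starts


if __name__ == "__main__":
    for n, start in enumerate(retry_start_times(3), start=1):
        print(f"retry {n} starts at ~t={start}s")
```

Under that assumption the third retry would start at about t=57s, which is
at least consistent with the display coming online at ~58s in the earlier
test runs.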

With BIOS 1.5.1, the firmware no longer pre-establishes PCIe tunnels to
dock devices; the TBT root port (00:07.0) doesn't even have I/O port
space allocated anymore. This means HRR has nothing to destroy, and the
Thunderbolt driver handles all tunnel setup from scratch. We ran 30 reboot
cycles with the full device set (WD22TB4 dock, BenQ monitor, OWC Envoy
Express storage) and did not see a single blank display.

So it seems the root cause was the BIOS establishing tunnels that the
kernel's HRR would then tear down, creating the race condition. The BIOS
vendor fixed it by leaving tunnel establishment to the kernel entirely.

Given this, I think the retry patch is no longer needed for this specific
platform. That said, the underlying race (HRR destroying BIOS tunnels →
DPRX timeout → permanent DP IN removal) could still affect other USB4 v2
platforms where the BIOS does pre-establish tunnels. Would it still be
worth considering either:

  a) increasing the default dprx_timeout, or
  b) at minimum, not permanently removing the DP IN adapter on the first
     DPRX failure?
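
To make option (b) a bit more concrete, here is a toy Python model of the
difference. It is not kernel code and does not reflect the actual driver
structures; the class, function and adapter names are hypothetical and
only illustrate why dropping the DP IN adapter on the first failure makes
the blank display permanent, while keeping it lets a later attempt succeed.

```python
# Toy model (not kernel code) contrasting the two behaviours discussed above.
# All names here are hypothetical and exist only for illustration.

from dataclasses import dataclass, field


@dataclass
class DpResources:
    """Tracks which DP IN adapters may still be used for tunneling."""
    available: set = field(default_factory=lambda: {"dp_in_0"})


def try_tunnel(res, adapter, dprx_ok, drop_on_failure):
    """Attempt a DP tunnel; return True if the display comes up."""
    if adapter not in res.available:
        return False                      # adapter was already given up on
    if dprx_ok:
        return True                       # DPRX negotiation succeeded
    if drop_on_failure:
        res.available.discard(adapter)    # behaviour described in the report
    return False


def boot(drop_on_failure):
    """First attempt fails (graphics not ready); a later attempt would work."""
    res = DpResources()
    try_tunnel(res, "dp_in_0", dprx_ok=False, drop_on_failure=drop_on_failure)
    return try_tunnel(res, "dp_in_0", dprx_ok=True,
                      drop_on_failure=drop_on_failure)


if __name__ == "__main__":
    print("drop on first failure:", boot(drop_on_failure=True))
    print("keep adapter:         ", boot(drop_on_failure=False))
```

In this model, boot(drop_on_failure=True) never brings the display up even
though the second attempt would have succeeded, which is the behaviour
seen on the affected boots; boot(drop_on_failure=False) recovers.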

Thanks again for the guidance.

Best regards,
AceLan Kao.