Re: [PATCH v2 0/2] thunderbolt: Fix S4 resume incongruities

From: Mario Limonciello

Date: Tue May 05 2026 - 17:21:32 EST


On 1/29/26 17:13, Katiyar, Pooja wrote:
Hello,

On Mon, Jan 19, 2026 at 02:13:50PM -0800, Kenneth Crudup wrote:

On 1/19/26 11:59, Mario Limonciello (AMD) (kernel.org) wrote:

On 1/17/2026 10:57 AM, Katiyar, Pooja wrote:

I have confirmation the hack patch does help the issue for us too.

If your patch doesn't work another logical solution could be to destroy
all the tunnels as part of the PM freeze callback (not just the DP
resources).  Maybe even unify the suspend and freeze codepaths for more
opportunities for code reuse?


Thanks for confirming the hack patch helps!

We are actually working on a solution that releases the DP resources and
suspends the switch as part of the freeze sequence. This way the hibernation
image that is stored doesn't contain any active tunnels, and during resume
we get a DP hotplug notification for a new tunnel, similar to S5. So far
this patch is working fine but is under review.


Thanks.  If you want early testing from us too before you're ready to
post publicly feel free to ping it offline to me too.

I'd like to get a CC: on that, too.

I've been testing that hack patch and will test further later tonight.

The issue I'm trying to chase down (and not sure if any of this will
help with this, I wonder if it's really BIOS/EC related) is often times
that after a suspend (or hibernate, but I use "suspend then hibernate",
which I think does both and chooses which to use upon resume) and then
connect to a different dock (or setup) from the one I'd suspended with,
sometimes I have to unplug/replug my TB cable, otherwise I either get no
recognition of my new display setup (and sometimes TB devices) or it'll
try and use the same monitor resolution of the previously-connected
monitor (as if the TB subsystem doesn't recognize things have changed).


Below is the patch series that addresses mentioned issue. There are two
patches in this series. The series takes care of releasing the DP resources
as part of freeze call before the hibernation image is created. You can test
it for your issues and let us know if it helps.

Please note that these changes are still under internal review and are
subject to change.

From: Pooja Katiyar <pooja.katiyar@xxxxxxxxx>
Date: Thu, 22 Jan 2026 11:57:07 -0800
Subject: [RFC PATCH 1/2] thunderbolt: Add helper functions for suspend/resume
operations

Extract common resume logic from tb_resume_noirq() into
tb_recover_tunnels() helper function to consolidate switch resume,
tunnel discovery/teardown, tunnel recreation, and topology
reinitialization logic.

Introduce __nhi_suspend_ops() and __nhi_resume_noirq() helper functions
to consolidate common suspend/resume logic used by multiple PM callbacks.

This is a preparatory cleanup for hibernation improvements.

Co-developed-by: Rene Sapiens <rene.sapiens@xxxxxxxxxxxxxxx>
Signed-off-by: Rene Sapiens <rene.sapiens@xxxxxxxxxxxxxxx>
Signed-off-by: Pooja Katiyar <pooja.katiyar@xxxxxxxxx>
---
drivers/thunderbolt/nhi.c | 46 ++++++++++++++++++++++++++-------------
drivers/thunderbolt/tb.c | 34 ++++++++++++++++++++---------
2 files changed, 55 insertions(+), 25 deletions(-)

diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index 6d0c9d37c55d..5b84223937fb 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -971,7 +971,14 @@ static irqreturn_t nhi_msi(int irq, void *data)
return IRQ_HANDLED;
}
-static int __nhi_suspend_noirq(struct device *dev, bool wakeup)
+static int __nhi_suspend_ops(struct tb_nhi *nhi, bool wakeup)
+{
+ if (nhi->ops && nhi->ops->suspend_noirq)
+ return nhi->ops->suspend_noirq(nhi, wakeup);
+ return 0;
+}
+
+static int nhi_suspend_noirq(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct tb *tb = pci_get_drvdata(pdev);
@@ -982,18 +989,7 @@ static int __nhi_suspend_noirq(struct device *dev, bool wakeup)
if (ret)
return ret;
- if (nhi->ops && nhi->ops->suspend_noirq) {
- ret = nhi->ops->suspend_noirq(tb->nhi, wakeup);
- if (ret)
- return ret;
- }
-
- return 0;
-}
-
-static int nhi_suspend_noirq(struct device *dev)
-{
- return __nhi_suspend_noirq(dev, device_may_wakeup(dev));
+ return __nhi_suspend_ops(nhi, device_may_wakeup(dev));
}
static int nhi_freeze_noirq(struct device *dev)
@@ -1029,10 +1025,17 @@ static bool nhi_wake_supported(struct pci_dev *pdev)
static int nhi_poweroff_noirq(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
+ struct tb *tb = pci_get_drvdata(pdev);
+ struct tb_nhi *nhi = tb->nhi;
bool wakeup;
+ int ret;
+
+ ret = tb_domain_suspend_noirq(tb);
+ if (ret)
+ return ret;
wakeup = device_may_wakeup(dev) && nhi_wake_supported(pdev);
- return __nhi_suspend_noirq(dev, wakeup);
+ return __nhi_suspend_ops(nhi, wakeup);
}
static void nhi_enable_int_throttling(struct tb_nhi *nhi)
@@ -1051,7 +1054,7 @@ static void nhi_enable_int_throttling(struct tb_nhi *nhi)
}
}
-static int nhi_resume_noirq(struct device *dev)
+static int __nhi_resume_noirq(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct tb *tb = pci_get_drvdata(pdev);
@@ -1074,6 +1077,19 @@ static int nhi_resume_noirq(struct device *dev)
nhi_enable_int_throttling(tb->nhi);
}
+ return 0;
+}
+
+static int nhi_resume_noirq(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct tb *tb = pci_get_drvdata(pdev);
+ int ret;
+
+ ret = __nhi_resume_noirq(dev);
+ if (ret)
+ return ret;
+
return tb_domain_resume_noirq(tb);
}
diff --git a/drivers/thunderbolt/tb.c b/drivers/thunderbolt/tb.c
index 4f5f1dfc0fbf..69015def6cd1 100644
--- a/drivers/thunderbolt/tb.c
+++ b/drivers/thunderbolt/tb.c
@@ -3110,22 +3110,21 @@ static void tb_restore_children(struct tb_switch *sw)
}
}
-static int tb_resume_noirq(struct tb *tb)
+/**
+ * tb_recover_tunnels() - Common resume logic for suspend and hibernate
+ * @tb: Domain to resume
+ *
+ * Common recovery logic shared between suspend resume and hibernate restore.
+ * Handles switch resume, tunnel discovery/teardown, tunnel recreation, and
+ * topology reinitialization.
+ */
+static void tb_recover_tunnels(struct tb *tb)
{
struct tb_cm *tcm = tb_priv(tb);
struct tb_tunnel *tunnel, *n;
unsigned int usb3_delay = 0;
LIST_HEAD(tunnels);
- tb_dbg(tb, "resuming...\n");
-
- /*
- * For non-USB4 hosts (Apple systems) remove any PCIe devices
- * the firmware might have setup.
- */
- if (!tb_switch_is_usb4(tb->root_switch))
- tb_switch_reset(tb->root_switch);
-
tb_switch_resume(tb->root_switch, false);
tb_free_invalid_tunnels(tb);
tb_free_unplugged_children(tb->root_switch);
@@ -3166,6 +3165,21 @@ static int tb_resume_noirq(struct tb *tb)
tb_switch_enter_redrive(tb->root_switch);
/* Allow tb_handle_hotplug to progress events */
tcm->hotplug_active = true;
+}
+
+static int tb_resume_noirq(struct tb *tb)
+{
+ tb_dbg(tb, "resuming...\n");
+
+ /*
+ * For non-USB4 hosts (Apple systems) remove any PCIe devices
+ * the firmware might have setup.
+ */
+ if (!tb_switch_is_usb4(tb->root_switch))
+ tb_switch_reset(tb->root_switch);
+
+ tb_recover_tunnels(tb);
+
tb_dbg(tb, "resume finished\n");
return 0;

Sorry for the insanely big delay, but we were getting totally unexpected results with your patch series. It was helping; but then even with it taken off it was also helping :P.

It turned out the issue that prompted my series got fixed in another way. It took some time to figure out why. It was fixed by this change:

15b1d7b77e9836ff4184093163174a1ef28bbdd7

That is basically because of the order of events the kernel does we were having a link training failure. My series (and your patches) did change the order of events to adjust it. But instead we are catching this sequence in the GPU driver and handling it more cleanly.