[PATCH net] pds_core: Fix pdsc_check_pci_health function to print warning

From: Brett Creeley
Date: Thu Mar 21 2024 - 02:40:22 EST


When the driver notices fw_status == 0xff it tries to perform a PCI
reset on itself via pci_reset_function() in the context of the driver's
health thread. However, pdsc_reset_prepare calls
pdsc_stop_health_thread(), which attempts to stop/flush the health
thread. This results in a deadlock because the stop/flush will never
complete since the driver called pci_reset_function() from the health
thread context. Fix this by changing the pdsc_check_pci_health_function()
to print a dev_warn() once every fw_down/fw_up cycle and requiring the
user to perform a reset on the device via sysfs's reset interface,
reloading the driver, rebinding the device, etc.

Unloading the driver in the fw_down/dead state uncovered another issue,
which can be seen in the following trace:

WARNING: CPU: 51 PID: 6914 at kernel/workqueue.c:1450 __queue_work+0x358/0x440
[...]
RIP: 0010:__queue_work+0x358/0x440
[...]
Call Trace:
<TASK>
? __warn+0x85/0x140
? __queue_work+0x358/0x440
? report_bug+0xfc/0x1e0
? handle_bug+0x3f/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? __queue_work+0x358/0x440
queue_work_on+0x28/0x30
pdsc_devcmd_locked+0x96/0xe0 [pds_core]
pdsc_devcmd_reset+0x71/0xb0 [pds_core]
pdsc_teardown+0x51/0xe0 [pds_core]
pdsc_remove+0x106/0x200 [pds_core]
pci_device_remove+0x37/0xc0
device_release_driver_internal+0xae/0x140
driver_detach+0x48/0x90
bus_remove_driver+0x6d/0xf0
pci_unregister_driver+0x2e/0xa0
pdsc_cleanup_module+0x10/0x780 [pds_core]
__x64_sys_delete_module+0x142/0x2b0
? syscall_trace_enter.isra.18+0x126/0x1a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fbd9d03a14b
[...]

Fix this by preventing the devcmd reset if the FW is not running.

Fixes: d9407ff11809 ("pds_core: Prevent health thread from running during reset/remove")
Reviewed-by: Shannon Nelson <shannon.nelson@xxxxxxx>
Signed-off-by: Brett Creeley <brett.creeley@xxxxxxx>
---
drivers/net/ethernet/amd/pds_core/core.c | 9 ++++++++-
drivers/net/ethernet/amd/pds_core/core.h | 1 +
drivers/net/ethernet/amd/pds_core/dev.c | 3 +++
3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 9662ee72814c..8e5e3797cf0c 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -587,6 +587,9 @@ void pdsc_fw_up(struct pdsc *pdsc)
DEVLINK_HEALTH_REPORTER_STATE_HEALTHY);
pdsc_notify(PDS_EVENT_RESET, &reset_event);

+ /* Allow for fw_status == 0xff to print another warning */
+ pdsc->bad_pci_warned = false;
+
return;

err_out:
@@ -607,7 +610,11 @@ static void pdsc_check_pci_health(struct pdsc *pdsc)
if (fw_status != PDS_RC_BAD_PCI)
return;

- pci_reset_function(pdsc->pdev);
+ if (!pdsc->bad_pci_warned) {
+ dev_warn(pdsc->dev, "fw not reachable due to failed PCI connection, fw_status = 0x%x\n",
+ fw_status);
+ pdsc->bad_pci_warned = true;
+ }
}

void pdsc_health_thread(struct work_struct *work)
diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index 92d7657dd614..10979118be00 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -165,6 +165,7 @@ struct pdsc {
unsigned long state;
u8 fw_status;
u8 fw_generation;
+ bool bad_pci_warned;
unsigned long last_fw_time;
u32 last_hb;
struct timer_list wdtimer;
diff --git a/drivers/net/ethernet/amd/pds_core/dev.c b/drivers/net/ethernet/amd/pds_core/dev.c
index e494e1298dc9..495ef4ef8c10 100644
--- a/drivers/net/ethernet/amd/pds_core/dev.c
+++ b/drivers/net/ethernet/amd/pds_core/dev.c
@@ -229,6 +229,9 @@ int pdsc_devcmd_reset(struct pdsc *pdsc)
.reset.opcode = PDS_CORE_CMD_RESET,
};

+ if (!pdsc_is_fw_running(pdsc))
+ return 0;
+
return pdsc_devcmd(pdsc, &cmd, &comp, pdsc->devcmd_timeout);
}

--
2.17.1