On 28.12.24 at 07:32, Shuai Xue wrote:
It's observed that most GPU jobs utilize less than one server, typically
with each GPU being used by an independent job. If a job consumes poisoned
data, a SIGBUS signal is sent to terminate it. Meanwhile, because the
gpu_recovery parameter is set to -1 by default, the amdgpu driver resets
all GPUs on the server. As a result, all jobs are terminated. Setting
gpu_recovery to 0 provides an opportunity to preemptively evacuate other
jobs and subsequently manually reset all GPUs.
*BIG* NAK to this whole approach!
Setting gpu_recovery to 0 in a production environment is *NOT* supported at all and should never be done.
This is a pure debugging feature for JTAG debugging and can result in random crashes and/or compromised data.
Please don't tell me that you tried to use this in a production environment.
Regards,
Christian.
XID 94: Contained ECC error
XID 95: Uncontained ECC error
For Xid 94, these errors are contained to one application, and the application
that encountered this error must be restarted. All other applications running
at the time of the Xid are unaffected. It is recommended to reset the GPU when
convenient. Applications can continue to be run until the reset can be
performed.
For Xid 95, these errors affect multiple applications, and the affected GPU
must be reset before applications can restart.
https://docs.nvidia.com/deploy/xid-errors/
However, this parameter is
read-only, necessitating correct settings at driver load. And reloading the
GPU driver in a production environment can be challenging due to reference
counts maintained by various monitoring services.
Set the gpu_recovery parameter with read-write permission to enable runtime
modification. This enables users to dynamically manage GPU recovery
mechanisms based on real-time requirements or conditions.
Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
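For readers following the thread, the value check that the proposed setter performs can be tried out in userspace. The following is only a minimal sketch under stated assumptions: parse_gpu_recovery() is a hypothetical helper name that does not appear in the patch, and strtol() stands in for the kernel's kstrtol(). It illustrates the accept/reject rule (-1, 0 or 1) and nothing else; it is not driver code and says nothing about whether changing the parameter at runtime is safe.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the check in the proposed setter.
 * Returns 0 and stores the value in *out when buf holds -1, 0 or 1,
 * otherwise returns -EINVAL and leaves *out untouched.
 */
static int parse_gpu_recovery(const char *buf, int *out)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -EINVAL;	/* out of range, or no digits at all */

	/* Tolerate a single trailing newline, as produced by "echo". */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;	/* trailing garbage after the number */

	if (val != -1 && val != 0 && val != 1)
		return -EINVAL;

	*out = (int)val;
	return 0;
}
```

Note that strtol() skips leading whitespace, which the kernel's kstrtol() does not, so this sketch is slightly more permissive than the in-kernel parser.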
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902e1cec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
+static int amdgpu_set_gpu_recovery(const char *buf,
+ const struct kernel_param *kp)
+{
+ long val;
+ int ret;
+
+ ret = kstrtol(buf, 10, &val);
+ if (ret < 0)
+ return ret;
+
+ if (val != 1 && val != 0 && val != -1) {
+ pr_err("Invalid value for gpu_recovery: %ld, expected 0,1,-1\n",
+ val);
+ return -EINVAL;
+ }
+
+ return param_set_int(buf, kp);
+}
+
+static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
+ .set = amdgpu_set_gpu_recovery,
+ .get = param_get_int,
+};
+
/**
* DOC: gpu_recovery (int)
* Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
*/
MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
-module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
/**
* DOC: emu_mode (int)