[PATCH 3/4] habanalabs: complete user context cleanup before hard reset

From: Oded Gabbay
Date: Sat Mar 16 2019 - 16:11:14 EST


From: Omer Shpigelman <oshpigelman@xxxxxxxxx>

This patch fixes a bug which led to a crash during hard reset flow.
Before a hard reset is executed, we wait a few seconds for the user
context cleanup to complete.
If it wasn't completed, we kill the user process and move on to the reset
flow.
Upon killing the user process, the context cleanup flow begins and may
take a while due to MMU unmaps.
Meanwhile, in the driver reset flow, we change the PCI DRAM bar location
which can interfere with the MMU that uses the bar.
If the context cleanup flow didn't finish quickly, a crash may occur due
to PCI DRAM bar mislocation during the MMU unmap.
Hence adding a wait between killing the user process and the start of the
reset flow.

Signed-off-by: Omer Shpigelman <oshpigelman@xxxxxxxxx>
Signed-off-by: Oded Gabbay <oded.gabbay@xxxxxxxxx>
---
drivers/misc/habanalabs/device.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index de46aa6ed154..93d67983ddba 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -11,6 +11,8 @@
#include <linux/sched/signal.h>
#include <linux/hwmon.h>

+#define HL_PLDM_PENDING_RESET_PER_SEC (HL_PENDING_RESET_PER_SEC * 10)
+
bool hl_device_disabled_or_in_reset(struct hl_device *hdev)
{
if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
@@ -462,9 +464,16 @@ static void hl_device_hard_reset_pending(struct work_struct *work)
struct hl_device_reset_work *device_reset_work =
container_of(work, struct hl_device_reset_work, reset_work);
struct hl_device *hdev = device_reset_work->hdev;
- u16 pending_cnt = HL_PENDING_RESET_PER_SEC;
+ u16 pending_total, pending_cnt;
struct task_struct *task = NULL;

+ if (hdev->pldm)
+ pending_total = HL_PLDM_PENDING_RESET_PER_SEC;
+ else
+ pending_total = HL_PENDING_RESET_PER_SEC;
+
+ pending_cnt = pending_total;
+
/* Flush all processes that are inside hl_open */
mutex_lock(&hdev->fd_open_cnt_lock);

@@ -489,6 +498,19 @@ static void hl_device_hard_reset_pending(struct work_struct *work)
}
}

+ pending_cnt = pending_total;
+
+ while ((atomic_read(&hdev->fd_open_cnt)) && (pending_cnt)) {
+
+ pending_cnt--;
+
+ ssleep(1);
+ }
+
+ if (atomic_read(&hdev->fd_open_cnt))
+ dev_crit(hdev->dev,
+ "Going to hard reset with open user contexts\n");
+
mutex_unlock(&hdev->fd_open_cnt_lock);

hl_device_reset(hdev, true, true);
--
2.17.1