RE: [PATCH 6/6] habanalabs: increase timeout during reset

From: Omer Shpigelman
Date: Mon Mar 30 2020 - 02:14:19 EST


On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <oded.gabbay@xxxxxxxxx> wrote:
> When doing training, the DL framework (e.g. TensorFlow) performs hundreds of
> thousands of memory allocations and mappings. If the driver needs to
> perform a hard reset during training, it kills the application and unmaps all
> of those memory allocations. Unfortunately, because of the large number of
> mappings, the driver can't finish unmapping within the current timeout (5 seconds).
> Therefore, increase the timeout significantly, to 30 seconds, to avoid a situation
> where the driver resets the device with active mappings, which can sometimes
> cause a kernel bug.
>
> BTW, this doesn't mean we will spend the full 30 seconds, because the reset thread
> checks once per second whether the unmap operation is done.
>
> Signed-off-by: Oded Gabbay <oded.gabbay@xxxxxxxxx>

Reviewed-by: Omer Shpigelman <oshpigelman@xxxxxxxxx>
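
For readers following the thread, the wait-with-timeout behaviour described in
the commit message can be sketched roughly as below. This is a user-space
illustration only, not the habanalabs driver code: struct fake_device,
unmap_done(), one_second_passes() and wait_for_unmap() are hypothetical names,
and the one-second sleep is replaced by a simulated tick so the sketch runs
instantly. Only the 5 and 30 second values come from the patch description.

```c
#include <assert.h>
#include <stdbool.h>

/* Timeout values taken from the patch description. */
#define OLD_RESET_WAIT_SEC 5
#define NEW_RESET_WAIT_SEC 30

/* Hypothetical stand-in for driver state: mappings still to be torn down. */
struct fake_device {
	unsigned int mappings_left;
	unsigned int unmapped_per_sec; /* assumed teardown rate, illustration only */
};

static bool unmap_done(const struct fake_device *dev)
{
	return dev->mappings_left == 0;
}

/* Simulates one second of unmap progress instead of actually sleeping. */
static void one_second_passes(struct fake_device *dev)
{
	if (dev->mappings_left > dev->unmapped_per_sec)
		dev->mappings_left -= dev->unmapped_per_sec;
	else
		dev->mappings_left = 0;
}

/*
 * Poll once per (simulated) second until the unmap is done or the timeout
 * expires. Returns the number of seconds waited, or -1 on timeout.
 * The loop exits as soon as the work is done, so a larger timeout only
 * costs time when the teardown really needs it.
 */
static int wait_for_unmap(struct fake_device *dev, int timeout_sec)
{
	int waited = 0;

	assert(timeout_sec > 0);

	while (!unmap_done(dev)) {
		if (waited >= timeout_sec)
			return -1;
		one_second_passes(dev);
		waited++;
	}
	return waited;
}
```

With an assumed 300000 mappings torn down at 50000 per second, the 5 second
limit expires before the unmap completes, while the 30 second limit returns
after 6 seconds instead of waiting out the full timeout.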