RE: [PATCH 6/6] habanalabs: increase timeout during reset

From: Omer Shpigelman
Date: Mon Mar 30 2020 - 02:14:19 EST

In reply to: Oded Gabbay: "[PATCH 6/6] habanalabs: increase timeout during reset"
Next in thread: Oded Gabbay: "[PATCH 4/6] habanalabs: unify and improve device cpu init"

On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <oded.gabbay@xxxxxxxxx> wrote:
> When doing training, the DL framework (e.g. TensorFlow) performs hundreds of
> thousands of memory allocations and mappings. If the driver needs to
> perform a hard reset during training, it kills the application and unmaps all
> those memory allocations. Unfortunately, because of the large number of
> mappings, the driver cannot finish unmapping within the current timeout (5 seconds).
> Therefore, increase the timeout to 30 seconds to avoid a situation
> where the driver resets the device with active mappings, which can sometimes
> cause a kernel bug.
>
> BTW, this doesn't mean we will always spend the full 30 seconds, because the reset thread
> checks once every second whether the unmap operation is done.
>
> Signed-off-by: Oded Gabbay <oded.gabbay@xxxxxxxxx>

Reviewed-by: Omer Shpigelman <oshpigelman@xxxxxxxxx>
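
For readers following the thread, the bounded wait described in the quoted commit message (check once per second, give up after the timeout) can be sketched roughly as below. This is a userspace illustration only, not the habanalabs driver code: wait_for_unmap(), countdown_done(), no_sleep() and RESET_WAIT_TIMEOUT_SEC are hypothetical names, and the sleep routine is passed in as a parameter so the loop can be exercised without actually waiting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constant: the 30-second upper bound from the commit message. */
#define RESET_WAIT_TIMEOUT_SEC 30

/* Hypothetical callback types: a completion check and a sleep routine. */
typedef bool (*done_fn)(void *ctx);
typedef void (*sleep_fn)(unsigned int seconds);

/*
 * Poll once per second until done(ctx) reports completion or
 * timeout_sec seconds have been spent sleeping.
 *
 * Returns the number of seconds waited (0..timeout_sec) on success,
 * or -1 if the work was still not done after the full timeout.
 */
static int wait_for_unmap(done_fn done, void *ctx, sleep_fn sleep_sec,
                          unsigned int timeout_sec)
{
    unsigned int waited;

    for (waited = 0; waited < timeout_sec; waited++) {
        if (done(ctx))
            return (int)waited;   /* finished early, stop waiting */
        sleep_sec(1);             /* one-second polling interval */
    }

    /* One last check after the final sleep. */
    return done(ctx) ? (int)timeout_sec : -1;
}

/*
 * Stand-in for "are all mappings gone?": reports completion once the
 * counter behind ctx has reached zero, decrementing it otherwise.
 */
static bool countdown_done(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}

/* Sleep stub so the sketch can run instantly in a test. */
static void no_sleep(unsigned int seconds)
{
    (void)seconds;
}
```

With a counter of 3, wait_for_unmap(countdown_done, &counter, no_sleep, RESET_WAIT_TIMEOUT_SEC) returns 3 rather than consuming the whole budget, which is the point made in the "BTW" paragraph: the larger timeout only costs time when the unmap really is slow.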
