Re: [PATCH] net/mlx5: Added cond_resched() to crdump collection

From: Mohamed Khalfella
Date: Fri Aug 23 2024 - 13:41:35 EST


On 2024-08-23 08:16:32 +0300, Moshe Shemesh wrote:
>
>
> On 8/23/2024 7:08 AM, Przemek Kitszel wrote:
> >
> > On 8/22/24 19:08, Mohamed Khalfella wrote:
> >> On 2024-08-22 09:40:21 +0300, Moshe Shemesh wrote:
> >>>
> >>>
> >>> On 8/21/2024 1:27 AM, Mohamed Khalfella wrote:
> >>>>
> >>>> On 2024-08-20 12:09:37 +0200, Przemek Kitszel wrote:
> >>>>> On 8/19/24 23:42, Mohamed Khalfella wrote:
> >
> >
> >>>>
> >>>> Putting a cond_resched() every 16 register reads, similar to
> >>>> mlx5_vsc_wait_on_flag(), should be okay. With the numbers above, this
> >>>> will result in cond_resched() every ~0.56ms, which is okay IMO.
> >>>
> >>> Sorry for the late response, I just got back from vacation.
> >>> All your measures looks right.
> >>> crdump is the devlink health dump of mlx5 FW fatal health reporter.
> >>> In the common case since auto-dump and auto-recover are default for this
> >>> health reporter, the crdump will be collected on fatal error of the mlx5
> >>> device and the recovery flow waits for it and run right after crdump
> >>> finished.
> >>> I agree with adding cond_resched(), but I would reduce the frequency,
> >>> like once in 1024 iterations of register read.
> >>> mlx5_vsc_wait_on_flag() is a bit different case as the usleep there is
> >>> after 16 retries waiting for the value to change.
> >>> Thanks.
> >>
> >> Thanks for taking a look. Once in every 1024 iterations approximately
> >> translates to 35284.4ns * 1024 ~= 36.1ms, which is relatively long time
> >> IMO. How about any power-of-two <= 128 (~4.51ms)?
>
> OK
> >
> > Such tune-up would matter for interactive use of the machine with very
> > little cores, is that the case? Otherwise I see no point [to make it
> > overall a little slower, as that is the tradeoff].
>
> Yes, as I see it, the point here is host with very few cores.

It should make a difference for systems with few cores. Add to that the
numbers above is what I was able to get from the lab. It has been seen
in the field that collecting crdump takes more than 5 seconds causing
issues. If this makes sense I will submit v2 with the updated commit
message and cond_resched() every 128 iterations of register read.