Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX

From: Saeed Mahameed
Date: Thu Nov 05 2020 - 18:52:37 EST


On Thu, 2020-11-05 at 12:42 -0800, Jakub Kicinski wrote:
> On Thu, 05 Nov 2020 11:23:54 -0800 Saeed Mahameed wrote:
> > If you report an error without recovering, devlink health will
> > report a
> > bad device state
> >
> > $ ./devlink health
> > pci/0002:01:00.0:
> > reporter npa
> > state error error 1 recover 0
>
> Actually, the counter in the driver is unnecessary, right? Devlink
> counts errors.
>

if you mean error and recover counters, then yes. they are managed by
devlink health

every call to dl-health-report will do:

devlink_health_report(reporter, err_ctx, msg)
{
reproter.error++;

devlink_trigger_event(reporter, msg);

reporter.dump(err_ctx, msg);
reporter.diag(err_ctx);

if (!reporter.recover(err_ctx))
reporter.recover++;
}

so dl-health reports without a recover op will confuse the user if user
sees error count > recover count.

error count should only be grater than recover count when recover
procedure fails which now will indicate the device is not in a healthy
state.

also i want to clarify one small note about devlink dump.

devlink health dump semantics:
on devlink health dump, the devlink health will check if previous dump
exists and will just return it without actually calling the driver, if
not then it will call the driver to perform a new dump and will cache
it.

user has to explicitly clear the devlink health dump of that reporter
in order to allow for newer dump to get generated.

this is done this way because we want the driver to store the dump of
the previously reported errors at the moment the erorrs are reported by
driver, so when a user issue a dump command the dump of the previous
error will be reported to user form memory without the need to access
driver/hw who might be in a bad state.

so this is why using devlink dump for reporting counters doesn't really
work, it will only report the first time the counters are accessed via
devlink health dump, after that it will report the same cached values
over and over until the user clears it up.


> > So you will need to implement an empty recover op.
> > so if these events are informational only and they don't indicate
> > device health issues, why would you report them via devlink health
> > ?
>
> I see devlink health reporters a way of collecting errors reports
> which
> for the most part are just shared with the vendor. IOW firmware (or
> hardware) bugs.
>
> Obviously as you say without recover and additional context in the
> report the value is quite diminished. But _if_ these are indeed
> "report
> me to the vendor" kind of events then at least they should use our
> current mechanics for such reports - which is dl-health.
>
> Without knowing what these events are it's quite hard to tell if
> devlink health is an overkill or counter is sufficient.
>
> Either way - printing these to the logs is definitely the worst
> choice
> :)

Sure, I don't mind using devlink health for dump only, I don't really
have strong feelings against this, they can always extend it in the
future.

it just doesn't make sense to me to have it mainly used for dumping
counters and without using devlik helath utilities, like events,
reports and recover.

so maybe Sunil et al. could polish this patchset and provide more
devlink health support, like diagnose for these errors, dump HW
information and contexts related to these errors so they could debug
root causes, etc ..
Then the use for dl health in this series can be truly justified.