Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX

From: Saeed Mahameed
Date: Thu Nov 05 2020 - 14:23:58 EST


On Thu, 2020-11-05 at 09:07 -0800, Jakub Kicinski wrote:
> On Thu, 5 Nov 2020 13:36:56 +0000 George Cherian wrote:
> > > Now I am a little bit skeptical here: the devlink health reporter
> > > infrastructure was never meant to deal with a dump op only; its
> > > main purpose is to diagnose/dump and recover.
> > >
> > > Especially in your use case, where you only report counters, I
> > > don't believe devlink health dump is the proper interface for this.
> > These are not counters. These are error interrupts raised by HW
> > blocks. The count is provided to show how frequently the errors
> > are seen. Error recovery for some of the blocks happens internally,
> > which is why only the dump op is currently added.
>
> The previous incarnation was printing messages to logs, so I assume
> these errors are expected to be relatively low rate.
>
> The point of using devlink health was that you can generate a netlink
> notification when the error happens. IOW you need some calls to
> devlink_health_report() or such.
>
> At least that's my thinking, others may disagree.

If you report an error without recovering, devlink health will show a
bad device state:

$ ./devlink health
pci/0002:01:00.0:
  reporter npa
    state error error 1 recover 0

So you will need to implement an empty recover op.
And if these events are informational only and don't indicate device
health issues, why report them via devlink health at all?
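To illustrate why an empty recover op changes what `devlink health` shows, here is a hypothetical userspace model. It is not kernel code: `struct model_reporter`, `model_report()`, `empty_recover()` and `show()` are invented names. It sketches, as I understand it, the broad flow of `devlink_health_report()`: every report bumps the error counter and puts the reporter in the error state, and only a successful recover op bumps the recover counter and returns the state to healthy. By default, auto-recovery is enabled only when the reporter supplies a recover op. Dumps, grace periods and netlink notifications are deliberately left out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified model; NOT the kernel's devlink_health_reporter.
 * Only the state and counters printed by `devlink health` are modeled. */
enum health_state { STATE_HEALTHY, STATE_ERROR };

struct model_reporter {
	const char *name;
	/* NULL models a dump-only reporter with no recover op */
	int (*recover)(struct model_reporter *r);
	enum health_state state;
	unsigned long error_count;
	unsigned long recovery_count;
};

/* Simplified report path: count the error and enter the error state.
 * Recovery is attempted only when a recover op exists. */
static int model_report(struct model_reporter *r)
{
	assert(r != NULL);
	r->error_count++;
	r->state = STATE_ERROR;
	if (!r->recover)
		return 0; /* nothing ever clears the error state */
	if (r->recover(r) != 0)
		return -1; /* failed recovery leaves the state as error */
	r->recovery_count++;
	r->state = STATE_HEALTHY;
	return 0;
}

/* "Empty" recover op: the HW block already recovered internally,
 * so there is nothing left to do except report success. */
static int empty_recover(struct model_reporter *r)
{
	(void)r;
	return 0;
}

/* Print the state in roughly the shape of `devlink health` output. */
static void show(const struct model_reporter *r)
{
	printf("reporter %s\n  state %s error %lu recover %lu\n", r->name,
	       r->state == STATE_HEALTHY ? "healthy" : "error",
	       r->error_count, r->recovery_count);
}
```

Reporting once on a reporter built with `.recover = NULL` leaves it at "state error error 1 recover 0", matching the output above. Reporting once on one built with `.recover = empty_recover` leaves it at "state healthy error 1 recover 1".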