Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX

From: George Cherian
Date: Thu Nov 05 2020 - 08:37:09 EST


Hi Saeed,

Thanks for the review.

> -----Original Message-----
> From: Saeed Mahameed <saeed@xxxxxxxxxx>
> Sent: Thursday, November 5, 2020 10:39 AM
> To: George Cherian <gcherian@xxxxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Jiri Pirko <jiri@xxxxxxxxxx>
> Cc: kuba@xxxxxxxxxx; davem@xxxxxxxxxxxxx; Sunil Kovvuri Goutham
> <sgoutham@xxxxxxxxxxx>; Linu Cherian <lcherian@xxxxxxxxxxx>;
> Geethasowjanya Akula <gakula@xxxxxxxxxxx>; masahiroy@xxxxxxxxxx;
> willemdebruijn.kernel@xxxxxxxxx
> Subject: Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health
> reporters for NIX
>
> On Wed, 2020-11-04 at 17:57 +0530, George Cherian wrote:
> > Add health reporters for RVU NPA block.
> ^^^ NIX ?
>
Yes, it's NIX.

> Cc: Jiri
>
> Anyway, could you please spare some words on what is NPA and what is
> NIX?
>
> Regarding the reporters names, all drivers register well known generic names
> such as (fw,hw,rx,tx), I don't know if it is a good idea to use vendor specific
> names, if you are reporting for hw/fw units then just use "hw" or "fw" as the
> reporter name and append the unit NPA/NIX to the counter/error names.
Okay. These are hw units, I will rename them as hw_npa/hw_nix.
>
> > Only reporter dump is supported.
> >
> > Output:
> > # ./devlink health
> > pci/0002:01:00.0:
> > reporter npa
> > state healthy error 0 recover 0
> > reporter nix
> > state healthy error 0 recover 0
> > # ./devlink health dump show pci/0002:01:00.0 reporter nix
> > NIX_AF_GENERAL:
> > Memory Fault on NIX_AQ_INST_S read: 0
> > Memory Fault on NIX_AQ_RES_S write: 0
> > AQ Doorbell error: 0
> > Rx on unmapped PF_FUNC: 0
> > Rx multicast replication error: 0
> > Memory fault on NIX_RX_MCE_S read: 0
> > Memory fault on multicast WQE read: 0
> > Memory fault on mirror WQE read: 0
> > Memory fault on mirror pkt write: 0
> > Memory fault on multicast pkt write: 0
> > NIX_AF_RAS:
> > Poisoned data on NIX_AQ_INST_S read: 0
> > Poisoned data on NIX_AQ_RES_S write: 0
> > Poisoned data on HW context read: 0
> > Poisoned data on packet read from mirror buffer: 0
> > Poisoned data on packet read from mcast buffer: 0
> > Poisoned data on WQE read from mirror buffer: 0
> > Poisoned data on WQE read from multicast buffer: 0
> > Poisoned data on NIX_RX_MCE_S read: 0
> > NIX_AF_RVU:
> > Unmap Slot Error: 0
> >
>
> Now i am a little bit skeptic here, devlink health reporter infrastructure was
> never meant to deal with dump op only, the main purpose is to
> diagnose/dump and recover.
>
> especially in your use case where you only report counters, i don't believe
> devlink health dump is a proper interface for this.
These are not counters. These are error interrupts raised by HW blocks.
The count is provided to understand on how frequently the errors are seen.
Error recovery for some of the blocks happen internally. That is the reason,
Currently only dump op is added.
> Many of these counters if not most are data path packet based and maybe
> they should belong to ethtool.

Regards,
-George