Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA
From: George Cherian
Date: Mon Nov 30 2020 - 22:37:19 EST
Hi Jakub,
> -----Original Message-----
> From: Jakub Kicinski <kuba@xxxxxxxxxx>
> Sent: Tuesday, December 1, 2020 7:59 AM
> To: George Cherian <gcherian@xxxxxxxxxxx>
> Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; Sunil Kovvuri Goutham <sgoutham@xxxxxxxxxxx>;
> Linu Cherian <lcherian@xxxxxxxxxxx>; Geethasowjanya Akula
> <gakula@xxxxxxxxxxx>; masahiroy@xxxxxxxxxx;
> willemdebruijn.kernel@xxxxxxxxx; saeed@xxxxxxxxxx; jiri@xxxxxxxxxxx
> Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
>
> On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > Add health reporters for RVU NPA block.
> > NPA Health reporters handle following HW event groups
> > - GENERAL events
> > - ERROR events
> > - RAS events
> > - RVU event
> > An event counter per event is maintained in SW.
> >
> > Output:
> > # devlink health
> > pci/0002:01:00.0:
> > reporter hw_npa
> > state healthy error 0 recover 0
> > # devlink health dump show pci/0002:01:00.0 reporter hw_npa
> > NPA_AF_GENERAL:
> > Unmap PF Error: 0
> > NIX:
> > 0: free disabled RX: 0 free disabled TX: 0
> > 1: free disabled RX: 0 free disabled TX: 0
> > Free Disabled for SSO: 0
> > Free Disabled for TIM: 0
> > Free Disabled for DPI: 0
> > Free Disabled for AURA: 0
> > Alloc Disabled for Resvd: 0
> > NPA_AF_ERR:
> > Memory Fault on NPA_AQ_INST_S read: 0
> > Memory Fault on NPA_AQ_RES_S write: 0
> > AQ Doorbell Error: 0
> > Poisoned data on NPA_AQ_INST_S read: 0
> > Poisoned data on NPA_AQ_RES_S write: 0
> > Poisoned data on HW context read: 0
> > NPA_AF_RVU:
> > Unmap Slot Error: 0
>
> You seem to have missed the feedback Saeed and I gave you on v2.
>
> Did you test this with the errors actually triggering? Devlink should store only
> one dump, are the counters not going to get out of sync unless something
> clears the dump every time it triggers?

Yes, this was tested by injecting errors through the devlink health test interface.
The dump is generated automatically, and the counters do get out of sync
when errors occur continuously.
That shouldn't be much of an issue, since the user can manually clear the dump and
re-dump to get the exact state of the counters at any point in time.
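For reference, the manual clear and re-dump could look roughly like the sketch below. The device handle and reporter name are the ones from the output quoted above; the `redump` helper and the `DEVLINK` override are only illustrative names, not part of the patch or of devlink itself.

```shell
# Hypothetical helper: drop the stored dump, then request a fresh one so
# the output reflects the counters at the time of the call.
# DEVLINK is an assumed override for the binary; it defaults to "devlink".
redump() {
    dev="$1"        # e.g. pci/0002:01:00.0
    reporter="$2"   # e.g. hw_npa
    "${DEVLINK:-devlink}" health dump clear "$dev" reporter "$reporter" &&
    "${DEVLINK:-devlink}" health dump show "$dev" reporter "$reporter"
}

# Example call (needs the device to be present):
#   redump pci/0002:01:00.0 hw_npa
```

The show step only runs if the clear succeeded, so a stale dump is never printed as if it were fresh.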
Regards,
-George