RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA

From: George Cherian
Date: Tue Dec 01 2020 - 00:19:39 EST


Jakub,

> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 9:06 AM
> To: Jakub Kicinski <kuba@xxxxxxxxxx>
> Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; Sunil Kovvuri Goutham <sgoutham@xxxxxxxxxxx>;
> Linu Cherian <lcherian@xxxxxxxxxxx>; Geethasowjanya Akula
> <gakula@xxxxxxxxxxx>; masahiroy@xxxxxxxxxx;
> willemdebruijn.kernel@xxxxxxxxx; saeed@xxxxxxxxxx; jiri@xxxxxxxxxxx
> Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
>
> Hi Jakub,
>
> > -----Original Message-----
> > From: Jakub Kicinski <kuba@xxxxxxxxxx>
> > Sent: Tuesday, December 1, 2020 7:59 AM
> > To: George Cherian <gcherian@xxxxxxxxxxx>
> > Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > davem@xxxxxxxxxxxxx; Sunil Kovvuri Goutham <sgoutham@xxxxxxxxxxx>;
> > Linu Cherian <lcherian@xxxxxxxxxxx>; Geethasowjanya Akula
> > <gakula@xxxxxxxxxxx>; masahiroy@xxxxxxxxxx;
> > willemdebruijn.kernel@xxxxxxxxx; saeed@xxxxxxxxxx; jiri@xxxxxxxxxxx
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > Add health reporters for RVU NPA block.
> > > NPA Health reporters handle following HW event groups
> > > - GENERAL events
> > > - ERROR events
> > > - RAS events
> > > - RVU event
> > > An event counter per event is maintained in SW.
> > >
> > > Output:
> > > # devlink health
> > > pci/0002:01:00.0:
> > > reporter hw_npa
> > > state healthy error 0 recover 0
> > > # devlink health dump show pci/0002:01:00.0 reporter hw_npa
> > > NPA_AF_GENERAL:
> > > Unmap PF Error: 0
> > > NIX:
> > > 0: free disabled RX: 0 free disabled TX: 0
> > > 1: free disabled RX: 0 free disabled TX: 0
> > > Free Disabled for SSO: 0
> > > Free Disabled for TIM: 0
> > > Free Disabled for DPI: 0
> > > Free Disabled for AURA: 0
> > > Alloc Disabled for Resvd: 0
> > > NPA_AF_ERR:
> > > Memory Fault on NPA_AQ_INST_S read: 0
> > > Memory Fault on NPA_AQ_RES_S write: 0
> > > AQ Doorbell Error: 0
> > > Poisoned data on NPA_AQ_INST_S read: 0
> > > Poisoned data on NPA_AQ_RES_S write: 0
> > > Poisoned data on HW context read: 0
> > > NPA_AF_RVU:
> > > Unmap Slot Error: 0
> >
> > You seem to have missed the feedback Saeed and I gave you on v2.
> >
> > Did you test this with the errors actually triggering? Devlink should
> > store only
> Yes, this was tested by injecting errors through the devlink health test
> interface.
> The dump gets generated automatically, and the counters do get out of sync
> in case of continuous errors.
> That wouldn't be much of an issue, as the user could manually trigger a dump
> clear and re-dump the counters to get their exact status at any point in
> time.

Now that the recover op has been added, the devlink error and recover counters
will be correct. The internal per-event counters are needed only to show, within
a specific reporter, how many times each event occurred.

The following log snippet shows the devlink health test run on the hw_nix reporter.
# for i in `seq 1 33` ; do devlink health test pci/0002:01:00.0 reporter hw_nix; done
// Injects 33 errors: 16 NIX_AF_RVU and 17 each of NIX_AF_RAS and NIX_AF_GENERAL
# devlink health
pci/0002:01:00.0:
reporter hw_npa
state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
reporter hw_nix
state healthy error 250 recover 250 last_dump_date 1970-01-01 last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
# devlink health dump show pci/0002:01:00.0 reporter hw_nix
NIX_AF_GENERAL:
Memory Fault on NIX_AQ_INST_S read: 1
Memory Fault on NIX_AQ_RES_S write: 1
AQ Doorbell error: 1
Rx on unmapped PF_FUNC: 1
Rx multicast replication error: 1
Memory fault on NIX_RX_MCE_S read: 1
Memory fault on multicast WQE read: 1
Memory fault on mirror WQE read: 1
Memory fault on mirror pkt write: 1
Memory fault on multicast pkt write: 1
NIX_AF_RAS:
Poisoned data on NIX_AQ_INST_S read: 1
Poisoned data on NIX_AQ_RES_S write: 1
Poisoned data on HW context read: 1
Poisoned data on packet read from mirror buffer: 1
Poisoned data on packet read from mcast buffer: 1
Poisoned data on WQE read from mirror buffer: 1
Poisoned data on WQE read from multicast buffer: 1
Poisoned data on NIX_RX_MCE_S read: 1
NIX_AF_RVU:
Unmap Slot Error: 0
# devlink health dump clear pci/0002:01:00.0 reporter hw_nix
# devlink health dump show pci/0002:01:00.0 reporter hw_nix
NIX_AF_GENERAL:
Memory Fault on NIX_AQ_INST_S read: 17
Memory Fault on NIX_AQ_RES_S write: 17
AQ Doorbell error: 17
Rx on unmapped PF_FUNC: 17
Rx multicast replication error: 17
Memory fault on NIX_RX_MCE_S read: 17
Memory fault on multicast WQE read: 17
Memory fault on mirror WQE read: 17
Memory fault on mirror pkt write: 17
Memory fault on multicast pkt write: 17
NIX_AF_RAS:
Poisoned data on NIX_AQ_INST_S read: 17
Poisoned data on NIX_AQ_RES_S write: 17
Poisoned data on HW context read: 17
Poisoned data on packet read from mirror buffer: 17
Poisoned data on packet read from mcast buffer: 17
Poisoned data on WQE read from mirror buffer: 17
Poisoned data on WQE read from multicast buffer: 17
Poisoned data on NIX_RX_MCE_S read: 17
NIX_AF_RVU:
Unmap Slot Error: 16
>
> > one dump, are the counters not going to get out of sync unless
> > something clears the dump every time it triggers?
Also, note that auto_dump can be turned off by the user:
# devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false
The user can then dump whenever required, which will return the correct counter values.
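
To make the stale-dump behaviour in the log above concrete, here is a toy
Python model. It is only a sketch, not the driver or devlink code. It assumes
that devlink keeps at most one dump per reporter, that "dump show" captures a
new dump only when none is stored, and that the test interface alternates
between raising the GENERAL+RAS events and the RVU event. ToyReporter and
inject are hypothetical names, and each event group is collapsed into a single
counter.

```python
from copy import deepcopy


class ToyReporter:
    """Simplified stand-in for a devlink health reporter (assumption, not kernel code)."""

    def __init__(self, auto_dump=True):
        self.auto_dump = auto_dump
        # One SW counter per event group; the real driver keeps one per event.
        self.counters = {"NIX_AF_GENERAL": 0, "NIX_AF_RAS": 0, "NIX_AF_RVU": 0}
        self.stored_dump = None  # assumed: at most one dump is kept

    def report(self, groups):
        """Count the events, then auto-dump only if no dump is stored yet."""
        for group in groups:
            self.counters[group] += 1
        if self.auto_dump and self.stored_dump is None:
            self.stored_dump = deepcopy(self.counters)

    def dump_show(self):
        """Return the stored dump, capturing one first if none exists."""
        if self.stored_dump is None:
            self.stored_dump = deepcopy(self.counters)
        return self.stored_dump

    def dump_clear(self):
        self.stored_dump = None


def inject(reporter, count):
    """Assumed injection pattern: odd tests hit GENERAL+RAS, even tests hit RVU."""
    for i in range(1, count + 1):
        if i % 2:
            reporter.report(["NIX_AF_GENERAL", "NIX_AF_RAS"])
        else:
            reporter.report(["NIX_AF_RVU"])


# auto_dump true: the first error stores a dump, later errors do not refresh it.
rep = ToyReporter(auto_dump=True)
inject(rep, 33)
stale = dict(rep.dump_show())
rep.dump_clear()
fresh = dict(rep.dump_show())

# auto_dump false: nothing is stored until the user asks for a dump.
manual = ToyReporter(auto_dump=False)
inject(manual, 33)
on_demand = dict(manual.dump_show())

print("before clear:", stale)
print("after clear: ", fresh)
print("auto_dump off:", on_demand)
```

In this model the dump taken before the clear shows 1/1/0 and the one taken
after the clear shows 17/17/16, matching the two dumps in the log. With
auto_dump off, the first "dump show" already reports 17/17/16.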

>
> Regards,
> -George