Re: [EXT] Re: [PATCH net-next v4 8/8] octeon_ep: add heartbeat monitor

From: Leon Romanovsky
Date: Wed Mar 29 2023 - 03:36:13 EST


On Thu, Mar 23, 2023 at 06:14:10PM +0000, Veerasenareddy Burru wrote:
>
>
> > -----Original Message-----
> > From: Leon Romanovsky <leon@xxxxxxxxxx>
> > Sent: Thursday, March 23, 2023 3:47 AM
> > To: Veerasenareddy Burru <vburru@xxxxxxxxxxx>
> > Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Abhijit Ayarekar
> > <aayarekar@xxxxxxxxxxx>; Sathesh B Edara <sedara@xxxxxxxxxxx>;
> > Satananda Burla <sburla@xxxxxxxxxxx>; linux-doc@xxxxxxxxxxxxxxx; David S.
> > Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>;
> > Jakub Kicinski <kuba@xxxxxxxxxx>; Paolo Abeni <pabeni@xxxxxxxxxx>
> > Subject: [EXT] Re: [PATCH net-next v4 8/8] octeon_ep: add heartbeat
> > monitor
> >
> > External Email
> >
> > ----------------------------------------------------------------------
> > On Wed, Mar 22, 2023 at 02:19:57AM -0700, Veerasenareddy Burru wrote:
> > > Monitor periodic heartbeat messages from device firmware.
> > > Presence of heartbeat indicates the device is active and running.
> > > If the heartbeat is missed for configured interval indicates firmware
> > > has crashed and device is unusable; in this case, PF driver stops and
> > > uninitialize the device.
> > >
> > > Signed-off-by: Veerasenareddy Burru <vburru@xxxxxxxxxxx>
> > > Signed-off-by: Abhijit Ayarekar <aayarekar@xxxxxxxxxxx>
> > > ---
> > > v3 -> v4:
> > > * 0007-xxx.patch in v3 is 0008-xxx.patch in v4.
> > >
> > > v2 -> v3:
> > > * 0009-xxx.patch in v2 is now 0007-xxx.patch in v3 due to
> > > 0007 and 0008.patch from v2 are removed in v3.
> > >
> > > v1 -> v2:
> > > * no change

<...>

> > > + struct octep_device *oct = container_of(work, struct octep_device,
> > > + hb_task.work);
> > > +
> > > + int miss_cnt;
> > > +
> > > + atomic_inc(&oct->hb_miss_cnt);
> > > + miss_cnt = atomic_read(&oct->hb_miss_cnt);
> >
> > miss_cnt = atomic_inc_return(&oct->hb_miss_cnt);
> >
>
> Thanks for the feedback. Will fix it.
>
> > > + if (miss_cnt < oct->conf->max_hb_miss_cnt) {
> >
> > How is this heartbeat working? You increment on every entry to
> > octep_hb_timeout_task(), After max_hb_miss_cnt invocations, you will stop
> > your device.
> >
> > Thanks
> >
>
> Yes, device will be stopped after max_hb_miss_cnt heartbeats are missed.

If I read the code correctly, the device will stop after max_hb_miss_cnt
calls to octep_hb_timeout_task(), which happen every
msecs_to_jiffies(oct->conf->hb_interval * 1000). You don't cancel or
reschedule the job if the timeout doesn't happen.
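
To make the concern concrete, here is a minimal userspace sketch (C11
atomics, not the driver code) of the counting scheme as I understand it.
The names hb_monitor, hb_received() and hb_timeout_tick() are made up for
illustration. The reset in hb_received() is an assumption: it is not
visible in the quoted hunk, and without something like it the periodic
task alone would stop the device after max_hb_miss_cnt runs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not struct octep_device. */
struct hb_monitor {
	atomic_int miss_cnt;	/* timeouts seen since the last heartbeat */
	int max_miss_cnt;	/* plays the role of conf->max_hb_miss_cnt */
	bool stopped;		/* set once the device would be stopped */
};

static void hb_init(struct hb_monitor *hb, int max_miss_cnt)
{
	assert(max_miss_cnt > 0);
	atomic_init(&hb->miss_cnt, 0);
	hb->max_miss_cnt = max_miss_cnt;
	hb->stopped = false;
}

/* Assumed reset path: runs when a heartbeat arrives from firmware. */
static void hb_received(struct hb_monitor *hb)
{
	atomic_store(&hb->miss_cnt, 0);
}

/*
 * Periodic timeout handler. Returns true if the caller should re-arm
 * the timer, false once the miss limit is reached.
 */
static bool hb_timeout_tick(struct hb_monitor *hb)
{
	/* fetch_add returns the old value; +1 mirrors atomic_inc_return(). */
	int miss_cnt = atomic_fetch_add(&hb->miss_cnt, 1) + 1;

	if (miss_cnt < hb->max_miss_cnt)
		return true;

	hb->stopped = true;
	return false;
}
```

With max_miss_cnt = 3, two ticks followed by hb_received() and two more
ticks keep the monitor running; a third consecutive tick without a
heartbeat stops it.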

Thanks

>
> > > + queue_delayed_work(octep_wq, &oct->hb_task,
> > > + msecs_to_jiffies(oct->conf->hb_interval * 1000));