Re: [EXT] Re: [PATCH net v1] octeon_ep: initialise control mbox tasks before using APIs

From: Michal Schmidt
Date: Tue Dec 05 2023 - 04:00:55 EST


On Tue, Dec 5, 2023 at 7:50 AM Shinas Rasheed <srasheed@xxxxxxxxxxx> wrote:
> > -----Original Message-----
> > > On Sat, Dec 2, 2023 at 4:08 PM Shinas Rasheed <srasheed@xxxxxxxxxxx>
> > wrote:
> > > > Do INIT_WORK for the various workqueue tasks before the first
> > > > invocation of any control net APIs. Since octep_ctrl_net_get_info
> > > > was called before the control net receive work task was even
> > > > initialised, the function call wasn't returning actual firmware
> > > > info queried from Octeon.
> > >
> > > It might be more accurate to say that octep_ctrl_net_get_info depends
> > > on the processing of OEI events. This happens in intr_poll_task.
> > > That's why intr_poll_task needs to be queued earlier.
>
> Intr_poll_task is queued only when the interface is down and the PF cannot catch IRQs as they have been torn down.
> Elsewise, OEI events will trigger the OEI IRQ and consequently its handler.

Right. octep_ctrl_net_get_info is called from the probe function, and
at this point the netdev is not even registered yet. Hence the need
for intr_poll_task.
The reason I started wondering about intr_poll_task is that the commit
message talks about the INIT_WORK, but the patch also moves the
queue_delayed_work call and the reasoning for that move was missing in
the message.
I think the move is correct, but please expand the description.

> Nevertheless, your point is correct in that it
> needs to be queued earlier, but I think subsequently since it calls the control mbox task, that is more relevant and necessary as if it
> is not initialized, it cannot be scheduled even if OEI interrupts have been caught.

OK.

> > > Did octep_send_mbox_req previously always fail with EAGAIN after
> > ^^^^^^^^^^^^^^^^^^^^^
> > I meant octep_ctrl_net_get_info here.
> >
> > > running into the 500 ms timeout in octep_send_mbox_req?
>
> Yes it did, but as it was silent (note that we're not checking any error value), it didn't stop operation. I think I might have to update this patch
> to catch the error values as well (This was a relic from the original code which spawned an extra thread to setup device and hence couldn't give back
> an error value. That implementation was discouraged and we setup things at probe itself in the upstreamed code and can check error values)

Yes, please, catch that error value.

> > > Apropos octep_send_mbox_req... I think it has a race. "d" is put on
> > > the ctrl_req_wait_list after sending the request to the hardware. If
> > > the response arrives quickly, "d" might not yet be on the list when
> > > process_mbox_resp looks for it.
> > > Also, what protects ctrl_req_wait_list from concurrent access?
>
> Such a race condition is, I also think, valid, but is not currently occurring as response, after due processing from Octeon application,
> wouldn't arrive that quickly. Regarding concurrent access, there is currently no protection for ctrl_req_wait_list. Concurrent access here,
> can only happen if either two requests manage to get hold of the ctrl_req_wait_list or a request and a response manages to get hold of the
> ctrl_req_wait_list (the case you stated above).
>
> In the first case, since locks are implemented atop the control mbox itself, requests would have to in effect wait for their chance to
> queue their wait data "d" to ctrl_req_wait_list, avoiding concurrent access.
>
> The second case is valid, but as I stated, wouldn't happen practically. But I suppose we do have to handle all theoretical cases and perhaps
> locking can be done. I suppose a separate patch for it might be better.

Yes, fixing this should be a separate patch.

Thanks,
Michal