Re: [PATCH v13 0/6] Address error and recovery for AER and DPC

From: poza
Date: Mon Apr 16 2018 - 10:12:52 EST


On 2018-04-16 18:57, Bjorn Helgaas wrote:
On Mon, Apr 16, 2018 at 11:33:13AM +0530, poza@xxxxxxxxxxxxxx wrote:
On 2018-04-16 09:23, Sinan Kaya wrote:
> On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
> > On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
> > > This patch set brings in error handling support for DPC.
> > >
> > > The current implementation of AER and error message broadcasting
> > > to the EP driver is tightly coupled and limited to the AER service
> > > driver. It is important to factor out broadcasting and other link
> > > handling callbacks, so that callbacks are handled appropriately
> > > not only when AER gets triggered, but also when DPC gets
> > > triggered (e.g. for ERR_FATAL).
> > >
> > > DPC should behave identically to AER as far as error handling is
> > > concerned.
> > > DPC should remove the devices and not attempt recovery on
> > > hotplug-enabled systems.
> >
> > Is there a specific bug that's fixed by these patches? I didn't see
> > one mentioned in the changelogs.
> >
>
> There is no actual bug.
>
> We realized that DPC and hotplug are heavily integrated today. We
> have use cases for systems that have no hotplug support but still
> support DPC. That's the problem we are trying to solve with this
> patchset.

Apparently there's a problem with systems that have DPC but not
hotplug. It will be extremely helpful if you can articulate what that
problem is and include it in the appropriate changelog.

Adding to what Sinan said:

DPC should handle error handling and recovery the same way AER does,
because ultimately both attempt recovery in one way or another, and
for that the error handling and recovery framework has to be loosely
coupled. This gives uniformity and transparency to error handling
agents such as AER and DPC with respect to recovery and error
handling.

So, this patch-set tries to unify a lot of things between error agents
and make them behave in a well-defined way, be it error (FATAL,
NON_FATAL) handling or recovery.

I totally support this objective.

Thanks Bjorn, I will include this objective in the changelog along with Sinan's text.
I am not clear on one last thing, Bjorn:
do we need the last patch, patch 6, which handles the hotplug case?
Also, I think we could take this patch-set as the basic change, a first attempt to unify the code, which it does.

In follow-up patches we can then improve on things such as
whether to take different actions for FATAL and NON_FATAL cases, and I can make the needed changes to AER and DPC.
Please let me know how this sounds.
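
As a rough illustration of the follow-up idea above (a sketch only, not code from this patch-set; every name in it is hypothetical and none of it is an existing kernel interface), a common entry point shared by both agents could pick the action from the agent, the severity and whether the port is hotplug capable. The policy encoded here is an assumption drawn from this thread, not a settled design:

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not the kernel's API. */
enum err_agent    { AGENT_AER, AGENT_DPC };
enum err_severity { SEV_NON_FATAL, SEV_FATAL };
enum err_action   { ACT_NOTIFY_DRIVER, ACT_RESET_LINK, ACT_REMOVE_DEVICES };

/*
 * Single decision point called by every error agent.
 *
 * Assumed policy, taken from the discussion:
 *  - DPC on a hotplug-capable port removes the devices, no recovery;
 *  - otherwise a non-fatal error is only reported to the driver;
 *  - otherwise a fatal error goes through link reset and recovery.
 */
enum err_action common_recovery_action(enum err_agent agent,
                                       enum err_severity sev,
                                       bool hotplug_capable)
{
	if (agent == AGENT_DPC && hotplug_capable)
		return ACT_REMOVE_DEVICES;
	if (sev == SEV_NON_FATAL)
		return ACT_NOTIFY_DRIVER;
	return ACT_RESET_LINK;
}
```

With a single function like this, AER and DPC differ only in the arguments they pass, so changing the FATAL versus NON_FATAL behaviour later touches one place instead of two service drivers.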


Bjorn