Re: [PATCH v6] vfio error recovery: kernel support
From: Alex Williamson
Date: Thu Apr 06 2017 - 11:31:35 EST
On Thu, 6 Apr 2017 16:49:35 +0800
Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
> On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:
> >> Apparently, I don't have experience to induce non-fatal error, device
> >> error is more of a chance related with the environment(temperature,
> >> humidity, etc) as I understand.
> >
> > I'm not sure how to interpret this statement. I think what Alex is
> > saying is simply that patches should include some justification. They
> > make changes but what are they improving?
> > For example:
> >
> > I tested device ABC in conditions DEF. Without a patch VM
> > stops. With the patches applied VM recovers and proceeds to
> > use the device normally.
> >
> > is one reasonable justification imho.
> >
>
> Got it. But unfortunately, until now, I haven't seen a VM stop caused by
> a real device non-fatal error during device assignment(Only saw real
> fatal errors after start VM).
> On one side, AER error could occur theoretically; on the other side,
> seldom people have seen a VM stop caused by AER. Now I am asked that do
> I have a real evidence or scenario to prove that this patchset is really
> useful? I don't, and we all know it is hard to trigger a real hardware
> error, so, seems I am pushed into the corner. I guess these questions
> also apply for AER driver's author, if the scenario is easy to
> reproduce, there is no need to write aer_inject to fake errors.
Yes, non-fatal errors are rare, so rare in fact that I wonder if
anything actually implements them. Could we take a survey of Linux
drivers with AER support and see which ones actually do anything useful
on a non-fatal error? I know fatal errors exist, I've seen them. The
situation I want to avoid is adding non-fatal AER support that never
gets used and just gets in the way of the next attempt to support fatal
error recovery. We're extending the user ABI here, which means that
for any feature we add, we need to consider how we'll support it long
term and how we'll deprecate it or further extend it when the next
level of support comes. I'm reluctant to add code that only partially
solves a problem and only addresses the side of the problem that we're
not even sure occurs. Thanks,
Alex