Re: [PATCH v6] vfio error recovery: kernel support

From: Alex Williamson
Date: Wed Apr 05 2017 - 18:56:31 EST


On Thu, 6 Apr 2017 01:36:31 +0300
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

> On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
> > On Thu, 6 Apr 2017 00:50:22 +0300
> > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> >
> > > On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
> > > > The previous intention of trying to handle all sorts of AER faults
> > > > clearly had more value, though even there the implementation and
> > > > configuration requirements restricted the practicality. For instance,
> > > > is AER support actually useful to a customer if it requires all ports
> > > > of a multifunction device assigned to the VM? This seems more like a
> > > > feature targeting whole system partitioning rather than general VM
> > > > device assignment use cases. Maybe that's ok, but it should be a clear
> > > > design decision.
> > >
> > > Alex, what kind of testing do you expect to be necessary?
> > > Would you say testing on real hardware and making it trigger
> > > AER errors is a requirement?
> >
> > Testing various fatal, non-fatal, and corrected errors with aer-inject,
> > especially in multifunction configurations (where more than one port
> > is actually usable) would certainly be required. If we have cases where
> > the driver for a companion function can escalate a non-fatal error to a
> > bus reset, that should be tested, even if it requires temporary hacks to
> > the host driver for the companion function to trigger that case. AER
> > handling is not something that the typical user is going to experience,
> > so it should be thoroughly tested to make sure it works when needed
> > or there's little point to doing it at all. Thanks,
> >
> > Alex
>
> Some things can be tested within a VM. What would you
> say would be sufficient on a VM and what has to be
> tested on bare metal?

Testing on a VM could be interesting for development, but I'd expect
bare metal for validation, no offense. Bus reset timing can be
different in a VM, error propagation can be different, etc. Thanks,

Alex