Re: [PATCH] vfio/pci: Support error recovery

From: Alex Williamson
Date: Tue Dec 13 2016 - 22:00:48 EST


On Wed, 14 Dec 2016 03:58:17 +0200
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

> On Tue, Dec 13, 2016 at 09:27:59AM -0700, Alex Williamson wrote:
> > On Tue, 13 Dec 2016 18:12:34 +0200
> > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> >
> > > On Mon, Dec 12, 2016 at 08:39:48PM -0700, Alex Williamson wrote:
> > > > On Tue, 13 Dec 2016 05:15:13 +0200
> > > > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> > > >
> > > > > On Mon, Dec 12, 2016 at 03:43:13PM -0700, Alex Williamson wrote:
> > > > > > > So just don't do it then. Topology must match between host and guest,
> > > > > > > except maybe for the case of devices with host driver (e.g. PF)
> > > > > > > which we might be able to synchronize against.
> > > > > >
> > > > > > We're talking about host kernel level handling here. The host kernel
> > > > > > cannot defer the link reset to the user under the assumption that the
> > > > > > user is handling the devices in a very specific way. The moment we do
> > > > > > that, we've lost.
> > > > >
> > > > > The way is same as baremetal though, so why not?
> > > >
> > > > How do we know this? What if the user is dpdk? The kernel is
> > > > responsible for maintaining the integrity of the system and devices,
> > > > not the user.
> > > >
> > > > > And if user doesn't do what's expected, we can
> > > > > do the full link reset on close.
> > > >
> > > > That's exactly my point, if we're talking about multiple devices,
> > > > there's no guarantee that the close() for each is simultaneous. If one
> > > > function is released before the other we cannot do a bus reset. If
> > > > that device is then opened by another user before its sibling is
> > > > released, then we once again cannot perform a link reset. I don't
> > > > think it would be reasonable to mark the released device quarantined
> > > > until the sibling is released, that would be a terrible user experience.
> > >
> > > Not sure why you find it so terrible, and I don't think there's another way.
> >
> > If we can't do it without regressing the support we currently have,
> > let's not do it at all.
>
> Why would we regress? As long as there are no unrecoverable errors,
> there's no need to change behaviour at all.

Currently if a fatal error occurs we allow the host to reset the
device, so to the best of our knowledge, the device is always reset.
The proposal here allows gaps where we assume a particular guest
behavior that allows the device to be returned to the host or opened by
other users without that reset. Any plan that relies on a specific
user behavior is fundamentally wrong imo.

> Alex, do you have a picture of how error recovery can work in your mind?
> Your answers seem to imply you do, and these patches don't implement
> this correctly. I'm not sure about others, but I for one am unable to
> piece it together from the comments you provide. If yes, could you
> maybe do a short writeup of an architecture you would be comfortable
> with?

Clearly I have issues with this skip-the-host-reset plan, I don't think
it works. We cannot assume the user will do error recovery. As I
stated previously we should enable the user to do error recovery
without depending on the user to do error recovery. I'm a bit lost why
we're focusing on this approach when the v9 approach of blocking user
access to the device during the host recovery seemed like a better
solution. I don't think I've said anything bad about that approach,
but it does need further testing and debugging. Nobody seems to be
interested in debugging why it wasn't quite working to understand
whether that was an implementation issue or a design issue. That's
currently the leading approach from my perspective. Thanks,

Alex