Re: [PATCH] vfio/pci: Support error recovery
From: Alex Williamson
Date: Wed Dec 14 2016 - 17:47:55 EST
On Thu, 15 Dec 2016 00:20:20 +0200
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> On Tue, Dec 13, 2016 at 08:00:22PM -0700, Alex Williamson wrote:
> > On Wed, 14 Dec 2016 03:58:17 +0200
> > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> >
> > > On Tue, Dec 13, 2016 at 09:27:59AM -0700, Alex Williamson wrote:
> > > > On Tue, 13 Dec 2016 18:12:34 +0200
> > > > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> > > >
> > > > > On Mon, Dec 12, 2016 at 08:39:48PM -0700, Alex Williamson wrote:
> > > > > > On Tue, 13 Dec 2016 05:15:13 +0200
> > > > > > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > > On Mon, Dec 12, 2016 at 03:43:13PM -0700, Alex Williamson wrote:
> > > > > > > > > So just don't do it then. Topology must match between host and guest,
> > > > > > > > > except maybe for the case of devices with host driver (e.g. PF)
> > > > > > > > > which we might be able to synchronize against.
> > > > > > > >
> > > > > > > > We're talking about host kernel level handling here. The host kernel
> > > > > > > > cannot defer the link reset to the user under the assumption that the
> > > > > > > > user is handling the devices in a very specific way. The moment we do
> > > > > > > > that, we've lost.
> > > > > > >
> > > > > > > The way is same as baremetal though, so why not?
> > > > > >
> > > > > > How do we know this? What if the user is dpdk? The kernel is
> > > > > > responsible for maintaining the integrity of the system and devices,
> > > > > > not the user.
> > > > > >
> > > > > > > And if user doesn't do what's expected, we can
> > > > > > > do the full link reset on close.
> > > > > >
> > > > > > That's exactly my point, if we're talking about multiple devices,
> > > > > > there's no guarantee that the close() for each is simultaneous. If one
> > > > > > function is released before the other we cannot do a bus reset. If
> > > > > > that device is then opened by another user before its sibling is
> > > > > > released, then we once again cannot perform a link reset. I don't
> > > > > > think it would be reasonable to mark the released device quarantined
> > > > > > until the sibling is released, that would be a terrible user experience.
> > > > >
> > > > > Not sure why you find it so terrible, and I don't think there's another way.
> > > >
> > > > If we can't do it without regressing the support we currently have,
> > > > let's not do it at all.
> > >
> > > Why would we regress? As long as there are no unrecoverable errors,
> > > there's no need to change behaviour at all.
> >
> > Currently if a fatal error occurs we allow the host to reset the
> > device, so to the best of our knowledge, the device is always reset.
> > The proposal here allows gaps where we assume a particular guest
> > behavior that allows the device to be returned to the host or opened by
> > other users without that reset. Any plan that relies on a specific
> > user behavior is fundamentally wrong imo.
> >
> > > Alex, do you have a picture of how error recovery can work in your mind?
> > > Your answers seem to imply you do, and these patches don't implement
> > > this correctly. I'm not sure about others, but I for one am unable to
> > > piece it together from the comments you provide. If yes, could you
> > > maybe do a short writeup of an architecture you would be comfortable
> > > with?
> >
> > Clearly I have issues with this skip-the-host-reset plan, I don't think
> > it works. We cannot assume the user will do error recovery.
>
> Absolutely but we can defer recovery until device close.
No we can't, as I've tried to describe multiple times, if the functions
are part of separate groups then they can be opened and closed
asynchronously from each other and we may not have an opportunity where
they are all closed together to perform a reset. The only option I can
see for this is to quarantine the device, which as I've stated seems
like a really poor solution.
> Possibly with userspace invoking an ioctl requesting this,
> to make sure we don't break any legacy setups.
And how do we know that user is not malicious? How do we police
whether they actually perform a reset? We can only track whether a
reset has occurred, which leads back to the quarantine scenario.
> > As I
> > stated previously we should enable the user to do error recovery
> > without depending on the user to do error recovery. I'm a bit lost why
> > we're focusing on this approach when the v9 approach of blocking user
> > access to the device during the host recovery seemed like a better
> > solution. I don't think I've said anything bad about that approach,
> > but it does need further testing and debugging. Nobody seems to be
> > interested in debugging why it wasn't quite working to understand
> > whether that was an implementation issue or a design issue. That's
> > currently the leading approach from my perspective. Thanks,
> >
> > Alex
>
> Hmm this doesn't really write up the architecture. It would be
> really helpful to have that writeup.
Perhaps the patch submitters can create one.
> Do I guess right that what v9 tried to do is:
> - block future userspace accesses
> - do full reset
> - re-enable userspace accesses
>
> ?
>
> The obvious issue with that is that device loses its state apparently at
> random, and this is not something that happens on baremetal, ever.
>
> Thus I don't see how this can work without some PV code in guest.
So there is no link error that can occur on hardware where the device
becomes inaccessible? I'm pretty sure I've experienced cases
otherwise.