Re: [PATCH] vfio/pci: Support error recovery
From: Michael S. Tsirkin
Date: Wed Dec 14 2016 - 18:00:12 EST
On Wed, Dec 14, 2016 at 03:47:43PM -0700, Alex Williamson wrote:
> On Thu, 15 Dec 2016 00:20:20 +0200
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>
> > On Tue, Dec 13, 2016 at 08:00:22PM -0700, Alex Williamson wrote:
> > > On Wed, 14 Dec 2016 03:58:17 +0200
> > > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> > >
> > > > On Tue, Dec 13, 2016 at 09:27:59AM -0700, Alex Williamson wrote:
> > > > > On Tue, 13 Dec 2016 18:12:34 +0200
> > > > > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> > > > >
> > > > > > On Mon, Dec 12, 2016 at 08:39:48PM -0700, Alex Williamson wrote:
> > > > > > > On Tue, 13 Dec 2016 05:15:13 +0200
> > > > > > > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > On Mon, Dec 12, 2016 at 03:43:13PM -0700, Alex Williamson wrote:
> > > > > > > > > > So just don't do it then. Topology must match between host and guest,
> > > > > > > > > > except maybe for the case of devices with host driver (e.g. PF)
> > > > > > > > > > which we might be able to synchronize against.
> > > > > > > > >
> > > > > > > > > We're talking about host kernel level handling here. The host kernel
> > > > > > > > > cannot defer the link reset to the user under the assumption that the
> > > > > > > > > user is handling the devices in a very specific way. The moment we do
> > > > > > > > > that, we've lost.
> > > > > > > >
> > > > > > > > The way is same as baremetal though, so why not?
> > > > > > >
> > > > > > > How do we know this? What if the user is dpdk? The kernel is
> > > > > > > responsible for maintaining the integrity of the system and devices,
> > > > > > > not the user.
> > > > > > >
> > > > > > > > And if user doesn't do what's expected, we can
> > > > > > > > do the full link reset on close.
> > > > > > >
> > > > > > > That's exactly my point, if we're talking about multiple devices,
> > > > > > > there's no guarantee that the close() for each is simultaneous. If one
> > > > > > > function is released before the other we cannot do a bus reset. If
> > > > > > > that device is then opened by another user before its sibling is
> > > > > > > released, then we once again cannot perform a link reset. I don't
> > > > > > > think it would be reasonable to mark the released device quarantined
> > > > > > > until the sibling is released, that would be a terrible user experience.
> > > > > >
> > > > > > Not sure why you find it so terrible, and I don't think there's another way.
> > > > >
> > > > > If we can't do it without regressing the support we currently have,
> > > > > let's not do it at all.
> > > >
> > > > Why would we regress? As long as there are no unrecoverable errors,
> > > > there's no need to change behaviour at all.
> > >
> > > Currently if a fatal error occurs we allow the host to reset the
> > > device, so to the best of our knowledge, the device is always reset.
> > > The proposal here allows gaps where we assume a particular guest
> > > behavior that allows the device to be returned to the host or opened by
> > > other users without that reset. Any plan that relies on a specific
> > > user behavior is fundamentally wrong imo.
> > >
> > > > Alex, do you have a picture of how error recovery can work in your mind?
> > > > Your answers seem to imply you do, and these patches don't implement
> > > > this correctly. I'm not sure about others, but I for one am unable to
> > > > piece it together from the comments you provide. If yes, could you
> > > > maybe do a short writeup of an architecture you would be comfortable
> > > > with?
> > >
> > > Clearly I have issues with this skip-the-host-reset plan, I don't think
> > > it works. We cannot assume the user will do error recovery.
> >
> > Absolutely but we can defer recovery until device close.
>
> No we can't, as I've tried to describe multiple times, if the functions
> are part of separate groups then they can be opened and closed
> asynchronously from each other and we may not have an opportunity where
> they are all closed together to perform a reset.
> The only option I can
> see for this is to quarantine the device, which as I've stated seems
> like a really poor solution.
>
> > Possibly with userspace invoking an ioctl requesting this,
> > to make sure we don't break any legacy setups.
>
> And how do we know that user is not malicious? How do we police
> whether they actually perform a reset? We can only track whether a
> reset has occurred, which leads back to the quarantine scenario.
I don't really know what do you call the quarantine scenario.
Malicious users get non working devices.
Why is this poor?
> > > As I
> > > stated previously we should enable the user to do error recovery
> > > without depending on the user to do error recovery. I'm a bit lost why
> > > we're focusing on this approach when the v9 approach of blocking user
> > > access to the device during the host recovery seemed like a better
> > > solution. I don't think I've said anything bad about that approach,
> > > but it does need further testing and debugging. Nobody seems to be
> > > interested in debugging why it wasn't quite working to understand
> > > whether that was an implementation issue or a design issue. That's
> > > currently the leading approach from my perspective. Thanks,
> > >
> > > Alex
> >
> > Hmm this doesn't really write up the architecture. It would be
> > really helpful to have that writeup.
>
> Perhaps the patch submitters can create one.
You already spent so many words on this - won't it
be easier for you to just write up what you consider
the right thing to do?
> > Do I guess right that what v9 tried to do is:
> > - block future userspace accesses
> > - do full reset
> > - re-enable userspace accesses
> >
> > ?
> >
> > The obvious issue with that is that device loses its state apparently at
> > random, and this is not something that happens on baremetal, ever.
> >
> > Thus I don't see how this can work without some PV code in guest.
>
> So there is no link error that can occur on hardware where the device
> becomes inaccessible? I'm pretty sure I've experienced cases
> otherwise.
There are errors like this (losing link) but this is IMO a major error
which isn't covered by AER.
It's possible that some guests have AER code that happens to recover
from lost link, but IMO that's never guaranteed.
Converting an uncorrectable error like a lost write into link loss
means escalating the problem to a whole different level.
--
MST