Re: [PATCH] vfio/pci: Support error recovery
From: Cao jin
Date: Fri Dec 09 2016 - 02:47:47 EST
On 12/08/2016 10:46 PM, Cao jin wrote:
> On 12/06/2016 11:35 PM, Alex Williamson wrote:
>> On Tue, 6 Dec 2016 18:46:04 +0800
>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>> On 12/06/2016 12:59 PM, Alex Williamson wrote:
>>>> On Tue, 6 Dec 2016 05:55:28 +0200
>>>> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>>>>> On Mon, Dec 05, 2016 at 09:17:30AM -0700, Alex Williamson wrote:
>>>>>> If you're going to take the lead for these AER patches, I would
>>>>>> certainly suggest that understanding the reasoning behind the bus reset
>>>>>> behavior is a central aspect to this series. This effort has dragged
>>>>>> out for nearly two years and I apologize, but I don't really have a lot
>>>>>> of patience for rehashing some of these issues if you're not going to
>>>>>> read the previous discussions or consult with your colleagues to
>>>>>> understand how we got to this point. If you want to challenge some of
>>>>>> the design points, that's great, it could use some new eyes, but please
>>>>>> understand how we got here first.
>>>>> Well I'm guessing Cao jin here isn't the only one not
>>>>> willing to plough through all historical versions of the patchset
>>>>> just to figure out the motivation for some code.
>>>>> Including a summary of a high level architecture couldn't hurt.
>>>>> Any chance of writing such? Alternatively, we can try to build it as
>>>>> part of this thread. Shouldn't be hard as it seems somewhat
>>>>> straight-forward on the surface:
>>>>> - detect link error on the host, don't reset link as we would normally do
>>>> This is actually a new approach that I'm not sure I agree with. By
>>>> skipping the host directed link reset, vfio is taking responsibility
>>>> for doing this, but then we just assume the user will do it. I have
>>>> issues with this.
>>>> The previous approach was to use the error detected notifier to block
>>>> access to the device, allowing the host to perform the link reset. A
>>>> subsequent notification in the AER process released the user access
>>>> which allowed the user AER process to proceed. This did result in both
>>>> a host directed and a guest directed link reset, but other than
>>>> coordinating the blocking of the user process during host reset, that
>>>> hasn't been brought up as an issue previously.
>>> Tests on previous versions didn't bring up issues as I find, I think
>>> that is because we didn't test it completely. As I know, before August
>>> of this year, we didn't have cable connected to NIC, let alone
>>> connecting NIC to gateway.
>> Lack of testing has been a significant issue throughout the development
>> of this series.
>>> Even if I fixed the guest oops issue in igb driver that Alex found in
>>> v9, v9 still cannot work in my test. And in my test, disable link
>>> reset(in host) in aer core for vfio-pci is the most significant step to
>>> get my test passed.
>> But is it the correct step? I'm not convinced. Why did blocking guest
>> access not work? How do you plan to manage vfio taking the
>> responsibility to perform a bus reset when you don't know whether QEMU
>> is the user of the device or whether the user supports AER recovery?
> Maybe currently we don't have enough proof to prove the correctness, but
> I think I did find some facts to prove that link reset in host is a big
> trouble, and can answer part of questions above.
> 1st, some thoughts:
> In pci-error-recovery.txt and do_recovery() of kernel tree, we can see,
> a recovery consists of several steps(callbacks), link reset is one of
> them, and except link reset, the others are seems kind of device
> specific. In our case, both host & guest will do recovery, I think the
> host recovery actually is some kind of fake recovery, see vfio-pci
> driver's error_detected & resume callback, they don't do anything
> special, mainly signal error to user, but the link reset in host "fake
> reset" does some serious work, in other words, I think host does the
> recovery incompletely, so I was thinking, why not just drop incompletely
> host recovery(drop link reset) for vfio-pci, and let the guest take care
> of the whole serious recovery. This is part of the reason of why my
> version looks like this. But yes, I admit the issue Alex mentioned,
> vfio can't guarantee that user will do a bus reset, this is an issue I
> will keep looking for a solution.
> 2nd, some facts and analyzation from test:
> In host, the relationship between time and behviour in each component
> roughly looks as following:
> + HW + host kernel + qemu + guest kernel +
> | |(error recovery)| | |
> | | | | |
> | | vfio-pci's | | |
> | | error_detected | | |
> | | + | | |
> | | | | | |
> | | | error notify | | |
> | | | via eventfd | | |
> | | +---------------> +----------+ | |
> | | | +vfio_err_ | | |
> | | | |notifier_ | | |
> | +---- +<---+link reset | |handler | | |
> | | HW | | | | | | |
> | | | | | | | |
> | |r | | vfio-pci's | |pass aer | | |
> | |e.. | | resume | |to guest | | |
> | |s. | | (blocking end) | | | | |
> | |e | | | | *2* | | |
> | |t | | | +----+-----+ | |
> | |i | | | | | |
> | |n | | | +--------> +----------+ |
> | |g | | | | | guest | |
> | | | | | | | recovery | |
> | | | | | | | process | |
> | | | | | | |(include | |
> | | | | | | |register | |
> | | *1* | | | | |access) | |
> | | | | | | | | |
> | | | | | | | *3* | |
> | | | | +----------+ |
> Time | | | | |
> | | | | |
> | | | | |
> | | | | |
> v | | | |
> Now let me try to answer: Why did blocking guest access not work?
> Some important factor:
> 1. host recovery doesn't do anything special except error notifying, so
> it may be executed very fast.
> 2. Hardware resetting time is not sure, from pcie spec 6.6.1, guessing
> it need many ms, pretty long?
> some facts found in v9(block config write, not read, during host
> recovery) test:
> 1. reading uncor error register in vfio_err_notifier_handler sometimes
> returns correct value, sometimes return invalid value(All F's)
I forget another facts:
2. the igb bug[*] our patch v9 triggered also is the same kind things.
error_detected() of igb in guest read config space, and return invalid
all F's, then igb NULLed its hardware address, then oops shows.
I think it is also a concurrent issue as described below.
> So, I am thinking, if host blocking on host end early, and *2*, *3* is
> parallel with *1*, the way used in v9 to blocking guest access, may not