Re: [PATCH] vfio/pci: Support error recovery
From: Cao jin
Date: Fri Dec 09 2016 - 02:47:42 EST
On 12/09/2016 12:30 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 08, 2016 at 10:46:59PM +0800, Cao jin wrote:
>> On 12/06/2016 11:35 PM, Alex Williamson wrote:
>>> On Tue, 6 Dec 2016 18:46:04 +0800
>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>> On 12/06/2016 12:59 PM, Alex Williamson wrote:
>>>>> On Tue, 6 Dec 2016 05:55:28 +0200
>>>>> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>>>>>> On Mon, Dec 05, 2016 at 09:17:30AM -0700, Alex Williamson wrote:
>>>>>>> If you're going to take the lead for these AER patches, I would
>>>>>>> certainly suggest that understanding the reasoning behind the bus reset
>>>>>>> behavior is a central aspect to this series. This effort has dragged
>>>>>>> out for nearly two years and I apologize, but I don't really have a lot
>>>>>>> of patience for rehashing some of these issues if you're not going to
>>>>>>> read the previous discussions or consult with your colleagues to
>>>>>>> understand how we got to this point. If you want to challenge some of
>>>>>>> the design points, that's great, it could use some new eyes, but please
>>>>>>> understand how we got here first.
>>>>>> Well I'm guessing Cao jin here isn't the only one not
>>>>>> willing to plough through all historical versions of the patchset
>>>>>> just to figure out the motivation for some code.
>>>>>> Including a summary of a high level architecture couldn't hurt.
>>>>>> Any chance of writing such? Alternatively, we can try to build it as
>>>>>> part of this thread. Shouldn't be hard as it seems somewhat
>>>>>> straight-forward on the surface:
>>>>>> - detect link error on the host, don't reset link as we would normally do
>>>>> This is actually a new approach that I'm not sure I agree with. By
>>>>> skipping the host directed link reset, vfio is taking responsibility
>>>>> for doing this, but then we just assume the user will do it. I have
>>>>> issues with this.
>>>>> The previous approach was to use the error detected notifier to block
>>>>> access to the device, allowing the host to perform the link reset. A
>>>>> subsequent notification in the AER process released the user access
>>>>> which allowed the user AER process to proceed. This did result in both
>>>>> a host directed and a guest directed link reset, but other than
>>>>> coordinating the blocking of the user process during host reset, that
>>>>> hasn't been brought up as an issue previously.
>>>> Tests on previous versions didn't bring up issues as I find, I think
>>>> that is because we didn't test it completely. As I know, before August
>>>> of this year, we didn't have cable connected to NIC, let alone
>>>> connecting NIC to gateway.
>>> Lack of testing has been a significant issue throughout the development
>>> of this series.
>>>> Even if I fixed the guest oops issue in igb driver that Alex found in
>>>> v9, v9 still cannot work in my test. And in my test, disable link
>>>> reset(in host) in aer core for vfio-pci is the most significant step to
>>>> get my test passed.
>>> But is it the correct step? I'm not convinced. Why did blocking guest
>>> access not work? How do you plan to manage vfio taking the
>>> responsibility to perform a bus reset when you don't know whether QEMU
>>> is the user of the device or whether the user supports AER recovery?
>> Maybe currently we don't have enough proof to prove the correctness, but
>> I think I did find some facts to prove that link reset in host is a big
>> trouble, and can answer part of questions above.
>> 1st, some thoughts:
>> In pci-error-recovery.txt and do_recovery() of kernel tree, we can see,
>> a recovery consists of several steps(callbacks), link reset is one of
>> them, and except link reset, the others are seems kind of device
>> specific. In our case, both host & guest will do recovery, I think the
>> host recovery actually is some kind of fake recovery, see vfio-pci
>> driver's error_detected & resume callback, they don't do anything
>> special, mainly signal error to user, but the link reset in host "fake
>> reset" does some serious work, in other words, I think host does the
>> recovery incompletely, so I was thinking, why not just drop incompletely
>> host recovery(drop link reset) for vfio-pci, and let the guest take care
>> of the whole serious recovery. This is part of the reason of why my
>> version looks like this. But yes, I admit the issue Alex mentioned,
>> vfio can't guarantee that user will do a bus reset, this is an issue I
>> will keep looking for a solution.
>> 2nd, some facts and analyzation from test:
>> In host, the relationship between time and behviour in each component
>> roughly looks as following:
>> + HW + host kernel + qemu + guest kernel +
>> | |(error recovery)| | |
>> | | | | |
>> | | vfio-pci's | | |
>> | | error_detected | | |
>> | | + | | |
>> | | | | | |
>> | | | error notify | | |
>> | | | via eventfd | | |
>> | | +---------------> +----------+ | |
>> | | | +vfio_err_ | | |
>> | | | |notifier_ | | |
>> | +---- +<---+link reset | |handler | | |
>> | | HW | | | | | | |
>> | | | | | | | |
>> | |r | | vfio-pci's | |pass aer | | |
>> | |e.. | | resume | |to guest | | |
>> | |s. | | (blocking end) | | | | |
>> | |e | | | | *2* | | |
>> | |t | | | +----+-----+ | |
>> | |i | | | | | |
>> | |n | | | +--------> +----------+ |
>> | |g | | | | | guest | |
>> | | | | | | | recovery | |
>> | | | | | | | process | |
>> | | | | | | |(include | |
>> | | | | | | |register | |
>> | | *1* | | | | |access) | |
>> | | | | | | | | |
>> | | | | | | | *3* | |
>> | | | | +----------+ |
>> Time | | | | |
>> | | | | |
>> | | | | |
>> | | | | |
>> v | | | |
>> Now let me try to answer: Why did blocking guest access not work?
>> Some important factor:
>> 1. host recovery doesn't do anything special except error notifying, so
>> it may be executed very fast.
>> 2. Hardware resetting time is not sure, from pcie spec 6.6.1, guessing
>> it need many ms, pretty long?
>> some facts found in v9(block config write, not read, during host
>> recovery) test:
>> 1. reading uncor error register in vfio_err_notifier_handler sometimes
>> returns correct value, sometimes return invalid value(All F's)
>> So, I am thinking, if host blocking on host end early, and *2*, *3* is
>> parallel with *1*, the way used in v9 to blocking guest access, may not
> I don't think this answers Alex's question. He is simply asking
> which kind of errors are we recovering from, and why is FLR
> sufficient. All this info is completely lacking in
> current linux code, too. Why does it reset the link?
> Express spec certainly does not say we should.
> Link reset is the strongest kind of reset we can do though,
I feel there is a mistake here. I have said several times that I
understand and agree with Alex's points: if I disable host link reset,
then "guest bus reset should induce a host bus reset" on fatal error,
and FLR is not sufficient. This is a fault in this version, I will
address it in next version.
Alex's question "Why did blocking guest access not work?", make me feel
he is talking the way used in last version kernel patch[*], to block
config write, but not read, in host recovery. I guess it is also a
result of long discussion. But it[*] doesn't work well, that's why I
draw a figure to explain.
I am aware of "Link reset is the strongest kind of reset we can do
though", I even spent a whole day on "what the link reset exactly means"
digging in express spec, and induced some unrestrained thoughts:
(I don't have electrical engineering background, and the following
thoughts don't have relationship with our discussion)
set secondary bus reset bit direct physical layer to send TS1(or TS2)
twice(express spec 126.96.36.199) on the link, device received these TS, then
reset itself(my inference, didn't see exact words). So if data(TS) can
still be transferred correctly on physical layer, is link unreliable? or
the *unreliable* is targeting data link layer and transaction Layer?
> so maybe the thinking goes, let's just do it.
Sure I will. But I still have thoughts unsolved on the restrictions of
configuration, there may be another mail on this topic later.
> You would like to weaken this, but it's hard to
> be sure it's enough, we are dealing with misbehaving
> hardware after all.
> So what would help, is a list:
> - error type
> - recovery required
> - how to do it in a VM
> As for blocking guest access, I'm not sure why it's important
> to do. Guests can put devices in bad states, that's a given.
> Why isn't it enough to tell guest that device is in a bad state?
> I think it will try to avoid accesses then, if it does not,
> it is not supposed to do anything too bad.
>> Cao jin