Re: [PATCH v6] vfio error recovery: kernel support
From: Cao jin
Date: Thu Apr 06 2017 - 05:35:50 EST
On 04/06/2017 03:38 AM, Alex Williamson wrote:
> On Wed, 5 Apr 2017 16:54:33 +0800
> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>
>> Sorry for late. Distracted by other problem for a while.
>>
>> On 03/31/2017 02:16 AM, Alex Williamson wrote:
>>> On Thu, 30 Mar 2017 21:00:35 +0300
>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also asked in my previous comments to provide examples of errors that
>>>>>>>>> might trigger correctable errors to the user, this comment seems to
>>>>>>>>> have been missed. In my experience, AERs generated during device
>>>>>>>>> assignment are generally hardware faults or induced by bad guest
>>>>>>>>> drivers. These are cases where a single fatal error is an appropriate
>>>>>>>>> and sufficient response. We've scaled back this support to the point
>>>>>>>>> where we're only improving the situation of correctable errors and I'm
>>>>>>>>> not convinced this is worthwhile and we're not simply checking a box on
>>>>>>>>> an ill-conceived marketing requirements document.
>>>>>>>>
>>>>>>>> Sorry. I noticed that question: "what actual errors do we expect
>>>>>>>> userspace to see as non-fatal errors?", but I am confused about it.
>>>>>>>> Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
>>>>>>>> and Uncorrectable Error Severity Register will tell which is fatal, and
>>>>>>>> which is non-fatal, this register is configurable, they are device
>>>>>>>> specific as I guess. AER core driver distinguish them by
>>>>>>>> pci_channel_io_normal/pci_channel_io_frozen, So I don't understand your
>>>>>>>> question. Or
>>>>>>>>
>>>>>>>> Or, Do you mean we could list the default non-fatal error of
>>>>>>>> Uncorrectable Error Severity Register which is provided by PCIe spec?
>>>>>>>
>>>>>>> I'm trying to ask why is this patch series useful. It's clearly
>>>>>>> possible for us to signal non-fatal errors for a device to a guest, but
>>>>>>> why is it necessarily a good idea to do so? What additional RAS
>>>>>>> feature is gained by this? Can we give a single example of a real
>>>>>>> world scenario where a guest has been shutdown due to a non-fatal error
>>>>>>> that the guest driver would have handled?
>>>>>>
>>>>>> We've been discussing AER for months if not years.
>>>>>> Isn't it a bit too late to ask whether AER recovery
>>>>>> by guests it useful at all?
>>>>>
>>>>>
>>>>> Years, but I think that is more indicative of the persistence of the
>>>>> developers rather than growing acceptance on my part. For the majority
>>>>> of that we were headed down the path of full AER support with the guest
>>>>> able to invoke bus resets. It was a complicated solution, but it was
>>>>> more clear that it had some value. Of course that's been derailed
>>>>> due to various issues and we're now on this partial implementation that
>>>>> only covers non-fatal errors that we assume the guest can recover from
>>>>> without providing it mechanisms to do bus resets. Is there actual
>>>>> value to this or are we just trying to fill an AER checkbox on
>>>>> someone's marketing sheet? I don't think it's too much to ask for a
>>>>> commit log to include evidence or discussion about how a feature is
>>>>> actually a benefit to implement.
>>>>
>>>> Seems rather self evident but ok. So something like
>>>>
>>>> With this patch, guest is able to recover from non-fatal correctable
>>>> errors - as opposed to stopping the guest with no ability to
>>>> recover which was the only option previously.
>>>>
>>>> Would this address your question?
>>>
>>>
>>> No, that's just restating the theoretical usefulness of this. Have you
>>> ever seen a non-fatal error? Does it ever trigger? If we can't
>>> provide a real world case of this being useful, can we at least discuss
>>> the types of things that might trigger a non-fatal error for which the
>>> guest could recover? In patch 3/3 Cao Jin claimed we have a 50% chance
>>> of reducing VM stop conditions, but I suspect this is just a misuse of
>>> statistics, just because there are two choices, fatal vs non-fatal,
>>> does not make them equally likely. Do we have any idea of the
>>> incidence rate of non-fatal errors? Is it non-zero? Thanks,
>>>
>>
>> Apparently, I don't have experience to induce non-fatal error, device
>> error is more of a chance related with the environment(temperature,
>> humidity, etc) as I understand.
>>
>> After reading the discussion, can I construe that the old design with
>> full AER support is preferred than this new one? The core issue of the
>> old design is that the second host link reset make the subsequent
>> guest's register reading fail, and I think this can be solved by test
>> the device's accessibility(read device' register, all F's means
>> unaccessible. IIRC, EEH of Power also use this way to test device's
>> accessiblity) and delay guest's reading if device is temporarily
>> unaccessible.
>
> What is the actual AER handling requirement you're trying to solve? Is
> it simply to check a box on a marketing spec sheet for anything related
> to AER with device assignment or is it to properly handle a specific
> class of errors which a user might actually see and thus provide some
> tangible RAS benefit?
The motion we implement this feature is neither from a marketing spec,
nor we have seen AER error occurs when using VM with pass-throughed PCIe
device. It is a potential deficiency we find and think we could improve it.
--
Sincerely,
Cao jin
> Without even anecdotal incidence rates of
> non-fatal errors and no plan for later incorporating fatal error
> handling, this feels like we're just trying to implement anything to do
> with AER with no long term vision.
>
> The previous intention of trying to handle all sorts of AER faults
> clearly had more value, though even there the implementation and
> configuration requirements restricted the practicality. For instance
> is AER support actually useful to a customer if it requires all ports
> of a multifunction device assigned to the VM? This seems more like a
> feature targeting whole system partitioning rather than general VM
> device assignment use cases. Maybe that's ok, but it should be a clear
> design decision.
>
> I think perhaps the specific issue of the device being in reset while
> the guest tries to access it is only a symptom of deeper problems. Who
> owns resetting the bus on error? My position was that we can't make
> the user solely responsible for that since we should never trust the
> user. If the host handles the reset, then does that imply we have both
> host and guest triggered resets and how do we handle the device during
> the host reset. It's not clear how we can necessarily correlate a
> guest reset to a specific host reset. Would it make more sense to
> allow the host AER handling to release some control to the user with
> vfio doing cleanup at the end. These are hard problems that need to be
> thought through. I don't want to continue on a path of "can I just fix
> this one next thing and it will be ok?". The entire previous design is
> suspect. If you want to move this area forward, take ownership of it
> and propose a complete, end-to-end solution. Thanks,
>
> Alex
>
>
> .
>