Re: [PATCH] pci-error-recover: doc cleanup

From: Cao jin
Date: Fri Dec 09 2016 - 02:55:52 EST




On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>
>>
>> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>>> I suppose I'm confused, but I recall that link resets are non-fatal.
>>> Fatal errors typically require that the PCI adapter be completely
>>> reset, that any adapter firmware be reloaded from scratch, and that the
>>> device driver kill all device state and start from scratch. It's huge.
>>> If the fatal error is on a PCI device that is under a block device
>>> holding a file system, then (usually) there is no way to recover,
>>> because the block layer (and file system) cannot deal with a block
>>> device that disappeared and then reappeared some few seconds later.
>>> (maybe some future zfs or lvm or btrfs might be able to deal with
>>> this, but not today)
>>>
>>> By contrast, link resets are far more gentle: the device driver might
>>> have to discard some half-full FIFOs, or cancel some in-flight
>>> commands, but can otherwise gracefully recover without telling the
>>> higher layers that there were any problems.
>>>
>>> --linas
>>>
>>
>> I am a little confused too; I am not even sure we are talking about the
>> same *fatal error*. I mean the fatal error defined in the PCI Express
>> spec, chapter 6.2.2.2.1:
>>
>> Fatal errors are uncorrectable error conditions which render the
>> particular Link and related hardware unreliable. For Fatal errors, a
>> reset of the components on the Link may be required to return to
>> reliable operation. Platform handling of Fatal errors, and any efforts
>> to limit the effects of these errors, is platform implementation specific.
>>
>> Link reset means setting the *secondary bus reset* bit in the PCI bridge
>> config space, which resets the link and the device simultaneously; it is
>> the strongest kind of reset, as far as I know.
>
> OK, well, it's been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
>

At least, I can't find wording in the spec that says exactly that.
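
For reference, the point raised earlier in this thread (do_recovery() in
the AER core calls reset_link() only for a fatal error) can be modelled
roughly as below. This is a simplified userspace sketch, not the kernel
source: the enum values, the link_resets counter and the function
signatures are assumptions made for illustration, and the driver
callbacks that the real code broadcasts are left out.

```c
#include <stdbool.h>

/* Simplified severities; the names echo the kernel's, the values are
 * local to this sketch. */
enum aer_severity { AER_NONFATAL, AER_FATAL };

/* Hypothetical counter, used only to observe whether a reset happened. */
static int link_resets;

/* Stand-in for the AER core's reset_link(); here it only counts calls
 * and reports success. */
static bool reset_link(void)
{
	link_resets++;
	return true;
}

/* Models only the severity check discussed in this thread: the link is
 * reset for a fatal error and left alone for a non-fatal one. */
static bool do_recovery(enum aer_severity severity)
{
	if (severity == AER_FATAL && !reset_link())
		return false;	/* a failed link reset aborts recovery */

	return true;
}
```

In this sketch, do_recovery(AER_NONFATAL) leaves link_resets untouched,
while do_recovery(AER_FATAL) increments it by one.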

--
Sincerely,
Cao jin

> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state. Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by resetting
> just the link, without resetting the whole adapter. They may require
> resetting some device-driver state, but not all of it.
>
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.
>
> --linas
>
>>
>>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>>
>>>>
>>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> The platform resets the link, and then calls the link_reset() callback
>>>>>> on all affected device drivers. This is a PCI-Express specific state
>>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>>> +and is done whenever a fatal error has been detected that can be
>>>>>> "solved" by resetting the link. This call informs the driver of the
>>>>>
>>>>> As far as I can tell, the original text was correct here; why do you
>>>>> think this change needs to be made?
>>>>>
>>>>
>>>> See do_recovery() in the AER core: reset_link() is called only when a
>>>> fatal error is seen.
>>>>
>>>> --
>>>> Sincerely,
>>>> Cao jin
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Sincerely,
>> Cao jin