Re: [PATCH v8 2/3] Documentation: add a isolation strategy sysfs node for uacce
From: Greg KH
Date: Mon Sep 19 2022 - 05:33:47 EST
On Mon, Sep 19, 2022 at 11:21:30AM +0800, yekai (A) wrote:
>
>
> On 2022/9/9 16:27, Greg KH wrote:
> > On Fri, Sep 02, 2022 at 03:13:03AM +0000, Kai Ye wrote:
> >> Update documentation describing sysfs node that could help to
> >> configure isolation strategy for users in the user space. And
> >> describing sysfs node that could read the device isolated state.
> >>
> >> Signed-off-by: Kai Ye <yekai13@xxxxxxxxxx>
> >> ---
> >> Documentation/ABI/testing/sysfs-driver-uacce | 26 ++++++++++++++++++++
> >> 1 file changed, 26 insertions(+)
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> >> index 08f2591138af..af5bc2f326d2 100644
> >> --- a/Documentation/ABI/testing/sysfs-driver-uacce
> >> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> >> @@ -19,6 +19,32 @@ Contact: linux-accelerators@xxxxxxxxxxxxxxxx
> >> Description: Available instances left of the device
> >> Return -ENODEV if uacce_ops get_available_instances is not provided
> >>
> >> +What: /sys/class/uacce/<dev_name>/isolate_strategy
> >> +Date: Sep 2022
> >> +KernelVersion: 6.0
> >> +Contact: linux-accelerators@xxxxxxxxxxxxxxxx
> >> +Description: (RW) Configure the frequency size for the hardware error
> >> + isolation strategy. This size is a configured integer value.
> >> + The default is 0. The maximum value is 65535. This value is a
> >> + threshold based on your driver strategies.
> > I do not understand what the units are here.
> >
> > How is anyone supposed to know what they are?
>
> The unit is a count: the number of occurrences within a time period, which serves as the threshold.
> If the number of PCI AER errors on the device exceeds the threshold within a time window, the device is
> isolated.
Please document this very very well.
> >> + For example, in the hisilicon accelerator engine, first we will
> >> + time-stamp every slot AER error. Then check the AER error log
> >> + when the device AER error occurred. if the device slot AER error
> >> + count exceeds the preset the number of times in one hour, the
> >> + isolated state will be set to true. So the device will be
> >> + isolated. And the AER error log that exceed one hour will be
> >> + cleared. Of course, different strategies can be defined in
> >> + different drivers.
> > So this file can contain values of different units depending on the
> > different driver that creates it? How is anyone supposed to know what
> > it is and what it should be?
> >
> > This feels very loose, please define this much better so that it can be
> > understood and maintained properly.
> >
> > thanks,
> >
> > greg k-h
> >
> Yes. We started out with the idea of not restricting the different drivers, only specifying the input and output,
> because we think different drivers require different processing strategies.
What different drivers? You only have one! And why do you need a
framework for only one driver? You should only add that when you have
multiple users to ensure you got the framework correct; otherwise you do
not know how it will be used.
thanks,
greg k-h
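
The threshold semantics described in the quoted patch (time-stamp each AER error, discard log entries older than one hour, isolate once the count in the window exceeds the configured value) can be modelled roughly as below. This is only a user-space sketch of the policy, not the driver code: the class and method names are hypothetical, the one-hour window and the 65535 maximum are taken from the quoted description, and treating a threshold of 0 as "never isolate" is an assumption, since the thread does not say what the default of 0 means.

```python
from collections import deque

WINDOW_SECONDS = 3600   # one-hour window, from the quoted hisilicon example
MAX_THRESHOLD = 65535   # maximum value stated in the proposed ABI text


class IsolationTracker:
    """Hypothetical model of the isolation policy; not a kernel interface."""

    def __init__(self, threshold=0):
        if not 0 <= threshold <= MAX_THRESHOLD:
            raise ValueError("threshold must be in 0..65535")
        self.threshold = threshold
        self.errors = deque()    # timestamps (seconds) of recent AER errors
        self.isolated = False

    def record_error(self, now):
        """Log one AER error at time `now` and return the isolated state."""
        # Clear log entries that are older than the time window.
        while self.errors and now - self.errors[0] > WINDOW_SECONDS:
            self.errors.popleft()
        self.errors.append(now)
        # Assumption: a threshold of 0 disables isolation entirely.
        if self.threshold and len(self.errors) > self.threshold:
            self.isolated = True
        return self.isolated
```

With a threshold of 2, the third error inside one hour sets the isolated state, while errors spread more than an hour apart never accumulate. Writing the unit down this explicitly (a count of errors per fixed window) is the kind of definition the review asks the ABI text to provide.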