RE: [PATCH v2 0/3] Add support for RAS DES feature in PCIe DW controller
From: Shradha Todi
Date: Fri Mar 22 2024 - 07:37:33 EST
> -----Original Message-----
> From: 'Manivannan Sadhasivam' <manivannan.sadhasivam@xxxxxxxxxx>
> Sent: 22 March 2024 16:10
> To: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> Cc: Shradha Todi <shradha.t@xxxxxxxxxxx>; bp@xxxxxxxxx;
> tony.luck@xxxxxxxxx; james.morse@xxxxxxx; mchehab@xxxxxxxxxx;
> rric@xxxxxxxxxx; lpieralisi@xxxxxxxxxx; kw@xxxxxxxxx; robh@kernelorg;
> bhelgaas@xxxxxxxxxx; jingoohan1@xxxxxxxxx;
> gustavo.pimentel@xxxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx;
> lukas.bulwahn@xxxxxxxxx; hongxing.zhu@xxxxxxx;
> pankaj.dubey@xxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
> pci@xxxxxxxxxxxxxxx; vidyas@xxxxxxxxxx; gost.dev@xxxxxxxxxxx;
> alim.akhtar@xxxxxxxxxxx; shiju.jose@xxxxxxxxxx; Terry Bowman
> <Terry.Bowman@xxxxxxx>
> Subject: Re: [PATCH v2 0/3] Add support for RAS DES feature in PCIe DW
> controller
>
> On Wed, Mar 20, 2024 at 10:01:44AM +0000, Jonathan Cameron wrote:
> > On Tue, 19 Mar 2024 22:03:15 +0530
> > 'Manivannan Sadhasivam' <manivannan.sadhasivam@xxxxxxxxxx> wrote:
> >
> > > On Thu, Feb 22, 2024 at 04:30:47PM +0530, Shradha Todi wrote:
> > > > + Borislav, Tony, James, Mauro, Robert
> > > >
> > > > Hi All,
> > > >
> > > > Synopsys DesignWare PCIe controllers have a vendor specific
> > > > capability (which means that this set of registers are only
> > > > present in DesignWare controllers) to perform debug operations called
> "RASDES".
> > > > The functionalities provided by this extended capability are:
> > > >
> > > > 1. Debug: This has some debug related diagnostic features like
> > > > holding LTSSM in certain states, reading the status of lane
> > > > detection, checking if any PCIe lanes are broken (RX Valid) and so
> > > > on. It's a debug only feature used for diagnostic use-cases.
> > > >
> > > > 2. Error Injection: This is a way to inject certain errors in PCIe
> > > > like LCRC, ECRC, Bad TLPs and so on. Again, this is a debug
> > > > feature and generally not used in functional use-case.
> > > >
> > > > 3. Statistical counters: This has 3 parts
> > > > - Error counters
> > > > - Non error counters (covered as part of perf [1])
> > > > - Time based analysis counters (covered as part of perf [1])
> > > >
> > > > Selective features of the above functionality has been
> > > > implemented by vendor specific PCIe controller drivers
> > > > (pcie-tegra194.c) that use Synopsys DesignWare PCIe controllers.
> > > > In order to make it useful to all vendors using DWC controller, we
> > > > had proposed a common implementation in DWC PCIe controller
> > > > directory
> > > > (drivers/pci/controller/dwc/) and our original idea was based on
> > > > debugfs filesystem. v1 and v2 are mentioned in [2] and [3].
> > > >
> > > > We got a suggestion to implement this as part of EDAC framework
> > > > [3] and we looked into the same. But as far as I understood, what
> > > > I am trying to implement is a very specific feature (only valid for Synopsys
> DWC PCIe controllers).
> >
> > For error part there are (at least superficially) similar features in
> > the PCIe standard that we've started thinking about how to support.
> >
> > See Flit Logging Extended capablity (7.7.8 in PCIe Base Spec rev6.
> > That has the benefit that they are part of the standard so we can
> > support them directly in portdrv / EP drivers using some library code
> > in the PCI core.
> >
>
> Sounds good. But v6 is a relatively new version and the DWC RAS predates that.
> So we still need to support it somehow (either in EDAC or in
> drivers/pci/controller/dwc).
>
For all PCIe spec standard errors (All of those mentioned in 6.2 in PCIe Base Spec rev6 like AER,
FLIT logging, DPC, SR-IOV Baseline Error Handling), extending the EDAC / portdrv / EP drivers
seems like the best way. But this particular RAS-DES functionality exists only for Synopsys
DesignWare controller, hence I feel the *DWC controller driver* should extend this as it is a
controller specific feature. What would you suggest?
Unless there a plan to support vendor specific implementations through EDAC or
portdrv callbacks? Because writing a new pci driver would cause 2 driver binding problem.
> > There are other interconnect and PCI PMU drivers that log retries etc
> > which are also basically error counts. At least some of that is done through perf
> today.
> >
>
> IMO all the RAS support should be exposed through EDAC, otherwise it defeats
> the purpose of the subsystem.
>
> - Mani
>
Yes, I see that lot of non-error based counters like memory packet tx/rx received,
nak dllp received, etc is done as a part of perf today. But I believe, the functionality I want to
implement has features like error injection which would not be suitable to be a part of perf.
> >
> > > > This doesn't seem to fit in very well with the EDAC framework and
> > > > we can hardly use any of the EDAC framework APIs. We tried
> > > > implementing a "pci_driver" but since a function driver will
> > > > already be running on the EP and portdrv on the root-complex, we
> > > > will not be able to bind 2 drivers to a single PCI device
> > > > (root-complex or endpoint). Ultimately, what I will be doing is
> > > > writing a platform driver with debugfs entries which will be present in EDAC
> directory instead of DWC directory.
> >
> > The addition of this type of functionality to pordrv is a long running question.
> > Everyone wants a solution, I believe some people are looking at it
> > (+CC Terry)
> >
> > Terry, another case for your long list.
> >
> > For the EP end, this should be fired up by the EP driver, whilst it
> > might be infrastructure used on a bunch of devices, it is a feature
> > of that particular EP - so you'd want to provide any functionality in
> > a form that could be used by both the EP driver and a nice shiny new portdrv
> replacement.
> >
> > > >
> > > > Can you please help us out by going through this thread [3] and
> > > > letting us know if our understanding is wrong at any point. If you
> > > > think it is a better idea to integrate this in the EDAC framework,
> > > > can you guide me as to how I can utilize the framework better?
> > > > Please let me know if you need any other information to conclude.
> > > >
> > > > [1]
> > > > https://lore.kernel.org/linux-pci/20231121013400.18367-1-xueshuai@
> > > > linux.alibaba.com/ [2]
> > > > https://lore.kernel.org/all/20210518174618.42089-1-shradha.t@samsu
> > > > ng.com/T/ [3]
> > > > https://lore.kernel.org/all/20231130115044.53512-1-shradha.t@samsu
> > > > ng.com/
> > > >
> > >
> > > Gentle ping for the EDAC maintainers.
> > >
> > > - Mani
> > >
> > > > Thanks,
> > > > Shradha
> > > >
> > > > > -----Original Message-----
> > > > > From: 'Manivannan Sadhasivam' <manivannan.sadhasivam@xxxxxxxxxx>
> > > > > Sent: 16 February 2024 19:19
> > > > > To: Shradha Todi <shradha.t@xxxxxxxxxxx>
> > > > > Cc: lpieralisi@xxxxxxxxxx; kw@xxxxxxxxx; robh@xxxxxxxxxx;
> > > > > bhelgaas@xxxxxxxxxx; jingoohan1@xxxxxxxxx;
> > > > > gustavo.pimentel@xxxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx;
> > > > > lukas.bulwahn@xxxxxxxxx; hongxing.zhu@xxxxxxx;
> > > > > pankaj.dubey@xxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
> > > > > pci@xxxxxxxxxxxxxxx; vidyas@xxxxxxxxxx; gost.dev@xxxxxxxxxxx
> > > > > Subject: Re: [PATCH v2 0/3] Add support for RAS DES feature in
> > > > > PCIe DW controller
> > > > >
> > > > > On Thu, Feb 15, 2024 at 02:55:06PM +0530, Shradha Todi wrote:
> > > > > >
> > > > > >
> > > > >
> > > > > [...]
> > > > >
> > > > > > > For the error injection and counters, we already have the
> > > > > > > EDAC framework. So adding them in the DWC driver doesn't make
> sense to me.
> > > > > > >
> > > > > >
> > > > > > Sorry for late response, was going through the EDAC framework
> > > > > > to understand
> > > > > better how we can fit RAS DES support in it. Below are some
> > > > > technical challenges found so far:
> > > > > > 1: This debugfs framework proposed [1] can run on both side of
> > > > > > the link i.e. RC
> > > > > and EP as it will be a part of the link controller platform
> > > > > driver. Here for the EP side the assumption is that it has Linux
> > > > > running, which is primarily a use case for chip-to-chip
> > > > > communication. After your suggestion to migrate to EDAC framework we
> studied and here are the findings:
> > > > > > - If we move to EDAC framework, we need to have RAS DES as a
> > > > > > pci_driver which will be binded based on vendor_id and
> > > > > > device_id. Our observation is that on EP side system we are
> > > > > > unable to bind two function driver (pci_driver), as
> > > > > > pci_endpoint_test function driver or some other chip-to-chip
> > > > > > function driver will already be bound. On the other hand, on
> > > > > > RC side we observed that if we have portdrv enabled in Linux
> > > > > > running on RC system, it gets bound to RC controller and then
> > > > > > it does not allow EDAC pci_driver to bind. So basically we see
> > > > > > a problem here, that we can't have two pci_driver binding to
> > > > > > same PCI device
> > > > > > 2: Another point is even though we use EDAC driver framework,
> > > > > > we may not be
> > > > > able to use any of EDAC framework APIs as they are mostly
> > > > > suitable for memory controller devices sitting on PCI BUS. We
> > > > > will end up using debugfs entries just via a pci_driver placed inside EDAC
> framework.
> > > > >
> > > > > Please wrap your replies to 80 characters.
> > > > >
> > > > > There is no need to bind the edac driver to VID:PID of the
> > > > > device. The edac driver can be a platform driver and you can
> > > > > instantiate the platform device from the DWC driver. This way,
> > > > > the PCI device can be assocaited with whatever driver, but still there can
> be a separate edac driver for handling errors.
> > > > >
> > > > > Regarding API limitation, you should ask the maintainer about
> > > > > the possibility of extending them.
> > > > >
> > > > > >
> > > > > > Please let me know if my understanding is wrong.
> > > > > >
> > > > > > > But first check with the perf driver author if they have any
> > > > > > > plans on adding the proposed functionality. If they do not
> > > > > > > have any plan or not working on it, then look into EDAC.
> > > > > > >
> > > > > > > - Mani
> > > > > > >
> > > > > >
> > > > > > Since we already worked and posted patches [1], [2], we will
> > > > > > continue to work
> > > > > on this and based on consent from community we will adopt to
> > > > > most suitable framework.
> > > > > > We see many subsystems like ethernet, usb, gpu, cxl having
> > > > > > debugfs files that
> > > > > give information about the current status of the running system
> > > > > and as of now based on our findings, we still feel there is no
> > > > > harm in having debugfs entry based support in DesignWare controller
> driver itself.
> > > > >
> > > > > There is no issue in exposing the debug information through
> > > > > debugfs, that's the sole purpose of the interface. But here, you
> > > > > are trying to add support for DWC RAS feature for which a dedicated
> framework already exists.
> > > > >
> > > > > And there will be more similar requests coming for vendor
> > > > > specific error protocols as well. So your investigation could benefit
> everyone.
> > > > >
> > > > > From your above investigation, looks like there are some
> > > > > shortcomings of the EDAC framework. So let's get that clarified
> > > > > by writing to the EDAC maintainers (keep us in CC). If the EDAC
> > > > > maintainer suggests you to add support for this feature in DWC driver
> itself citing some reasons, then no issues with me.
> > > > >
> > > > > - Mani
> > > > >
> > > > > --
> > > > > மணிவண்ணன் சதாசிவம்
> > > >
> > > >
> > >
> >
>
> --
> மணிவண்ணன் சதாசிவம்