Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports

From: Terry Bowman
Date: Tue Jun 25 2024 - 10:31:53 EST

On 6/24/24 15:51, Dan Williams wrote:
> Terry Bowman wrote:
>> Hi Dan,
>> I added responses below.
>> On 6/21/24 14:04, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> This patchset provides RAS logging for CXL root ports, CXL downstream
>>>> switch ports, and CXL upstream switch ports. This includes changes to
>>>> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
>>>> cxl_pci callback.
>>>> The first 3 patches prepare for and add an atomic notifier chain to the
>>>> portdrv driver. The portdrv's notifier chain reports the port device's
>>>> AER internal errors to the registered callback(s). The preparation changes
>>>> include a portdrv update to call the uncorrectable handler for PCIe root
>>>> ports and PCIe downstream switch ports. Also, the AER correctable error
>>>> (CE) status is made available to the AER CE handler.
>>>> The next 4 patches are in preparation for adding an atomic notification
>>>> callback in the cxl_pci driver. This is for receiving AER internal error
>>>> events from the portdrv notifier chain. Preparation includes adding RAS
>>>> register block mapping, adding trace functions for logging, and
>>>> refactoring cxl_pci RAS functions for reuse.
>>>> The final 2 patches enable the AER internal error interrupts.
>>> [..]
>>>> Solutions Considered (1-4):
>>>> Below are solutions that were considered. Solution #4 is
>>>> implemented in this patchset.
>>> [..]
>>>> 2.) Update the AER driver to call cxl_pci driver's error handler before
>>>> calling pci_aer_handle_error()
>>>> This is similar to the existing RCH port error approach in aer.c.
>>>> In this solution the AER driver searches for a downstream CXL endpoint
>>>> to 'handle' detected CXL port protocol errors.
>>>> This is a good solution to consider if the one presented in this patchset
>>>> is not acceptable. I was initially reluctant to take this approach because it
>>>> adds more CXL coupling to the AER driver. But, I think this solution
>>>> would technically work. I believe Ming was working towards this
>>>> solution.
>>> I feel like the coupling is warranted because these things *are* PCIe
>>> and CXL ports, but it means solving the interrupt distribution problem.
>> I understand the service driver interrupt issue but it is not clear how it
>> applies to the CXL port error handling. Can you help me understand how the
>> interrupt issue affects CXL port AER UIE/CIE handling in the AER driver?
> Just the case of the AER MSI/-X vector being multiplexed with other CXL
> functionality on the same device. If the CXL interrupt vector is to be
> enabled later then it means MSI/-X vector enabling needs to be dynamic.
> ...but yeah, not a problem now as we are only talking about PCIe AER
> events and not multiplexing yet. I.e. that problem can be solved later.
>>>> 3.) Refactor portdrv
>>>> The portdrv refactoring solution is to change the portdrv service drivers
>>>> into PCIe auxiliary drivers. With this change the facility drivers can be
>>>> associated with a PCIe driver instead of being fixed-bound to the portdrv
>>>> driver. In this case the CXL port functionality would be added either as a
>>>> CXL auxiliary driver or as a CXL specific port driver.
>>>> This solution has challenges in the interrupt allocation by separate
>>>> auxiliary drivers and in binding of a specific driver. Binding is
>>>> currently based on PCIe class and would require extending the binding
>>>> logic to support multiple drivers for the same class.
>>>> Jonathan Cameron is working towards this solution by initially solving
>>>> for the PMU service driver.[1] It is using the auxiliary bus to associate
>>>> what were service drivers with the portdrv driver. Using a CXL auxiliary
>>>> for handling CXL port RAS errors would result in RAS logic called from
>>>> the cxl_pci and CXL auxiliary drivers. This may need a library driver.
>>> I don't think auxiliary bus is a fundamental step forward from pcie
>>> portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
>>> but with all the same problems around how to distribute interrupt
>>> services to different interested parties.
>>> So I think notifiers are interesting from the perspective of a software
>>> hack to enable interrupt distribution. However, given that dynamic MSI-X
>>> support is within reach I am interested in exploring that path and
>>> mandating that archs that want to handle CXL protocol errors natively
>>> need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
>>> native protocol error handling support via CXL _OSC.
>>> In other words, I expect native dynamic MSI-X support is more
>>> maintainable in the sense of keeping all the code in one notification
>>> domain.
>>>> 4.) Using a portdrv notifier chain/callback for CIE/UIE
>>>> (Implemented in this patchset)
>>>> This solution uses a portdrv atomic chain notifier and a cxl_pci
>>>> callback to handle and log CXL port RAS errors.
>>> Oh, I will need to look at the cxl_pci tie-in for this, I was
>>> expecting cxl_pci only gets involved in the RCH case because the port
>>> and the endpoint are one and the same object. In the VH case I would only
>>> expect cxl_pci to get involved for its own observed protocol errors, not
>>> those reported upstream from that endpoint.
>> The CXL port error handling needs a place to live with few options at the moment.
>> Where do you want the CXL port error handlers to reside?
> I need to go understand exactly why cxl_pci is involved in this current
> proposal, but I was thinking it is probably more natural for cxl_port to
> have error handlers.

Ok. I agree, cxl_port is a better location for the handlers.
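
For reference, the notifier-chain dispatch discussed under solution #4 can be
modeled in plain userspace C. This is only a minimal sketch under stated
assumptions, not code from the patchset: every demo_* name is hypothetical, the
chain is a simple linked list rather than the kernel's atomic notifier API, and
the "RAS status" is a plain integer standing in for the real register block.
It shows a port-side handler registering a callback and being invoked when an
internal error event is reported.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical return codes, loosely modeled on notifier semantics. */
#define DEMO_NOTIFY_DONE 0	/* callback did not handle the event */
#define DEMO_NOTIFY_OK   1	/* callback handled the event */

/* Hypothetical event types: correctable/uncorrectable internal error. */
enum demo_event { DEMO_AER_CIE = 1, DEMO_AER_UIE = 2 };

struct demo_notifier {
	int (*call)(struct demo_notifier *nb, unsigned long event, void *data);
	struct demo_notifier *next;
};

struct demo_chain {
	struct demo_notifier *head;
};

/* Stand-in for a port device; ras_status models a RAS status register. */
struct demo_port {
	const char *name;
	unsigned int ras_status;
};

/* Add a callback to the front of the chain. */
void demo_chain_register(struct demo_chain *chain, struct demo_notifier *nb)
{
	assert(chain && nb && nb->call);
	nb->next = chain->head;
	chain->head = nb;
}

/* Invoke every callback; return how many reported the event as handled. */
int demo_chain_call(struct demo_chain *chain, unsigned long event, void *data)
{
	int handled = 0;

	for (struct demo_notifier *nb = chain->head; nb; nb = nb->next)
		if (nb->call(nb, event, data) == DEMO_NOTIFY_OK)
			handled++;
	return handled;
}

/* Hypothetical port-side handler: log and clear any pending RAS status. */
int demo_cxl_port_handler(struct demo_notifier *nb, unsigned long event,
			  void *data)
{
	struct demo_port *port = data;

	(void)nb;
	if (!port->ras_status)
		return DEMO_NOTIFY_DONE;

	printf("%s: event %lu, RAS status %#x\n",
	       port->name, event, port->ras_status);
	port->ras_status = 0;	/* model write-to-clear of the status */
	return DEMO_NOTIFY_OK;
}
```

A caller would register demo_cxl_port_handler once with demo_chain_register()
and then report each event through demo_chain_call(), passing the affected
port as the data pointer. Moving the handler from cxl_pci to cxl_port would, in
this model, only change which module owns the callback, not the dispatch path.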
