Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

From: Terry Bowman
Date: Tue Oct 22 2024 - 09:50:43 EST


Hi Dan,

On 10/21/24 20:53, Dan Williams wrote:
> Terry Bowman wrote:
>> CXL protocol errors are reported to the OS through PCIe correctable and
>> uncorrectable internal errors. However, since CXL PCIe port devices
>> are currently bound to the portdrv driver, there is no mechanism to
>> notify the CXL driver, which is necessary for proper logging and
>> handling.
>>
>> To address this, introduce CXL PCIe port error callbacks along with
>> register/unregister and accessor functions. The callbacks will be
>> invoked by the AER driver in the case protocol errors are reported by
>> a CXL port device.
>>
>> The AER driver callbacks will be used in future patches implementing
>> CXL PCIe port error handling.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>> ---
>> drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++
>> include/linux/aer.h | 14 ++++++++++++++
>> 2 files changed, 36 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 13b8586924ea..a9792b9576b4 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -50,6 +50,8 @@ struct aer_rpc {
>> DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
>> };
>>
>> +static struct cxl_port_err_hndlrs cxl_port_hndlrs;
>
> I think this can afford to splurge on a few more letters and make this
>
> static struct cxl_port_error_handlers cxl_port_error_handlers;
>
>

Ok.

>> +
>> /* AER stats for the device */
>> struct aer_stats {
>>
>> @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev,
>> struct aer_err_info *info) { }
>> #endif
>>
>> +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
>> +{
>> + cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected;
>> + cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL);
>> +
>> +void unregister_cxl_port_hndlrs(void)
>> +{
>> + cxl_port_hndlrs.error_detected = NULL;
>> + cxl_port_hndlrs.cor_error_detected = NULL;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL);
>> +
>> +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
>> +{
>> + return &cxl_port_hndlrs;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL);
>
> I guess I will need to go deeper into the code, but I would not have
> expected that new registration interfaces are needed. Each 'struct
> pci_driver' could optionally include CXL error handlers alongside their
> PCIe error handlers and when CXL AER errors are broadcast only the CXL
> handlers are invoked. I.e. the registration is something like:
>
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 6af5e0425872..42db26195bda 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
> .shutdown = pcie_portdrv_shutdown,
>
> .err_handler = &pcie_portdrv_err_handler,
> + .cxl_err_handler = &cxl_portdrv_err_handler,
>
> .driver_managed_dma = true,

Ok. I'm thinking of adding a 'pci_driver::cxl_err_handler' member of type
'struct pci_error_handlers'.

'struct pci_error_handlers' also contains slot_reset(), resume(), and mmio_enabled()
function pointers, which PCIe recovery uses when they are provided. The plan is for CXL
devices to panic on both fatal and non-fatal uncorrectable errors (UCE), but it might
still be good to use the 'struct pci_error_handlers' type in case the other handlers
are needed in the future. It also makes the logic that looks up and invokes the error
handlers common to the PCIe and CXL cases, so less code is required.
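
To illustrate the "common logic" point, here is a rough userspace sketch, not
kernel code. Every mock_* name is hypothetical and the structs are trimmed-down
stand-ins for the kernel ones. It only shows that when both handler sets share
one type, a single lookup helper can serve the PCIe and the CXL path:

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_pci_dev;

/* Stand-in for the kernel's error handler table (simplified signatures). */
struct mock_error_handlers {
	int  (*error_detected)(struct mock_pci_dev *dev);
	void (*cor_error_detected)(struct mock_pci_dev *dev);
	int  (*mmio_enabled)(struct mock_pci_dev *dev);
	int  (*slot_reset)(struct mock_pci_dev *dev);
	void (*resume)(struct mock_pci_dev *dev);
};

/* Stand-in for a driver that carries both handler sets. */
struct mock_pci_driver {
	const struct mock_error_handlers *err_handler;     /* PCIe errors */
	const struct mock_error_handlers *cxl_err_handler; /* assumed new member */
};

struct mock_pci_dev {
	const struct mock_pci_driver *driver;
	int pcie_hits; /* counters so the dispatch can be observed */
	int cxl_hits;
};

/* One lookup for both error classes; NULL if no driver is bound. */
static const struct mock_error_handlers *
mock_select_handlers(const struct mock_pci_dev *dev, bool cxl_error)
{
	if (!dev->driver)
		return NULL;
	return cxl_error ? dev->driver->cxl_err_handler
			 : dev->driver->err_handler;
}

/* Report a correctable error; returns true if a handler was invoked. */
static bool mock_report_cor_error(struct mock_pci_dev *dev, bool cxl_error)
{
	const struct mock_error_handlers *h = mock_select_handlers(dev, cxl_error);

	if (!h || !h->cor_error_detected)
		return false;
	h->cor_error_detected(dev);
	return true;
}

static void mock_pcie_cor(struct mock_pci_dev *dev) { dev->pcie_hits++; }
static void mock_cxl_cor(struct mock_pci_dev *dev)  { dev->cxl_hits++; }

/* Unset callbacks stay NULL, as the optional ones would in a real driver. */
static const struct mock_error_handlers mock_pcie_handlers = {
	.cor_error_detected = mock_pcie_cor,
};

static const struct mock_error_handlers mock_cxl_handlers = {
	.cor_error_detected = mock_cxl_cor,
};

static const struct mock_pci_driver mock_portdrv = {
	.err_handler     = &mock_pcie_handlers,
	.cxl_err_handler = &mock_cxl_handlers,
};
```

A device bound to mock_portdrv gets only mock_cxl_cor() invoked for a CXL
error and only mock_pcie_cor() for a PCIe error, while a device with no driver
bound is skipped.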

Regards,
Terry