Re: [PATCH] irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4

From: Marc Zyngier
Date: Sun Mar 12 2023 - 09:43:15 EST


On Fri, 10 Mar 2023 14:16:34 +0000,
Shanker Donthineni <sdonthineni@xxxxxxxxxx> wrote:
>
> Hi Marc,
>
> On 3/7/23 02:32, Marc Zyngier wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Mon, 06 Mar 2023 01:31:48 +0000,
> > Shanker Donthineni <sdonthineni@xxxxxxxxxx> wrote:
> >>
> >> The purpose of this patch is to address the T241 erratum T241-FABRIC-4,
> >> which causes unexpected behavior in the GIC when multiple transactions
> >
> > nit: "The purpose of this patch" is superfluous. Instead, write
> > something like:
> >
> > "The T241 platform suffers from the T241-FABRIC-4 erratum which
> > causes..."
> >
> I'll fix in v2 patch.
>
> >> are received simultaneously from different sources. This hardware issue
> >> impacts NVIDIA server platforms that use more than two T241 chips
> >> interconnected. Each chip has support for 320 {E}SPIs.
> >>
> >> This issue occurs when multiple packets from different GICs are
> >> incorrectly interleaved at the target chip. The erratum text below
> >> specifies exactly what can cause multiple transfer packets susceptible
> >> to interleaving and GIC state corruption. GIC state corruption can
> >> lead to a range of problems, including kernel panics, and unexpected
> >> behavior.
> >>
> >> From the erratum text:
> >> "In some cases, inter-socket AXI4 Stream packets with multiple
> >> transfers, may be interleaved by the fabric when presented to ARM
> >> Generic Interrupt Controller. GIC expects all transfers of a packet
> >> to be delivered without any interleaving.
> >>
> >> The following GICv3 commands may result in multiple transfer packets
> >> over inter-socket AXI4 Stream interface:
> >> - Register reads from GICD_I* and GICD_N*
> >> - Register writes to 64-bit GICD registers other than GICD_IROUTERn*
> >> - ITS command MOVALL
> >
> > Does is also affect cross-chip traffic such as SPI deactivation?
> >
> No, it is not impacted.
>
> >>
> >> Multiple commands in GICv4+ utilize multiple transfer packets,
> >> including VMOVP, VMOVI and VMAPP.
> >>
> >> This issue impacts system configurations with more than 2 sockets,
> >> that require multi-transfer packets to be sent over inter-socket
> >> AXI4 Stream interface between GIC instances on different sockets.
> >> GICv4 cannot be supported. GICv3 SW model can only be supported
> >> with the workaround. Single and Dual socket configurations are not
> >> impacted by this issue and support GICv3 and GICv4."
> >
> > Do you have a public link to this erratum? This is really something
> > that we should be go back to when changing things in the GIC code
> > (should we ever use MOVALL, for example).
> >
> https://developer.nvidia.com/docs/t241-fabric-4/nvidia-t241-fabric-4-errata.pdf

Great. Please add this to the commit message and a comment next to the
workaround code.

[...]

> >> +static inline void __iomem *gic_dist_base_read_alias(irq_hw_number_t intid)
> >> +{
> >> + struct dist_base_alias *base_alias;
> >> + int i;
> >> +
> >> + if (static_branch_unlikely(&gic_nvidia_t241_erratum)) {
> >> + base_alias = gic_data.base_read_aliases;
> >> + for (i = 0; i < gic_data.nr_dist_base_aliases; i++) {
> >> + if (base_alias->base &&
> >> + (intid >= base_alias->intid_start) &&
> >> + (intid <= base_alias->intid_end)) {
> >> + return base_alias->base;
> >> + }
> >> + base_alias++;
> >> + }
> >> + }
> >
> > Each distributor has the exact same number of SPIs. So why isn't this
> > just a division that gives you a distributor number?
> >
>
> I considered creating a generic function that could potentially be
> utilized in the future for other Workarounds (WARs).
>
> I'll change to this in v2.
>
> static inline void __iomem *gic_dist_base_alias(irq_hw_number_t intid)
> {
> u32 chip;
>
> if (static_branch_unlikely(&gic_nvidia_t241_erratum)) {
> /**
> * {E}SPI mappings for all 4 chips
> * Chip0 = 32-351
> * Chip1 = 52-671

s/52/352/, right?

> * Chip2 = 672-991
> * Chip3 = 4096-4415
> */
> switch (__get_intid_range(intid)) {
> case SPI_RANGE:
> chip = (intid - 32) / 320;
> break;
> case ESPI_RANGE:
> chip = 3;
> break;
> default:
> unreachable();
> }
> BUG_ON(!t241_dist_base_alias[chip]);

You can drop this BUG_ON(), and replace it with on at probe time.

> return t241_dist_base_alias[chip];
> }
>
> return gic_data.dist_base;
> }

Yup, that's much better.

>
> >> +
> >> + return gic_data.dist_base;
> >> +}
> >> +
> >> static inline void __iomem *gic_dist_base(struct irq_data *d)
> >> {
> >> switch (get_intid_range(d)) {
> >> @@ -346,7 +377,7 @@ static int gic_peek_irq(struct irq_data *d, u32 offset)
> >> if (gic_irq_in_rdist(d))
> >> base = gic_data_rdist_sgi_base();
> >> else
> >> - base = gic_data.dist_base;
> >> + base = gic_dist_base_read_alias(irqd_to_hwirq(d));
> >>
> >> return !!(readl_relaxed(base + offset + (index / 32) * 4) & mask);
> >> }
> >> @@ -580,6 +611,7 @@ static int gic_set_type(struct irq_data *d, unsigned int type)
> >> enum gic_intid_range range;
> >> unsigned int irq = gic_irq(d);
> >> void __iomem *base;
> >> + void __iomem *base_read_alias;
> >> u32 offset, index;
> >> int ret;
> >>
> >> @@ -594,14 +626,17 @@ static int gic_set_type(struct irq_data *d, unsigned int type)
> >> type != IRQ_TYPE_LEVEL_HIGH && type != IRQ_TYPE_EDGE_RISING)
> >> return -EINVAL;
> >>
> >> - if (gic_irq_in_rdist(d))
> >> + if (gic_irq_in_rdist(d)) {
> >> base = gic_data_rdist_sgi_base();
> >> - else
> >> + base_read_alias = base;
> >> + } else {
> >> base = gic_data.dist_base;
> >> + base_read_alias = gic_dist_base_read_alias(irqd_to_hwirq(d));
> >> + }
> >>
> >> offset = convert_offset_index(d, GICD_ICFGR, &index);
> >> -
> >> - ret = gic_configure_irq(index, type, base + offset, NULL);
> >> + ret = gic_configure_irq(index, type, base + offset, NULL,
> >> + base_read_alias + offset);
> >> if (ret && (range == PPI_RANGE || range == EPPI_RANGE)) {
> >> /* Misconfigured PPIs are usually not fatal */
> >> pr_warn("GIC: PPI INTID%d is secure or misconfigured\n", irq);
> >> @@ -1719,6 +1754,70 @@ static bool gic_enable_quirk_hip06_07(void *data)
> >> return false;
> >> }
> >>
> >> +static bool gic_enable_quirk_nvidia_t241(void *data)
> >> +{
> >> +#ifdef CONFIG_ACPI
> >> + struct dist_base_alias *base_alias;
> >> + struct acpi_table_header *madt;
> >> + int i, intid, nchips = 0;
> >> + acpi_status status;
> >> + phys_addr_t phys;
> >> +
> >> + status = acpi_get_table(ACPI_SIG_MADT, 0, &madt);
> >> + if (ACPI_FAILURE(status))
> >> + return false;
> >> +
> >> + /* Check NVIDIA OEM ID */
> >> + if (memcmp(madt->oem_id, "NVIDIA", 6)) {
> >
> > What guarantees do we have that this string will always be present?
> > "oem_id" is usually updated to reflect the integrator, not the
> > silicon vendor.
> >
>
> Our company provides UEFI firmware porting guidelines to OEMs that
> ensure the MADT table generation, along with the ACPI header, remains
> unaltered. Thanks to your input, we are now looking into alternative
> approaches for identifying platform types and removing our dependence
> on ACPI. Specifically, we are interested in utilizing the SMCCC API
> to detect the CHIP. Determine whether the individual chips are present
> by referring to the GICR regions described in MADT.

Seems like a reasonable alternative.

>
>
> >> + acpi_put_table(madt);
> >> + return false;
> >> + }
> >> +
> >> + /* Find the number of chips based on OEM_TABLE_ID */
> >> + if ((!memcmp(madt->oem_table_id, "T241x3", 6)) ||
> >> + (!memcmp(madt->oem_table_id, "T241c3", 6))) {
> >> + nchips = 3;
> >> + } else if ((!memcmp(madt->oem_table_id, "T241x4", 6)) ||
> >> + (!memcmp(madt->oem_table_id, "T241c4", 6))) {
> >> + nchips = 4;
> >> + }
> >
> > Same question for these. This seems pretty fragile.
> >
> This can be avoid for the SMCCC based platform detection.
>
> >> +
> >> + acpi_put_table(madt);
> >> + if (nchips < 3)
> >> + return false;
> >> +
> >> + base_alias = kmalloc_array(nchips, sizeof(*base_alias),
> >> + GFP_KERNEL | __GFP_ZERO);
> >
> > You are fully initialising the structures, right? So why the
> > __GFP_ZERO?
> Yes, not needed. will use the staic array since size is small after
> removing INTID_start/end feilds.
>
> >
> >> + if (!base_alias)
> >> + return false;
> >> +
> >> + gic_data.base_read_aliases = base_alias;
> >> + gic_data.nr_dist_base_aliases = nchips;
> >> +
> >> + /**
> >> + * Setup GICD alias and {E}SPIs range for each chip
> >> + * {E}SPI blocks mappings:
> >> + * Chip0 = 00-09
> >> + * Chip1 = 10-19
> >> + * Chip2 = 20-29
> >> + * Chip3 = 30-39
> >
> > What are these ranges? From the code below, I can (sort of) guess that
> > each chip has 10 registers in the SPI/ESPI range, with chips 0-1
> > dealing with SPIs, and chips 2-3 dealing with ESPIs.
> >
> > It would be a lot clearer if you indicated the actual INTID ranges.
> Agree.
>
> >
> >> + */
> >> + for (i = 0; i < nchips; i++, base_alias++) {
> >> + phys = ((1ULL << 44) * i) | 0x23580000;
> >
> > Where is this address coming from? Can it be inferred from the MADT?
> > It would also be a lot more readable if written as:
> >
> > #define CHIP_MASK GENMASK_ULL(45, 44)
> > #define CHIP_ALIAS_BASE 0x23580000
> >
> I'll define macros for constants. Use the offset, global GICD-PHYS,
> and CHIP number to get the alias addressses.
>
> #define T241_CHIPN_MASK GENMASK_ULL(45, 44)
> #define T241_CHIP_GICDA_OFFSET 0x1580000
>
> phys = gic_data.dist_phys_base + T241_CHIP_GICDA_OFFSET;
> phys |= FIELD_PREP(T241_CHIPN_MASK, i);
>
>
> > phys = CHIP_ALIAS_BASE;
> > phys |= FIELD_PREP(CHIP_MASK, i);
> >
> >> + base_alias->base = ioremap(phys, SZ_64K);
> >> + WARN_ON(!base_alias->base);
> >> +
> >> + intid = i < 3 ? 32 + i * 10 * 32 : ESPI_BASE_INTID;
> >> + base_alias->intid_start = intid;
> >> + base_alias->intid_end = intid + 10 * 32 - 1;
> >
> > This really is obfuscated. And it also shows that we really don't need
> > the INTID ranges in the data structure. You can easily get to the chip
> > number with something like:
> ACK
>
> >
> > switch (__get_intid_range(intid)) {
> > case SPI_RANGE:
> > chip = (intid - 32) / 320;
> > break;
> > case ESPI_RANGE:
> > chip = (intid - ESPI_BASE_INTID) / 320;
> > break;
> > }
> >
> > alias = base_alias[chip];
> >
> > Bonus point if you add a #define for the magic numbers.
> >
> ACK
>
> >> + }
> >> + static_branch_enable(&gic_nvidia_t241_erratum);
> >> + return true;
> >> +#else
> >> + return false;
> >> +#endif
> >> +}
> >
> > How about moving the whole function under #ifdef CONFIG_ACPI?
> >
>
> If you're not satisfied with SMCCC-based platform detection, I'll
> make the necessary changes. We value your input and would appreciate
> your opinion on whether we should use SMCCC or ACPI-OEM-ID based
> platform detection. Our preference is to go with SMC if that's
> agreeable to you.

If you can guarantee that this FW-based discovery will always be
available, then this is a more robust way of doing it.

>
>
> #define SMCCC_JEP106_BANK_ID(v) FIELD_GET(GENMASK(30, 24), (v))
> #define SMCCC_JEP106_ID_CODE(v) FIELD_GET(GENMASK(22, 16), (v))
> #define SMCCC_JEP106_SOC_ID(v) FIELD_GET(GENMASK(15, 0), (v))
>
> #define JEP106_NVIDIA_BANK_ID 0x3
> #define JEP106_NVIDIA_ID_CODE 0x6b
> #define T241_CHIPN_MASK GENMASK_ULL(45, 44)
> #define T241_CHIP_GICDA_OFFSET 0x1580000
> #define T241_CHIP_ID 0x241
>
> static bool gic_enable_quirk_nvidia_t241(void *data)
> {
> unsigned long chip_bmask = 0;
> struct arm_smccc_res res;
> phys_addr_t phys;
> u32 i;
>
> if ((arm_smccc_get_version() < ARM_SMCCC_VERSION_1_2) ||
> (arm_smccc_1_1_get_conduit() == SMCCC_CONDUIT_NONE)) {
> return false;
> }
>
> arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
> ARM_SMCCC_ARCH_SOC_ID, &res);
> if ((s32)res.a0 < 0)
> return false;
>
> arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_SOC_ID, 0, &res);
> if ((s32)res.a0 < 0)
> return false;

Most of this should probably directly come from the soc_id
infrastructure. It would need to probe early and expose the low-level
data.

>
> /* Check JEP106 code for NVIDIA T241 chip (036b:0241) */
> if ((SMCCC_JEP106_BANK_ID(res.a0) != JEP106_NVIDIA_BANK_ID) ||
> (SMCCC_JEP106_ID_CODE(res.a0) != JEP106_NVIDIA_ID_CODE) ||
> (SMCCC_JEP106_SOC_ID(res.a0) != T241_CHIP_ID)) {
> return false;
> }
>
> /* Find the chips based on GICR regions PHYS addr */
> for (i = 0; i < gic_data.nr_redist_regions; i++) {
> chip_bmask |= BIT(FIELD_GET(T241_CHIPN_MASK,
> gic_data.redist_regions[i].phys_base));
> }
>
> if (hweight32(chip_bmask) < 3)
> return false;
>
> /* Setup GICD alias regions */
> for (i = 0; i < ARRAY_SIZE(t241_dist_base_alias); i++) {
> if (chip_bmask & BIT(i)) {
> phys = gic_data.dist_phys_base + T241_CHIP_GICDA_OFFSET;
> phys |= FIELD_PREP(T241_CHIPN_MASK, i);
> t241_dist_base_alias[i] = ioremap(phys, SZ_64K);
> WARN_ON(!t241_dist_base_alias[i]);
> }
> }
> static_branch_enable(&gic_nvidia_t241_erratum);
> return true;
> }

Thanks,

M.

--
Without deviation from the norm, progress is not possible.