Re: [PATCH] irqchip, gicv3-its, numa: Workaround for Cavium ThunderX erratum 23144

From: Marc Zyngier
Date: Mon Aug 24 2015 - 08:45:47 EST


On 24/08/15 13:30, Ganapatrao Kulkarni wrote:
> Hi Marc,
>
> thanks for the review comments.
>
> On Mon, Aug 24, 2015 at 3:47 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>> Hi Robert,
>>
>> Just came back from the Seattle madness, so picking up patches in
>> reverse order... ;-)
>>
>> On 22/08/15 14:10, Robert Richter wrote:
>>> The patch below adds a workaround for gicv3 in a numa environment. It
>>> is on top of my recent gicv3 errata patch submission v4 and Ganapat's
>>> arm64 numa patches for devicetree v5.
>>>
>>> Please comment.
>>>
>>> Thanks,
>>>
>>> -Robert
>>>
>>>
>>>
>>> From c432820451a46b8d1e299b8bfbfcdcb3b75340e7 Mon Sep 17 00:00:00 2001
>>> From: Ganapatrao Kulkarni <gkulkarni@xxxxxxxxxxxxxxxxxx>
>>> Date: Wed, 19 Aug 2015 23:40:05 +0530
>>> Subject: [PATCH] irqchip, gicv3-its, numa: Workaround for Cavium ThunderX erratum
>>> 23144
>>>
>>> This implements a workaround for gicv3-its erratum 23144 applicable
>>> for Cavium's ThunderX multinode systems.
>>>
>>> The erratum fixes the hang of ITS SYNC command by avoiding inter node
>>> io and collections/cpu mapping. This fix is only applicable for
>>> Cavium's ThunderX dual-socket platforms.
>>
>> Can you please elaborate on this? I can't see any reference to the SYNC
>> command there. Is that a case of an ITS not being able to route LPIs to
>> cores on another socket?
> we were seeing mapc command failing when we were mapping its of node0
> with collections of node1(vice-versa).

There is no such thing as "collection of node1". There are collections
mapped to redistributors.

> we found sync was timing out, which is issued post mapc(also for mapvi
> and movi).
> Yes this errata limits the routing of inter-node LPIs.

Please update the commit message to reflect the actual issue.

>
>>
>> I really need more details to understand this patch (please use short
>> sentences, I'm still in a different time zone).
>>
>>>
>>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@xxxxxxxxxxxxxxxxxx>
>>> [ rric: Reworked errata code, added helper functions, updated commit
>>> message. ]
>>>
>>> Signed-off-by: Robert Richter <rrichter@xxxxxxxxxx>
>>> ---
>>> arch/arm64/Kconfig | 14 +++++++++++
>>> drivers/irqchip/irq-gic-common.c | 5 ++--
>>> drivers/irqchip/irq-gic-v3-its.c | 54 ++++++++++++++++++++++++++++++++++------
>>> 3 files changed, 64 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 3809187ed653..b92b7b70b29b 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -421,6 +421,20 @@ config ARM64_ERRATUM_845719
>>>
>>> If unsure, say Y.
>>>
>>> +config CAVIUM_ERRATUM_22375
>>> + bool "Cavium erratum 22375, 24313"
>>> + depends on NUMA
>>> + default y
>>> + help
>>> + Enable workaround for erratum 22375, 24313.
>>> +
>>> +config CAVIUM_ERRATUM_23144
>>> + bool "Cavium erratum 23144"
>>> + depends on NUMA
>>> + default y
>>> + help
>>> + Enable workaround for erratum 23144.
>>> +
>>> config CAVIUM_ERRATUM_23154
>>> bool "Cavium erratum 23154: Access to ICC_IAR1_EL1 is not sync'ed"
>>> depends on ARCH_THUNDER
>>> diff --git a/drivers/irqchip/irq-gic-common.c b/drivers/irqchip/irq-gic-common.c
>>> index ee789b07f2d1..1dfce64dbdac 100644
>>> --- a/drivers/irqchip/irq-gic-common.c
>>> +++ b/drivers/irqchip/irq-gic-common.c
>>> @@ -24,11 +24,12 @@
>>> void gic_check_capabilities(u32 iidr, const struct gic_capabilities *cap,
>>> void *data)
>>> {
>>> - for (; cap->desc; cap++) {
>>> + for (; cap->init; cap++) {
>>> if (cap->iidr != (cap->mask & iidr))
>>> continue;
>>> cap->init(data);
>>> - pr_info("%s\n", cap->desc);
>>> + if (cap->desc)
>>> + pr_info("%s\n", cap->desc);
>>
>> No. I really want to see what errata are applied when I look at a kernel
>> log.
> sorry, did not understood your comment, it is still printed using cap->desc.

Yes, but you are making desc optional, and I don't want it to be
optional. I want the kernel to scream that we're using an erratum
workaround so that we can understand what is happening when reading a
kernel log.

>>> }
>>> }
>>>
>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
>>> index 4bb135721e72..666be39f13a9 100644
>>> --- a/drivers/irqchip/irq-gic-v3-its.c
>>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>>> @@ -43,7 +43,8 @@
>>> #include "irqchip.h"
>>>
>>> #define ITS_FLAGS_CMDQ_NEEDS_FLUSHING (1ULL << 0)
>>> -#define ITS_FLAGS_CAVIUM_THUNDERX (1ULL << 1)
>>> +#define ITS_WORKAROUND_CAVIUM_22375 (1ULL << 1)
>>> +#define ITS_WORKAROUND_CAVIUM_23144 (1ULL << 2)
>>>
>>> #define RDIST_FLAGS_PROPBASE_NEEDS_FLUSHING (1 << 0)
>>>
>>> @@ -76,6 +77,7 @@ struct its_node {
>>> struct list_head its_device_list;
>>> u64 flags;
>>> u32 ite_size;
>>> + int numa_node;
>>> };
>>>
>>> #define ITS_ITT_ALIGN SZ_256
>>> @@ -609,11 +611,18 @@ static void its_eoi_irq(struct irq_data *d)
>>> static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
>>> bool force)
>>> {
>>> - unsigned int cpu = cpumask_any_and(mask_val, cpu_online_mask);
>>> + unsigned int cpu;
>>> struct its_device *its_dev = irq_data_get_irq_chip_data(d);
>>> struct its_collection *target_col;
>>> u32 id = its_get_event_id(d);
>>>
>>> + if (its_dev->its->flags & ITS_WORKAROUND_CAVIUM_23144) {
>>> + cpu = cpumask_any_and(mask_val,
>>> + cpumask_of_node(its_dev->its->numa_node));
>>
>> I suppose cpumask_of_node returns only the *online* cores of a given
>> node, right?
> yes.
>>
>>> + } else {
>>> + cpu = cpumask_any_and(mask_val, cpu_online_mask);
>>> + }
>>> +
>>> if (cpu >= nr_cpu_ids)
>>> return -EINVAL;
>>>
>>> @@ -845,7 +854,7 @@ static int its_alloc_tables(struct its_node *its)
>>> u64 typer;
>>> u32 ids;
>>>
>>> - if (its->flags & ITS_FLAGS_CAVIUM_THUNDERX) {
>>> + if (its->flags & ITS_WORKAROUND_CAVIUM_22375) {
>>> /*
>>> * erratum 22375: only alloc 8MB table size
>>> * erratum 24313: ignore memory access type
>>> @@ -1093,6 +1102,11 @@ static void its_cpu_init_lpis(void)
>>> dsb(sy);
>>> }
>>>
>>> +static inline int numa_node_id_early(void)
>>> +{
>>> + return MPIDR_AFFINITY_LEVEL(read_cpuid_mpidr(), 2);
>>> +}
>>> +
>>> static void its_cpu_init_collection(void)
>>> {
>>> struct its_node *its;
>>> @@ -1104,6 +1118,11 @@ static void its_cpu_init_collection(void)
>>> list_for_each_entry(its, &its_nodes, entry) {
>>> u64 target;
>>>
>>> + /* avoid cross node core and its mapping */
>>> + if ((its->flags & ITS_WORKAROUND_CAVIUM_23144) &&
>>> + its->numa_node != numa_node_id_early())
>>> + continue;
>>> +
>>
>> Argh. This is horrible. You really need some topology bindings to
>> describe your system instead of hardcoding some random level of
>> affinity. The next time someone is going to come up with a similarly
>> broken system, they will have to reinvent that wheel.
> thanks for the suggestion, we will use cpu_topology[cpuid].cluster_id
> instead of function numa_node_id_early()

Make sure you can relate the ITS to it (have a way to find out which
node a given ITS belong to, without playing tricks with the memory map.

>>
>>> /*
>>> * We now have to bind each collection to its target
>>> * redistributor.
>>> @@ -1372,9 +1391,15 @@ static void its_irq_domain_activate(struct irq_domain *domain,
>>> {
>>> struct its_device *its_dev = irq_data_get_irq_chip_data(d);
>>> u32 event = its_get_event_id(d);
>>> + unsigned int cpu;
>>> +
>>> + if (its_dev->its->flags & ITS_WORKAROUND_CAVIUM_23144)
>>> + cpu = cpumask_first(cpumask_of_node(its_dev->its->numa_node));
>>> + else
>>> + cpu = cpumask_first(cpu_online_mask);
>>
>> Looks like this can be factored with the code you've added in set_affinity.
> we cant issue mapvi with cross-node mapping.

And? You have almost the same code twice. Surely you can devise a single
helper that holds this code.

>>>
>>> /* Bind the LPI to the first possible CPU */
>>> - its_dev->event_map.col_map[event] = cpumask_first(cpu_online_mask);
>>> + its_dev->event_map.col_map[event] = cpu;
>>>
>>> /* Map the GIC IRQ and event to the device */
>>> its_send_mapvi(its_dev, d->hwirq, event);
>>> @@ -1457,16 +1482,31 @@ static int its_force_quiescent(void __iomem *base)
>>> }
>>> }
>>>
>>> +static inline int its_get_node_thunderx(struct its_node *its)
>>> +{
>>> + return (its->phys_base >> 44) & 0x3;
>>
>> Why 3? Is that because you have provision for 4 sockets or what?
> yes.
>>
>>> +}
>>> +
>>> static void its_enable_cavium_thunderx(void *data)
>>> {
>>> - struct its_node *its = data;
>>> + struct its_node __maybe_unused *its = data;
>>>
>>> - its->flags |= ITS_FLAGS_CAVIUM_THUNDERX;
>>> +#ifdef CONFIG_CAVIUM_ERRATUM_22375
>>> + its->flags |= ITS_WORKAROUND_CAVIUM_22375;
>>> + pr_info("ITS: Enabling workaround for 22375, 24313\n");
>>> +#endif
>>> +
>>> +#ifdef CONFIG_CAVIUM_ERRATUM_23144
>>> + if (num_possible_nodes() > 1) {
>>> + its->numa_node = its_get_node_thunderx(its);
>>
>> I'd rather see numa_node being always initialized to something useful.
>> If you're adding numa support, why can't this be initialized via
>> standard topology bindings?
> IIUC, topology defines only cpu topology.

Well, welcome to a much more complex system where both your CPUs and
your IOs have some degree of affinity. This needs to be described
properly, and not hacked on the side.

Thanks,

M.
--
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/