Re: [PATCH] LoongArch: Fix cpu hotplug issue

From: Huacai Chen
Date: Mon Oct 14 2024 - 06:12:58 EST


On Mon, Oct 14, 2024 at 6:01 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
>
>
>
> On 2024/10/14 下午5:29, Huacai Chen wrote:
> > On Mon, Oct 14, 2024 at 5:12 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2024/10/14 下午4:23, Huacai Chen wrote:
> >>> On Mon, Oct 14, 2024 at 4:01 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
> >>>>
> >>>> Huacai,
> >>>>
> >>>> On 2024/10/14 下午3:39, Huacai Chen wrote:
> >>>>> On Mon, Oct 14, 2024 at 3:21 PM maobibo <maobibo@xxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Huacai,
> >>>>>>
> >>>>>> On 2024/10/14 下午3:05, Huacai Chen wrote:
> >>>>>>> Hi, Bibo,
> >>>>>>>
> >>>>>>> I'm a little confused, so please correct me if I'm wrong.
> >>>>>>>
> >>>>>>> On Mon, Oct 14, 2024 at 2:33 PM Bibo Mao <maobibo@xxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On LoongArch system, there are two places to set cpu numa node. One
> >>>>>>>> is in arch specified function smp_prepare_boot_cpu(), the other is
> >>>>>>>> in generic function early_numa_node_init(). The latter will overwrite
> >>>>>>>> the numa node information.
> >>>>>>>>
> >>>>>>>> However for hot-added cpu, cpu_logical_map() fails to its physical
> >>>>>>>> cpuid at beginning since it is not enabled in ACPI MADT table. So
> >>>>>>>> function early_cpu_to_node() also fails to get its numa node for
> >>>>>>>> hot-added cpu, and generic function early_numa_node_init() will
> >>>>>>>> overwrite incorrect numa node.
> >>>>>>> For hot-added cpus, we will call acpi_map_cpu() -->
> >>>>>>> acpi_map_cpu2node() --> set_cpuid_to_node(), and set_cpuid_to_node()
> >>>>>>> operates on __cpuid_to_node[]. So I think early_cpu_to_node() should
> >>>>>>> be correct?
> >>>>>>
> >>>>>> __cpuid_to_node[] is correct which is physical cpuid to numa node,
> >>>>>> however cpu_logical_map(cpu) is not set. It fails to get physical cpuid
> >>>>>> from logic cpu.
> >>>>>>
> >>>>>> int early_cpu_to_node(int cpu)
> >>>>>> {
> >>>>>> int physid = cpu_logical_map(cpu);
> >>>>>>
> >>>>>> <<<<<<<<<<< Here physid is -1.
> >>>>> early_cpu_to_node() is not supposed to be called after boot, and if it
> >>>> Which calls early_cpu_to_node() after boot?
> >>>>
> >>>>> is really needed, I think a better solution is:
> >>>>>
> >>>>> diff --git a/arch/loongarch/kernel/acpi.c b/arch/loongarch/kernel/acpi.c
> >>>>> index f1a74b80f22c..998cf45fd3b7 100644
> >>>>> --- a/arch/loongarch/kernel/acpi.c
> >>>>> +++ b/arch/loongarch/kernel/acpi.c
> >>>>> @@ -311,6 +311,8 @@ static int __ref acpi_map_cpu2node(acpi_handle
> >>>>> handle, int cpu, int physid)
> >>>>>
> >>>>> nid = acpi_get_node(handle);
> >>>>> if (nid != NUMA_NO_NODE) {
> >>>>> + __cpu_number_map[physid] = cpu;
> >>>>> + __cpu_logical_map[cpu] = physid;
> >>>> This does not solve the problem. The above has been done in function
> >>>> cpu = set_processor_mask(physid, ACPI_MADT_ENABLED);
> >>>>
> >>>> static int set_processor_mask(u32 id, u32 flags)
> >>>> {
> >>>> ...
> >>>> if (flags & ACPI_MADT_ENABLED) {
> >>>> num_processors++;
> >>>> set_cpu_present(cpu, true);
> >>>> __cpu_number_map[cpuid] = cpu;
> >>>> __cpu_logical_map[cpu] = cpuid;
> >>>> }
> >>>>
> >>>> The problem is that
> >>>> smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
> >>>> <<<<<<<<<<<<<<<<
> >>>> set_cpu_numa_node() is called in function smp_prepare_boot_cpu()
> >>>>
> >>>> early_numa_node_init();
> >>>>
> >>>> static void __init early_numa_node_init(void)
> >>>> {
> >>>> #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> >>>> #ifndef cpu_to_node
> >>>> int cpu;
> >>>>
> >>>> /* The early_cpu_to_node() should be ready here. */
> >>>> for_each_possible_cpu(cpu)
> >>>> set_cpu_numa_node(cpu, early_cpu_to_node(cpu));
> >>>> <<<<<<<<<<<<<<<<
> >>>> * however here early_cpu_to_node is -1, so that cpu_to_node(cpu) will
> >>>> always return -1 in late. *, which causes cpu hotadd problem.
> >>> Still confused. For ACPI_MADT_ENABLED cpus, everything is right after
> >>> early_numa_node_init(). For !ACPI_MADT_ENABLED cpus, cpu_to_node()
> >>> returns -1 after early_numa_node_init() and before hot-add, but if
> >>> acpi_map_cpu() do things right, cpu_to_node() should still work well
> >>> after hot-add.
> >> yes, if "_PXM" information for hot-add cpu handle exist, it works well.
> >>
> >> However if "_PXM" information does not exist, it falls back to legacy
> >> method from smp_prepare_boot_cpu(). However cpu_numa_node information is
> >> overwritten with -1 by later function early_numa_node_init().
> > OK, now I finally get the key point. But no _PXM should be treated as
> > a BIOS bug, right?
> Currently if no numa information is added in qemu command line, there
> will be no "_PXM" information for hot-added cpu. Such as for this command:
> qemu-system-loongarch64 -m 4096 -smp
> 1,maxcpus=4,sockets=1,cores=4,threads=1
> >
> > From comments we can see:
> >
> > * If possible cpus > present cpus here (e.g. some possible
> > * cpus will be added by cpu-hotplug later), for possible but
> > * not present cpus, early_cpu_to_node will return NUMA_NO_NODE,
> > * and we just map them to online nodes in round-robin way.
> > * Once hotplugged, new correct mapping will be built for them.
> >
> > This means even with this patch, cpu_to_node() can return a "valid"
> > node rather than NUMA_NO_NODE, but this round-robin node is still an
> > incorrect node.
> The round-robin node is not standard, may it is copied from x86, I do
> not know how to use it however. At least SRAT tables provides numa
> information only that there is not logical cpu allocated in SRAT table
> parsing. How about something like this?
This looks better, but maybe Jianmin has a best solution?

>
> diff --git a/arch/loongarch/kernel/acpi.c b/arch/loongarch/kernel/acpi.c
> index f1a74b80f22c..bb9fdd318998 100644
> --- a/arch/loongarch/kernel/acpi.c
> +++ b/arch/loongarch/kernel/acpi.c
> @@ -310,6 +310,12 @@ static int __ref acpi_map_cpu2node(acpi_handle
> handle, int cpu, int physid)
> int nid;
>
> nid = acpi_get_node(handle);
> + /*
> + * Fall back to srat numa node information if _PXM is not provided
The comments may be wrong? I think SRAT node information is equal to _PXM.

> + */
> + if (nid != NUMA_NO_NODE)
> + nid = __cpuid_to_node[physid];
> +
> if (nid != NUMA_NO_NODE) {
Maybe this condition can be removed now?

Huacai

> set_cpuid_to_node(physid, nid);
> node_set(nid, numa_nodes_parsed);
>
> Regards
> Bibo Mao
> >
> > Huacai
> >
> >>
> >> Regards
> >> Bibo Mao
> >>>
> >>> Huacai
> >>>>
> >>>> Regards
> >>>> Bibo Mao
> >>>>
> >>>>
> >>>>> set_cpuid_to_node(physid, nid);
> >>>>> node_set(nid, numa_nodes_parsed);
> >>>>> set_cpu_numa_node(cpu, nid);
> >>>>>
> >>>>> Huacai
> >>>>>
> >>>>>>
> >>>>>> if (physid < 0)
> >>>>>> return NUMA_NO_NODE;
> >>>>>>
> >>>>>> return __cpuid_to_node[physid];
> >>>>>> }
> >>>>>>
> >>>>>> Regards
> >>>>>> Bibo Mao
> >>>>>>>
> >>>>>>> Huacai
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Here static array __cpu_to_node and api set_early_cpu_to_node()
> >>>>>>>> is added, so that early_cpu_to_node is consistent with function
> >>>>>>>> cpu_to_node() for hot-added cpu.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Bibo Mao <maobibo@xxxxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>> arch/loongarch/include/asm/numa.h | 2 ++
> >>>>>>>> arch/loongarch/kernel/numa.c | 10 +++++++++-
> >>>>>>>> arch/loongarch/kernel/smp.c | 1 +
> >>>>>>>> 3 files changed, 12 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/arch/loongarch/include/asm/numa.h b/arch/loongarch/include/asm/numa.h
> >>>>>>>> index b5f9de9f102e..e8e6fcfb006a 100644
> >>>>>>>> --- a/arch/loongarch/include/asm/numa.h
> >>>>>>>> +++ b/arch/loongarch/include/asm/numa.h
> >>>>>>>> @@ -50,6 +50,7 @@ static inline void set_cpuid_to_node(int cpuid, s16 node)
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> extern int early_cpu_to_node(int cpu);
> >>>>>>>> +extern void set_early_cpu_to_node(int cpu, s16 node);
> >>>>>>>>
> >>>>>>>> #else
> >>>>>>>>
> >>>>>>>> @@ -57,6 +58,7 @@ static inline void early_numa_add_cpu(int cpuid, s16 node) { }
> >>>>>>>> static inline void numa_add_cpu(unsigned int cpu) { }
> >>>>>>>> static inline void numa_remove_cpu(unsigned int cpu) { }
> >>>>>>>> static inline void set_cpuid_to_node(int cpuid, s16 node) { }
> >>>>>>>> +static inline void set_early_cpu_to_node(int cpu, s16 node) { }
> >>>>>>>>
> >>>>>>>> static inline int early_cpu_to_node(int cpu)
> >>>>>>>> {
> >>>>>>>> diff --git a/arch/loongarch/kernel/numa.c b/arch/loongarch/kernel/numa.c
> >>>>>>>> index 84fe7f854820..62508aace644 100644
> >>>>>>>> --- a/arch/loongarch/kernel/numa.c
> >>>>>>>> +++ b/arch/loongarch/kernel/numa.c
> >>>>>>>> @@ -34,6 +34,9 @@ static struct numa_meminfo numa_meminfo;
> >>>>>>>> cpumask_t cpus_on_node[MAX_NUMNODES];
> >>>>>>>> cpumask_t phys_cpus_on_node[MAX_NUMNODES];
> >>>>>>>> EXPORT_SYMBOL(cpus_on_node);
> >>>>>>>> +static s16 __cpu_to_node[NR_CPUS] = {
> >>>>>>>> + [0 ... CONFIG_NR_CPUS - 1] = NUMA_NO_NODE
> >>>>>>>> +};
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>> * apicid, cpu, node mappings
> >>>>>>>> @@ -117,11 +120,16 @@ int early_cpu_to_node(int cpu)
> >>>>>>>> int physid = cpu_logical_map(cpu);
> >>>>>>>>
> >>>>>>>> if (physid < 0)
> >>>>>>>> - return NUMA_NO_NODE;
> >>>>>>>> + return __cpu_to_node[cpu];
> >>>>>>>>
> >>>>>>>> return __cpuid_to_node[physid];
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> +void set_early_cpu_to_node(int cpu, s16 node)
> >>>>>>>> +{
> >>>>>>>> + __cpu_to_node[cpu] = node;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> void __init early_numa_add_cpu(int cpuid, s16 node)
> >>>>>>>> {
> >>>>>>>> int cpu = __cpu_number_map[cpuid];
> >>>>>>>> diff --git a/arch/loongarch/kernel/smp.c b/arch/loongarch/kernel/smp.c
> >>>>>>>> index 9afc2d8b3414..998668be858c 100644
> >>>>>>>> --- a/arch/loongarch/kernel/smp.c
> >>>>>>>> +++ b/arch/loongarch/kernel/smp.c
> >>>>>>>> @@ -512,6 +512,7 @@ void __init smp_prepare_boot_cpu(void)
> >>>>>>>> set_cpu_numa_node(cpu, node);
> >>>>>>>> else {
> >>>>>>>> set_cpu_numa_node(cpu, rr_node);
> >>>>>>>> + set_early_cpu_to_node(cpu, rr_node);
> >>>>>>>> rr_node = next_node_in(rr_node, node_online_map);
> >>>>>>>> }
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> base-commit: 6485cf5ea253d40d507cd71253c9568c5470cd27
> >>>>>>>> --
> >>>>>>>> 2.39.3
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>