Re: [PATCH v5 3/4] arm64: topology: Support SMT control on ACPI based system

From: Pierre Gondois
Date: Mon Sep 02 2024 - 03:45:03 EST


Hello Yicong

On 8/30/24 11:35, Yicong Yang wrote:
On 2024/8/29 20:46, Pierre Gondois wrote:
Hello Yicong,

On 8/29/24 09:40, Yicong Yang wrote:
Hi Pierre,

On 2024/8/27 23:40, Pierre Gondois wrote:
Hello Yicong,
IIUC we are looking for the maximum number of threads a CPU can have
in the platform. But is it actually possible to have a platform with
CPUs not having the same number of threads?


I was thinking of heterogeneous platforms, where for example some of the
cores have SMT and others don't (I don't have any such platform, but there
should be some for arm64).

For instance kernel/cpu.c::cpu_smt_num_threads_valid() will check
that the number of threads is either 1 or cpu_smt_max_threads (as
CONFIG_SMT_NUM_THREADS_DYNAMIC is not enabled for arm64). However
a (hypothetical) platform with:
- CPU0: 2 threads
- CPU1: 4 threads
should not be able to set the number of threads to 2, as
1 < 2 < cpu_smt_max_threads (cf. cpu_smt_num_threads_valid()).
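
(For reference, the validity check in kernel/cpu.c is roughly the following;
a paraphrased sketch rather than a verbatim quote:)

    static inline bool cpu_smt_num_threads_valid(unsigned int threads)
    {
            /* With dynamic control, any value in [1, cpu_smt_max_threads] is valid. */
            if (IS_ENABLED(CONFIG_SMT_NUM_THREADS_DYNAMIC))
                    return threads >= 1 && threads <= cpu_smt_max_threads;

            /* Otherwise only "one thread" or "all threads" is accepted. */
            return threads == 1 || threads == cpu_smt_max_threads;
    }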


It's a bit more complex. If we enable/disable SMT using the on/off control
then we won't have this problem: we online all the CPUs on enabling and
offline the CPUs which are not primary threads on disabling, without caring
about the thread number of each core.

Control using the thread number is introduced by CONFIG_SMT_NUM_THREADS_DYNAMIC
and is only enabled on powerpc. I think this interface is not enough to handle
the hypothetical case we assumed, since it assumes a homogeneous platform where
all the CPUs have the same thread number. But it should be fine for platforms
mixing SMT cores (with the same thread number) and non-SMT cores, since it does
indicate the real thread number of the SMT cores.

So if there is an assumption that all the CPUs have the same number of
threads, it should be sufficient to count the number of threads for one
CPU, right?


Naturally and conveniently one may think of using the thread number of CPU0 (the
boot CPU) in such a solution. But this will be wrong if CPU0 is non-SMT on a
heterogeneous platform :(

In the other (unlikely) case (i.e. CPUs can have various numbers of threads),
I think we should either:
- use your method and check that all the CPUs have the same number of threads
- get the number of threads for one CPU and check that all the CPUs are
    identical using the ACPI_PPTT_ACPI_IDENTICAL tag

I think this won't be simpler, since we still need to iterate over all the CPUs to
see if they have the same hetero_id (I checked how we're using this ACPI tag in
arm_acpi_register_pmu_device()). We can already know whether they're identical by
comparing the thread numbers, and do the update if necessary.
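
(For illustration, a hetero_id based check could look roughly like the sketch
below; acpi_cpus_identical() is a hypothetical helper, and as noted it still
has to iterate over every CPU:)

    /* Hypothetical: true if the PPTT tags every CPU as identical. */
    static bool __init acpi_cpus_identical(void)
    {
            int cpu, hetero_id, first_id = -1;

            for_each_possible_cpu(cpu) {
                    hetero_id = find_acpi_cpu_topology_hetero_id(cpu);
                    if (hetero_id < 0)
                            return false;

                    if (first_id < 0)
                            first_id = hetero_id;
                    else if (hetero_id != first_id)
                            return false;
            }

            return true;
    }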

- have a per-cpu cpu_smt_max_threads value ?

This should be unnecessary at the current stage if there are no platforms where
several kinds of cores have different thread numbers, as in your assumption;
we can enable CONFIG_SMT_NUM_THREADS_DYNAMIC on such platforms if they appear.
Otherwise, using a global cpu_smt_max_threads to enable the SMT control should
be enough. Does that make sense?

Yes, I agree with all the things you said:
- the current smt/control interface cannot handle asymmetric SMT platforms
- I am not aware of any such platform existing so far

I think it would still be good to check that all the CPUs have the same number of
threads. I tried to enable the SMT_NUM_THREADS_DYNAMIC Kconfig with the
patch attached (and to check that CPUs have the same number of threads). Feel free
to take what you like / ignore what is inside the attached patch, or let me know
if I should submit a part of it in a separate patchset,


I checked the changes. We can make this series the basic support, and a separate
series of yours the further support of SMT control on arm64:
- support thread_id on ACPI based arm64 platforms
- support partial SMT control by enabling CONFIG_SMT_NUM_THREADS_DYNAMIC

I'm not sure I fully understand what you mean. IIUC, I can send patches to
enable SMT_NUM_THREADS_DYNAMIC on top of a v6 of yours. I'll let you pick
the changes that you think make more sense in your series (if that makes
sense)? Please let me know if that works for you (or not).


some minor comments below.

----------------------------

    [RFC] arm64: topology: Enable CONFIG_SMT_NUM_THREADS_DYNAMIC

    - On arm64 ACPI systems, change the thread_id assignment to have
      increasing values starting from 0. This is already the case for DT
      based systems. Doing so allows us to uniformly identify the n-th
      thread of a given CPU.
    - Check that all CPUs have the same number of threads (for DT/ACPI)
    - Enable CONFIG_SMT_NUM_THREADS_DYNAMIC

    On a Tx2 with 256 CPUs, thread siblings being 0,32,64,96
    for socket0 and 128 + (0,32,64,96) for socket1:
    $ cd /sys/devices/system/cpu/smt/
    $ cat ../online
    0-255
    $ echo 2 > control
    $ cat ../online
    0-63,128-191
    $ echo 3 > control
    $ cat ../online
    0-95,128-223
    $ echo on > control
    $ cat ../online
    0-255


So it's a real SMT4 system; it does make sense to have this partial SMT control
support.

Yes right


diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index bd3bc2f5e0ec..1d8521483065 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -239,6 +239,7 @@ config ARM64
        select HAVE_GENERIC_VDSO
        select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
        select HOTPLUG_SMT if (SMP && HOTPLUG_CPU)
+       select SMT_NUM_THREADS_DYNAMIC if HOTPLUG_SMT
        select IRQ_DOMAIN
        select IRQ_FORCED_THREADING
        select KASAN_VMALLOC if KASAN
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 0f6ef432fb84..7dd211f81687 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -39,6 +39,14 @@ void update_freq_counters_refs(void);
 #define arch_scale_hw_pressure topology_get_hw_pressure
 #define arch_update_hw_pressure        topology_update_hw_pressure
+#ifdef CONFIG_SMT_NUM_THREADS_DYNAMIC
+#include <linux/cpu_smt.h>
+static inline bool topology_smt_thread_allowed(unsigned int cpu)
+{
+       return topology_thread_id(cpu) < cpu_smt_num_threads;
+}
+#endif
+
 #include <asm-generic/topology.h>
 #endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index f72e1e55b05e..a83babe19972 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -47,7 +47,9 @@ int __init parse_acpi_topology(void)
 {
        int thread_num, max_smt_thread_num = 1;
        struct xarray core_threads;
+       bool have_thread = false;
        int cpu, topology_id;
+       unsigned long i;
        void *entry;
        if (acpi_disabled)
@@ -61,6 +63,8 @@ int __init parse_acpi_topology(void)
                        return topology_id;
                if (acpi_cpu_is_threaded(cpu)) {
+                       have_thread = true;
+
                        cpu_topology[cpu].thread_id = topology_id;
                        topology_id = find_acpi_cpu_topology(cpu, 1);
                        cpu_topology[cpu].core_id   = topology_id;
@@ -69,9 +73,10 @@ int __init parse_acpi_topology(void)
                        if (!entry) {
                                xa_store(&core_threads, topology_id,
                                         xa_mk_value(1), GFP_KERNEL);
+                               cpu_topology[cpu].thread_id = 0;
                        } else {
                                thread_num = xa_to_value(entry);
-                               thread_num++;
+                               cpu_topology[cpu].thread_id = thread_num++;
                                xa_store(&core_threads, topology_id,
                                         xa_mk_value(thread_num), GFP_KERNEL);
@@ -86,8 +91,17 @@ int __init parse_acpi_topology(void)
                cpu_topology[cpu].cluster_id = topology_id;
                topology_id = find_acpi_cpu_topology_package(cpu);
                cpu_topology[cpu].package_id = topology_id;
+
+               pr_debug("CPU%u: package=0x%x cluster=0x%x core=0x%x thread=0x%x\n",
+                       cpu, cpu_topology[cpu].package_id, cpu_topology[cpu].cluster_id,
+                       cpu_topology[cpu].core_id, cpu_topology[cpu].thread_id);
        }
+       if (have_thread)

We could know this from max_smt_thread_num, so have_thread may be unnecessary.

Yes right, I will change that.


+               xa_for_each(&core_threads, i, entry)
+                       if (xa_to_value(entry) != max_smt_thread_num)
+                               pr_warn("Heterogeneous SMT topology not handled");

Wondering if we can avoid this 2nd loop. Greg expressed worries about looping twice on
large scale systems in v1. Maybe we could use the hetero_id and get the necessary
information in one loop; I need to think about it further.

I found these comments (not sure this is what you are referring to):
- https://lore.kernel.org/linux-arm-kernel/20231011103303.00002d8f@xxxxxxxxxx/
- https://lore.kernel.org/all/20230921150333.c2zqigs3xxwcg4ln@bogus/T/#m406c4c16871ca7ae431beb20feccfb5e14498452

I don't see another way to do it right now. Also, I think the complexity is
O(2n), which should be better than the original O(n**2),

Regards,
Pierre


Thanks.

+
        cpu_smt_set_num_threads(max_smt_thread_num, max_smt_thread_num);
        xa_destroy(&core_threads);
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 95513abd664f..20d7f5b72ddd 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -532,13 +532,15 @@ static int __init get_cpu_for_node(struct device_node *node)
        return cpu;
 }
-static void __init update_smt_num_threads(unsigned int num_threads)
+static void __init update_smt_num_threads(int num_threads)
 {
-       static unsigned int max_smt_thread_num = 1;
+       static int max_smt_thread_num = -1;
-       if (num_threads > max_smt_thread_num) {
+       if (max_smt_thread_num < 0) {
                max_smt_thread_num = num_threads;
                cpu_smt_set_num_threads(max_smt_thread_num, max_smt_thread_num);
+       } else if (num_threads != max_smt_thread_num) {
+               pr_warn("Heterogeneous SMT topology not handled");
        }
 }
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index b721f360d759..afdfdc64a0a1 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -87,6 +87,7 @@ extern struct cpu_topology cpu_topology[NR_CPUS];
 #define topology_physical_package_id(cpu)      (cpu_topology[cpu].package_id)
 #define topology_cluster_id(cpu)       (cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)          (cpu_topology[cpu].core_id)
+#define topology_thread_id(cpu)                (cpu_topology[cpu].thread_id)
 #define topology_core_cpumask(cpu)     (&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)  (&cpu_topology[cpu].thread_sibling)
 #define topology_cluster_cpumask(cpu)  (&cpu_topology[cpu].cluster_sibling)

----------------------------
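
(For context: kernel/cpu.c consumes the topology_smt_thread_allowed() hook added
in the RFC above roughly as follows when SMT_NUM_THREADS_DYNAMIC is enabled;
again a paraphrased sketch, not a verbatim quote:)

    static inline bool cpu_smt_thread_allowed(unsigned int cpu)
    {
    #ifdef CONFIG_SMT_NUM_THREADS_DYNAMIC
            /* The arch decides whether this thread index is within the limit. */
            return topology_smt_thread_allowed(cpu);
    #else
            return true;
    #endif
    }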


Regards,
Pierre


Thanks,
Yicong


Same comment for the DT patch. If there is an assumption that all CPUs have
the same number of threads, then update_smt_num_threads() could be called
only once, I suppose,
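
(A minimal sketch of that "call it once" idea for the DT path, assuming all
cores really do have the same number of threads; hypothetical, not the posted
patch:)

    static void __init update_smt_num_threads(unsigned int num_threads)
    {
            static bool done;

            /* All cores are assumed identical, so the first core is enough. */
            if (done)
                    return;

            done = true;
            cpu_smt_set_num_threads(num_threads, num_threads);
    }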

Regards,
Pierre


On 8/6/24 10:53, Yicong Yang wrote:
From: Yicong Yang <yangyicong@xxxxxxxxxxxxx>

For ACPI we'll build the topology from PPTT and we cannot directly
get the SMT thread number of each core. Instead, use a temporary xarray
to record the SMT thread number of each core while building the topology,
so we can know the largest SMT thread number in the system. Then we can
enable the support of SMT control.

Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
---
   arch/arm64/kernel/topology.c | 24 ++++++++++++++++++++++++
   1 file changed, 24 insertions(+)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 1a2c72f3e7f8..f72e1e55b05e 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -15,8 +15,10 @@
   #include <linux/arch_topology.h>
   #include <linux/cacheinfo.h>
   #include <linux/cpufreq.h>
+#include <linux/cpu_smt.h>
   #include <linux/init.h>
   #include <linux/percpu.h>
+#include <linux/xarray.h>
     #include <asm/cpu.h>
   #include <asm/cputype.h>
@@ -43,11 +45,16 @@ static bool __init acpi_cpu_is_threaded(int cpu)
    */
   int __init parse_acpi_topology(void)
   {
+    int thread_num, max_smt_thread_num = 1;
+    struct xarray core_threads;
       int cpu, topology_id;
+    void *entry;
         if (acpi_disabled)
           return 0;
   +    xa_init(&core_threads);
+
       for_each_possible_cpu(cpu) {
           topology_id = find_acpi_cpu_topology(cpu, 0);
           if (topology_id < 0)
@@ -57,6 +64,20 @@ int __init parse_acpi_topology(void)
               cpu_topology[cpu].thread_id = topology_id;
               topology_id = find_acpi_cpu_topology(cpu, 1);
               cpu_topology[cpu].core_id   = topology_id;
+
+            entry = xa_load(&core_threads, topology_id);
+            if (!entry) {
+                xa_store(&core_threads, topology_id,
+                     xa_mk_value(1), GFP_KERNEL);
+            } else {
+                thread_num = xa_to_value(entry);
+                thread_num++;
+                xa_store(&core_threads, topology_id,
+                     xa_mk_value(thread_num), GFP_KERNEL);
+
+                if (thread_num > max_smt_thread_num)
+                    max_smt_thread_num = thread_num;
+            }
           } else {
               cpu_topology[cpu].thread_id  = -1;
               cpu_topology[cpu].core_id    = topology_id;
@@ -67,6 +88,9 @@ int __init parse_acpi_topology(void)
           cpu_topology[cpu].package_id = topology_id;
       }
   +    cpu_smt_set_num_threads(max_smt_thread_num, max_smt_thread_num);
+
+    xa_destroy(&core_threads);
       return 0;
   }
   #endif
