Re: aarch64 ACPI boot regressed by commit 7ba5f605f3a0 ("arm64/numa: remove the limitation that cpu0 must bind to node0")

From: Laszlo Ersek
Date: Fri Oct 14 2016 - 11:01:56 EST


On 10/14/16 15:18, Laszlo Ersek wrote:
> On 10/14/16 10:05, Andrew Jones wrote:
>> On Fri, Oct 14, 2016 at 12:50:29AM +0200, Laszlo Ersek wrote:
>>> (4) Analysis (well, a lame attempt at that, because I have zero
>>> familiarity with this code). Let me quote the patch:
>>>
>>>> commit 7ba5f605f3a0d9495aad539eeb8346d726dfc183
>>>> Author: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
>>>> Date: Thu Sep 1 14:55:04 2016 +0800
>>>>
>>>> arm64/numa: remove the limitation that cpu0 must bind to node0
>>>>
>>>> 1. Remove the old binding code.
>>>> 2. Read the nid of cpu0 from dts.
>>>> 3. Fallback the nid of cpu0 to 0 when numa=off is set in bootargs.
>>>>
>>>> Signed-off-by: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
>>>> Signed-off-by: Will Deacon <will.deacon@xxxxxxx>
>>>>
>>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>>> index c3c08368a685..8b048e6ec34a 100644
>>>> --- a/arch/arm64/kernel/smp.c
>>>> +++ b/arch/arm64/kernel/smp.c
>>>> @@ -624,6 +624,7 @@ static void __init of_parse_and_init_cpus(void)
>>>> }
>>>>
>>>> bootcpu_valid = true;
>>>> + early_map_cpu_to_node(0, of_node_to_nid(dn));
>>>>
>>>> /*
>>>> * cpu_logical_map has already been
>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>> index 0a15f010b64a..778a985c8a70 100644
>>>> --- a/arch/arm64/mm/numa.c
>>>> +++ b/arch/arm64/mm/numa.c
>>>> @@ -116,16 +116,24 @@ static void __init setup_node_to_cpumask_map(void)
>>>> */
>>>> void numa_store_cpu_info(unsigned int cpu)
>>>> {
>>>> - map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
>>>> + map_cpu_to_node(cpu, cpu_to_node_map[cpu]);
>>>> }
>>>>
>>>> void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>>>> {
>>>> /* fallback to node 0 */
>>>> - if (nid < 0 || nid >= MAX_NUMNODES)
>>>> + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
>>>> nid = 0;
>>
>> The ACPI equivalent code must be missing (at least) the above,
>> because, even with DT, mach-virt won't have cpu to node mappings
>> unless numa is configured on the command line. Can you try adding
>> something like
>>
>> -m 512 -smp 4 \
>> -numa node,mem=256M,cpus=0-1,nodeid=0 \
>> -numa node,mem=256M,cpus=2-3,nodeid=1
>>
>> to your QEMU command line?
>
> I added the following to my domain XML, under <cpu>:
>
> <numa>
> <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
> <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/>
> </numa>
>
> (See <http://libvirt.org/formatdomain.html#elementsCPU>.)
>
> With that, each NUMA node gets half of the VCPUs and half of the guest
> RAM.
>
> (This is in a different guest now, one that has a bleeding edge Fedora
> kernel -- I didn't want to rebuild the upstream kernel yet again, just
> for this test. So, "4.9.0-0.rc0.git7.1.fc26.aarch64" is based on
> upstream v4.8-14109-g1573d2c, and it reproduces the problem too.)
>
>> Then when you boot with ACPI you'll get a
>> SRAT.
>
> Yes, that's confirmed by the guest kernel log (see below).
>
>> If that works, then we're just missing the "no SRAT, nid = 0"
>> code (that should have been added with this patch)
>
> It still crashes with the SRAT, with the following log:
>
>> EFI stub: Booting Linux Kernel...
>> ConvertPages: Incompatible memory types
>> EFI stub: Using DTB from configuration table
>> EFI stub: Exiting boot services and installing virtual address map...
>> [ 0.000000] Booting Linux on physical CPU 0x0
>> [ 0.000000] Linux version 4.9.0-0.rc0.git7.1.fc26.aarch64 (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 6.2.1 20160916 (Red Hat 6.2.1-2) (GCC) ) #1 SMP Wed Oct 12 17:44:54 UTC 2016
>> [ 0.000000] Boot CPU: AArch64 Processor [500f0000]
>> [ 0.000000] efi: Getting EFI parameters from FDT:
>> [ 0.000000] efi: EFI v2.60 by EDK II
>> [ 0.000000] efi: SMBIOS 3.0=0xbbdb0000 ACPI 2.0=0xb86d0000 MEMATTR=0xb936b018
>> [ 0.000000] cma: Reserved 512 MiB at 0x00000000e0000000
>> [ 0.000000] ACPI: Early table checksum verification disabled
>> [ 0.000000] ACPI: RSDP 0x00000000B86D0000 000024 (v02 BOCHS )
>> [ 0.000000] ACPI: XSDT 0x00000000B86C0000 000054 (v01 BOCHS BXPCFACP 00000001 01000013)
>> [ 0.000000] ACPI: FACP 0x00000000B83E0000 00010C (v05 BOCHS BXPCFACP 00000001 BXPC 00000001)
>> [ 0.000000] ACPI: DSDT 0x00000000B83F0000 0010E5 (v02 BOCHS BXPCDSDT 00000001 BXPC 00000001)
>> [ 0.000000] ACPI: APIC 0x00000000B83D0000 00018C (v03 BOCHS BXPCAPIC 00000001 BXPC 00000001)
>> [ 0.000000] ACPI: GTDT 0x00000000B83C0000 000060 (v02 BOCHS BXPCGTDT 00000001 BXPC 00000001)
>> [ 0.000000] ACPI: MCFG 0x00000000B83B0000 00003C (v01 BOCHS BXPCMCFG 00000001 BXPC 00000001)
>> [ 0.000000] ACPI: SPCR 0x00000000B83A0000 000050 (v02 BOCHS BXPCSPCR 00000001 BXPC 00000001)
>> [ 0.000000] ACPI: SRAT 0x00000000B8390000 0000C8 (v03 BOCHS BXPCSRAT 00000001 BXPC 00000001)
>> [ 0.000000] ACPI: SPCR: console: pl011,mmio,0x9000000,9600
>> [ 0.000000] earlycon: pl11 at MMIO 0x0000000009000000 (options '9600')
>> [ 0.000000] bootconsole [pl11] enabled
>> [ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x0 -> Node 0
>> [ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x1 -> Node 0
>> [ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x2 -> Node 1
>> [ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x3 -> Node 1
>> [ 0.000000] NUMA: Adding memblock [0x40000000 - 0xbfffffff] on node 0
>> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x40000000-0xbfffffff]
>> [ 0.000000] NUMA: Adding memblock [0xc0000000 - 0x13fffffff] on node 1
>> [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0xc0000000-0x13fffffff]
>> [ 0.000000] NUMA: Initmem setup node 0 [mem 0x40000000-0xbfffffff]
>> [ 0.000000] NUMA: NODE_DATA [mem 0xbfff2580-0xbfffffff]
>> [ 0.000000] NUMA: Initmem setup node 1 [mem 0xc0000000-0x13fffffff]
>> [ 0.000000] NUMA: NODE_DATA [mem 0x13fff2580-0x13fffffff]
>> [ 0.000000] Zone ranges:
>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff]
>> [ 0.000000] Normal [mem 0x0000000100000000-0x000000013fffffff]
>> [ 0.000000] Movable zone start for each node
>> [ 0.000000] Early memory node ranges
>> [ 0.000000] node 0: [mem 0x0000000040000000-0x00000000b838ffff]
>> [ 0.000000] node 0: [mem 0x00000000b8390000-0x00000000b83fffff]
>> [ 0.000000] node 0: [mem 0x00000000b8400000-0x00000000b841ffff]
>> [ 0.000000] node 0: [mem 0x00000000b8420000-0x00000000b874ffff]
>> [ 0.000000] node 0: [mem 0x00000000b8750000-0x00000000bbc1ffff]
>> [ 0.000000] node 0: [mem 0x00000000bbc20000-0x00000000bbffffff]
>> [ 0.000000] node 0: [mem 0x00000000bc000000-0x00000000bfffffff]
>> [ 0.000000] node 1: [mem 0x00000000c0000000-0x000000013fffffff]
>> [ 0.000000] Initmem setup node 0 [mem 0x0000000040000000-0x00000000bfffffff]
>> [ 0.000000] Initmem setup node 1 [mem 0x00000000c0000000-0x000000013fffffff]
>> [ 0.000000] psci: probing for conduit method from ACPI.
>> [ 0.000000] psci: PSCIv0.2 detected in firmware.
>> [ 0.000000] psci: Using standard PSCI v0.2 function IDs
>> [ 0.000000] psci: Trusted OS migration not required
>> [ 0.000000] percpu: Embedded 3 pages/cpu @fffffe007fda0000 s117832 r8192 d70584 u196608
>> [ 0.000000] Detected PIPT I-cache on CPU0
>> [ 0.000000] Built 2 zonelists in Node order, mobility grouping on. Total pages: 65472
>> [ 0.000000] Policy zone: Normal
>> [ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.9.0-0.rc0.git7.1.fc26.aarch64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap LANG=en_US.UTF-8 earlycon acpi=force
>> [ 0.000000] PID hash table entries: 4096 (order: -1, 32768 bytes)
>> [ 0.000000] software IO TLB [mem 0xdbff0000-0xdfff0000] (64MB) mapped at [fffffe009bff0000-fffffe009ffeffff]
>> [ 0.000000] Memory: 3542976K/4194304K available (9148K kernel code, 1612K rwdata, 3776K rodata, 1600K init, 15899K bss, 127040K reserved, 524288K cma-reserved)
>> [ 0.000000] Virtual kernel memory layout:
>> [ 0.000000] modules : 0xfffffc0000000000 - 0xfffffc0008000000 ( 128 MB)
>> vmalloc : 0xfffffc0008000000 - 0xfffffdff5fff0000 ( 2045 GB)
>> .text : 0xfffffc0008080000 - 0xfffffc0008970000 ( 9152 KB)
>> .rodata : 0xfffffc0008970000 - 0xfffffc0008d30000 ( 3840 KB)
>> .init : 0xfffffc0008d30000 - 0xfffffc0008ec0000 ( 1600 KB)
>> .data : 0xfffffc0008ec0000 - 0xfffffc0009053200 ( 1613 KB)
>> .bss : 0xfffffc0009053200 - 0xfffffc0009fda058 ( 15900 KB)
>> fixed : 0xfffffdff7e7d0000 - 0xfffffdff7ec00000 ( 4288 KB)
>> PCI I/O : 0xfffffdff7ee00000 - 0xfffffdff7fe00000 ( 16 MB)
>> vmemmap : 0xfffffdff80000000 - 0xfffffe0000000000 ( 2 GB maximum)
>> 0xfffffdff80000000 - 0xfffffdff80400000 ( 4 MB actual)
>> memory : 0xfffffe0000000000 - 0xfffffe0100000000 ( 4096 MB)
>> [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=2
>> [ 0.000000] Running RCU self tests
>> [ 0.000000] Hierarchical RCU implementation.
>> [ 0.000000] RCU lockdep checking is enabled.
>> [ 0.000000] Build-time adjustment of leaf fanout to 64.
>> [ 0.000000] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
>> [ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=4
>> [ 0.000000] kmemleak: Kernel memory leak detector disabled
>> [ 0.000000] NR_IRQS:64 nr_irqs:64 0
>> [ 0.000000] GICv2m: ACPI overriding V2M MSI_TYPER (base:80, num:64)
>> [ 0.000000] GICv2m: range[mem 0x08020000-0x08020fff], SPI[80:143]
>> [ 0.000000] GIC: PPI11 is secure or misconfigured
>> [ 0.000000] arm_arch_timer: WARNING: Invalid trigger for IRQ3, assuming level low
>> [ 0.000000] arm_arch_timer: WARNING: Please fix your firmware
>> [ 0.000000] arm_arch_timer: Architected cp15 timer(s) running at 50.00MHz (virt).
>> [ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0xb8812736b, max_idle_ns: 440795202655 ns
>> [ 0.000003] sched_clock: 56 bits at 50MHz, resolution 20ns, wraps every 4398046511100ns
>> [ 0.002198] Console: colour dummy device 80x25
>> [ 0.003319] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
>> [ 0.005236] ... MAX_LOCKDEP_SUBCLASSES: 8
>> [ 0.006183] ... MAX_LOCK_DEPTH: 48
>> [ 0.007273] ... MAX_LOCKDEP_KEYS: 8191
>> [ 0.008287] ... CLASSHASH_SIZE: 4096
>> [ 0.009296] ... MAX_LOCKDEP_ENTRIES: 32768
>> [ 0.010327] ... MAX_LOCKDEP_CHAINS: 65536
>> [ 0.011318] ... CHAINHASH_SIZE: 32768
>> [ 0.012453] memory used by lock dependency info: 8159 kB
>> [ 0.013736] per task-struct memory footprint: 1920 bytes
>> [ 0.015742] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
>> [ 0.018710] Calibrating delay loop (skipped), value calculated using timer frequency.. 100.00 BogoMIPS (lpj=50000)
>> [ 0.021221] pid_max: default: 32768 minimum: 301
>> [ 0.022806] ACPI: Core revision 20160831
>> [ 0.027885] ACPI: 1 ACPI AML tables successfully acquired and loaded
>>
>> [ 0.030252] Security Framework initialized
>> [ 0.031355] Yama: becoming mindful.
>> [ 0.032176] SELinux: Initializing.
>> [ 0.033925] Dentry cache hash table entries: 524288 (order: 6, 4194304 bytes)
>> [ 0.037039] Inode-cache hash table entries: 262144 (order: 5, 2097152 bytes)
>> [ 0.039383] Mount-cache hash table entries: 8192 (order: 0, 65536 bytes)
>> [ 0.041135] Mountpoint-cache hash table entries: 8192 (order: 0, 65536 bytes)
>> [ 0.044725] ftrace: allocating 29596 entries in 8 pages
>> [ 0.080467] ASID allocator initialised with 65536 entries
>> [ 0.082070] ------------[ cut here ]------------
>> [ 0.083227] WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:5458 wq_numa_init+0x178/0x21c
>> [ 0.085304] Modules linked in:
>> [ 0.086102]
>> [ 0.086499] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-0.rc0.git7.1.fc26.aarch64 #1
>> [ 0.088611] Hardware name: linux,dummy-virt (DT)
>> [ 0.089816] task: fffffe00700aac00 task.stack: fffffe00f8044000
>> [ 0.091375] PC is at wq_numa_init+0x178/0x21c
>> [ 0.092514] LR is at wq_numa_init+0x14c/0x21c
>> [ 0.093654] pc : [<fffffc0008d3f434>] lr : [<fffffc0008d3f408>] pstate: 60000045
>> [ 0.095589] sp : fffffe00f8047cb0
>> [ 0.096457] x29: fffffe00f8047cb0 [ 0.097311] x28: 0000000000000000
>> [ 0.098201]
>> [ 0.098601] x27: 0000000000000000 [ 0.099450] x26: fffffc0008ef4a28
>> [ 0.100342]
>> [ 0.100730] x25: fffffc0008ef3000 [ 0.101576] x24: fffffc0008ef3574
>> [ 0.102466]
>> [ 0.102853] x23: 0000000000000000 [ 0.103700] x22: fffffe007937de00
>> [ 0.104593]
>> [ 0.104982] x21: fffffc0008e887f8 [ 0.105829] x20: fffffc0009091000
>> [ 0.106723]
>> [ 0.107111] x19: 0000000000000000 [ 0.107956] x18: 0000000050642c6a
>> [ 0.108847]
>> [ 0.109234] x17: 0000000000000000 [ 0.110078] x16: 0000000000000000
>> [ 0.110968]
>> [ 0.111363] x15: 00000000fcacdc89 [ 0.112199] x14: 0000000000000000
>> [ 0.113087]
>> [ 0.113481] x13: 0000000000000000 [ 0.114324] x12: 00000000fe2ce6e0
>> [ 0.115204]
>> [ 0.115597] x11: 0000000000000001 [ 0.116439] x10: 0000000000000048
>> [ 0.117328]
>> [ 0.117716] x9 : 0000000000000000 [ 0.118563] x8 : fffffe00f4010080
>> [ 0.119453]
>> [ 0.119833] x7 : 0000000000000000 [ 0.120678] x6 : 0000000000000000
>> [ 0.121571]
>> [ 0.121959] x5 : 000000000000000f [ 0.122804] x4 : 0000000000000000
>> [ 0.123695]
>> [ 0.124084] x3 : 0000000000000000 [ 0.124922] x2 : 0000000000000000
>> [ 0.125815]
>> [ 0.126204] x1 : 0000000000000004 [ 0.127055] x0 : 00000000ffffffff
>> [ 0.127966]
>> [ 0.128361]
>> [ 0.128767] ---[ end trace 0000000000000000 ]---
>> [ 0.129983] Call trace:
>> [ 0.130629] Exception stack(0xfffffe00f8047ad0 to 0xfffffe00f8047c00)
>> [ 0.132316] 7ac0: 0000000000000000 0000040000000000
>> [ 0.134360] 7ae0: fffffe00f8047cb0 fffffc0008d3f434 0000000060000045 000000000000003d
>> [ 0.136405] 7b00: fffffc0008ef4000 fffffe007937df00 0000000000000000 0000000000000000
>> [ 0.138446] 7b20: fffffc0008bf4110 0000000000000189 0000000000000018 0000000000000028
>> [ 0.140498] 7b40: fffffe00f8047b80 0000000000000000 fffffe0000000000 fffffc000848af30
>> [ 0.142541] 7b60: fffffe00f8047ba0 fffffc0008134d24 fffffe00f8044000 0000000000000040
>> [ 0.144558] 7b80: 00000000ffffffff 0000000000000004 0000000000000000 0000000000000000
>> [ 0.146607] 7ba0: 0000000000000000 000000000000000f 0000000000000000 0000000000000000
>> [ 0.148664] 7bc0: fffffe00f4010080 0000000000000000 0000000000000048 0000000000000001
>> [ 0.150704] 7be0: 00000000fe2ce6e0 0000000000000000 0000000000000000 00000000fcacdc89
>> [ 0.152752] [<fffffc0008d3f434>] wq_numa_init+0x178/0x21c
>> [ 0.154160] [<fffffc0008d3f578>] init_workqueues+0xa0/0x4b8
>> [ 0.155596] [<fffffc0008083594>] do_one_initcall+0x44/0x138
>> [ 0.157059] [<fffffc0008d30d28>] kernel_init_freeable+0x178/0x2dc
>> [ 0.158670] [<fffffc0008956f48>] kernel_init+0x18/0x110
>> [ 0.160036] [<fffffc0008083330>] ret_from_fork+0x10/0x20
>> [ 0.161440] workqueue: NUMA node mapping not available for cpu0, disabling NUMA support
>> [ 0.165296] Remapping and enabling EFI services.
>> [ 0.166586] Unable to handle kernel paging request at virtual address b91000006be8
>> [ 0.168448] pgd = fffffc000a010000
>> [ 0.169341] [b91000006be8] *pgd=0000000000000000[ 0.170505] , *pud=0000000000000000
>> , *pmd=0000000000000000[ 0.171942]
>> [ 0.172332] Internal error: Oops: 96000004 [#1] SMP
>> [ 0.173600] Modules linked in:
>> [ 0.174407] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.9.0-0.rc0.git7.1.fc26.aarch64 #1
>> [ 0.176836] Hardware name: linux,dummy-virt (DT)
>> [ 0.178038] task: fffffe00700aac00 task.stack: fffffe00f8044000
>> [ 0.179579] PC is at __ll_sc_atomic_add+0x20/0x40
>> [ 0.180800] LR is at __lock_acquire+0xe8/0x698
>> [ 0.181961] pc : [<fffffc0008487390>] lr : [<fffffc0008138c08>] pstate: 800000c5
>> [ 0.183895] sp : fffffe00f8047820
>> [ 0.184755] x29: fffffe00f8047820 [ 0.185588] x28: fffffc0008ef3000
>> [ 0.186479]
>> [ 0.186868] x27: fffffc0008ef2358 [ 0.187713] x26: fffffc0009ce6000
>> [ 0.188606]
>> [ 0.188997] x25: 0000000000000001 [ 0.189857] x24: 0000000000000000
>> [ 0.190731]
>> [ 0.191115] x23: fffffe00700aac00 [ 0.191951] x22: 0000000000000000
>> [ 0.192843]
>> [ 0.193231] x21: fffffe007fd9a018 [ 0.194074] x20: 0000000000000000
>> [ 0.194966]
>> [ 0.195361] x19: fffffe007fd9a018 [ 0.196192] x18: 0000000000000010
>> [ 0.197077]
>> [ 0.197476] x17: 0000000057181979 [ 0.198325] x16: 0000000000000000
>> [ 0.199209]
>> [ 0.199604] x15: 0000000000000000 [ 0.200450] x14: 0000000000000000
>> [ 0.201337]
>> [ 0.201723] x13: 0000000000000001 [ 0.202555] x12: fffffe007fff2580
>> [ 0.203432]
>> [ 0.203819] x11: 0000000000000000 [ 0.204664] x10: 0000000000000011
>> [ 0.205550]
>> [ 0.205937] x9 : 0000000000000001 [ 0.206784] x8 : 0000b91000006be8
>> [ 0.207678]
>> [ 0.208062] x7 : fffffc0008299fcc [ 0.208899] x6 : 0000000000000000
>> [ 0.209787]
>> [ 0.210176] x5 : 0000000000000080 [ 0.211022] x4 : 0000b91000006a50
>> [ 0.211913]
>> [ 0.212307] x3 : 0000000000000000 [ 0.213147] x2 : 000022c80000f420
>> [ 0.214034]
>> [ 0.214421] x1 : 0000b91000006be8 [ 0.215251] x0 : fffffc0008138c08
>> [ 0.216134]
>> [ 0.216527]
>> [ 0.216916] Process swapper/0 (pid: 1, stack limit = 0xfffffe00f8044020)
>> [ 0.218671] Stack: (0xfffffe00f8047820 to 0xfffffe00f8048000)
>> [ 0.220167] 7820: fffffe00f8047840 fffffc0008138c08 fffffe00f8044000 0000000000000001
>> [ 0.222190] 7840: fffffe00f80478c0 fffffc0008139590 fffffe007fd9a018 0000000000000000
>> [ 0.224238] 7860: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
>> [ 0.226284] 7880: fffffc0008299fcc 00000000000000c0 fffffc0008ef2358 fffffc0008ef3000
>> [ 0.228318] 78a0: 0000000000000001 fffffc0009ce6000 0000000000000000 fffffe0000000000
>> [ 0.230362] 78c0: fffffe00f8047930 fffffc000895f2c4 fffffe007fd9a000 fffffc0008299fcc
>> [ 0.232394] 78e0: fffffe007fd9a000 fffffc000829ad94 fffffe007001db00 000000000000e8e8
>> [ 0.234435] 7900: fffffe007001db00 fffffe007001dbf8 fffffe00fff3ef50 0000000000000000
>> [ 0.236481] 7920: fffffe00f8047a20 fffffc0008ef2000 fffffe00f8047950 fffffc0008299fcc
>> [ 0.238516] 7940: 00000000ffffffff fffffe007fd9a000 fffffe00f8047a70 fffffc000829aa68
>> [ 0.240560] 7960: 00000000ffffffff 0000000000000001 00000000024000c0 fffffc000829ad94
>> [ 0.242604] 7980: 0000000000210d00 000000000000e8e8 fffffe007001db00 fffffe007001dbf8
>> [ 0.244634] 79a0: fffffe00fff3ef50 0000000000000000 fffffe00f8044000 0000000000000040
>> [ 0.246678] 79c0: fffffc000828d620 fffffc0008ef3000 00000000026080c0 fffffe00fff3ef60
>> [ 0.248733] 79e0: fffffe00f8047a00 fffffc00024000c0 fffffc0008f89000 0000000000000000
>> [ 0.250783] 7a00: fffffe00f8047a20 fffffc000822f62c fffffc0009016b30 fffffe00f8047b40
>> [ 0.252896] 7a20: fffffe00f8047ba0 fffffc000828d620 0000000000000000 fffffc0008ef0b28
>> [ 0.255009] 7a40: fffffe007fff3c00 0000000000000000 0000000000000000 0000000000000000
>> [ 0.257121] 7a60: fffffe00f8044000 0000000000000000 fffffe00f8047b90 fffffc000829ad94
>> [ 0.259240] 7a80: 0000000000000040 fffffe007001db00 00000000024000c0 00000000ffffffff
>> [ 0.261358] 7aa0: fffffc0008266284 fffffe00fff3ef50 0000000020000000 00e8000000000f07
>> [ 0.263472] 7ac0: 0000000000000000 0000000000000400 fffffc0008f89000 0000000000000000
>> [ 0.265662] 7ae0: fffffe00f8047b00 fffffc000822f62c fffffe00fff3ef60 0000000000000000
>> [ 0.267787] 7b00: 0000001000000000 fffffc0008266284 fffffe00f8047b50 fffffc0008134d24
>> [ 0.269905] 7b20: fffffe00f8044000 0000000000000040 fffffc0008bf4110 0000000000000189
>> [ 0.272020] 7b40: fffffc0008ef4000 0000000000000000 fffffe00f8047b70 fffffc000810267c
>> [ 0.274136] 7b60: fffffc0009016893 0000000000000000 fffffe00f8047ba0 fffffc0008102784
>> [ 0.276250] 7b80: fffffe00f8047b90 fffffc000829ad7c fffffe00f8047bd0 fffffc000829b13c
>> [ 0.278371] 7ba0: fffffe007001db00 00000000024000c0 fffffc0008266284 fffffe007001db00
>> [ 0.280484] 7bc0: fffffc0008ef4000 0000000000000000 fffffe00f8047c30 fffffc0008266284
>> [ 0.282600] 7be0: fffffdff801b0200 fffffe006c080000 000000006c080000 0000000020000000
>> [ 0.284715] 7c00: fffffe00f0010008 0000000004000000 0000000020000000 00e8000000000f07
>> [ 0.286831] 7c20: 0000000000000000 0000000000000000 fffffe00f8047c50 fffffc0008098e24
>> [ 0.288948] 7c40: fffffdff801b0200 0000000000000001 fffffe00f8047c80 fffffc00080991d0
>> [ 0.291062] 7c60: 0000000024000000 0000000000000001 0000000024000000 fffffc0008ef0b28
>> [ 0.293178] 7c80: fffffe00f8047d00 fffffc0008d361cc fffffe0078416018 00e8000000000707
>> [ 0.295296] 7ca0: fffffc0008ff6410 fffffc0008ef7000 0000000000000000 fffffc0008ff6410
>> [ 0.297408] 7cc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.299523] 7ce0: 0000000000000000 00e8000000000f05 fffffc0008098dd0 0000000023ffffff
>> [ 0.301636] 7d00: fffffe00f8047d10 fffffc0008d35020 fffffe00f8047d40 fffffc0008d88284
>> [ 0.303748] 7d20: fffffe0078416018 fffffc0008ff6000 fffffc0008c87348 fffffc0008d8821c
>> [ 0.305863] 7d40: fffffe00f8047d90 fffffc0008083594 fffffc0008d88154 fffffe00f8044000
>> [ 0.307987] 7d60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.310099] 7d80: 0000000000000000 0000000004000000 fffffe00f8047e00 fffffc0008d30d28
>> [ 0.312217] 7da0: fffffc0008e622d8 fffffc0008e622e0 0000000000000040 0000000000000000
>> [ 0.314333] 7dc0: fffffe00f8047e00 fffffc0008d30d18 fffffc0008e62220 fffffc0008e622e0
>> [ 0.316445] 7de0: 0000000000000040 0000000000000000 0000000000000000 fffffc0008e622e0
>> [ 0.318572] 7e00: fffffe00f8047ea0 fffffc0008956f48 fffffc0008956f30 0000000000000000
>> [ 0.320692] 7e20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.322805] 7e40: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
>> [ 0.324914] 7e60: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
>> [ 0.327027] 7e80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.329139] 7ea0: 0000000000000000 fffffc0008083330 fffffc0008956f30 0000000000000000
>> [ 0.331248] 7ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.333361] 7ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.335470] 7f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.337585] 7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.339695] 7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.341810] 7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.343923] 7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.346037] 7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.348154] 7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
>> [ 0.350272] 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 0.352392] Call trace:
>> [ 0.353049] Exception stack(0xfffffe00f8047650 to 0xfffffe00f8047780)
>> [ 0.354792] 7640: fffffe007fd9a018 0000040000000000
>> [ 0.356910] 7660: fffffe00f8047820 fffffc0008487390 fffffe00f80476e0 fffffc0008131290
>> [ 0.359025] 7680: fffffc000901690b fffffc0008f1e000 0000000000000001 fffffe00700aac00
>> [ 0.361140] 76a0: fffffc000901690b fffffc0008f27a28 fffffe00fff3b700 fffffc0008e8b700
>> [ 0.363255] 76c0: fffffe00fff3b700 fffffc0008ef1000 fffffe00f80476e0 00000000000000c0
>> [ 0.365373] 76e0: fffffe00f8047720 fffffc000811a374 fffffc0008138c08 0000b91000006be8
>> [ 0.367483] 7700: 000022c80000f420 0000000000000000 0000b91000006a50 0000000000000080
>> [ 0.369593] 7720: 0000000000000000 fffffc0008299fcc 0000b91000006be8 0000000000000001
>> [ 0.371702] 7740: 0000000000000011 0000000000000000 fffffe007fff2580 0000000000000001
>> [ 0.373817] 7760: 0000000000000000 0000000000000000 0000000000000000 0000000057181979
>> [ 0.375935] [<fffffc0008487390>] __ll_sc_atomic_add+0x20/0x40
>> [ 0.377489] [<fffffc0008138c08>] __lock_acquire+0xe8/0x698
>> [ 0.378960] [<fffffc0008139590>] lock_acquire+0xd8/0x2c0
>> [ 0.380394] [<fffffc000895f2c4>] _raw_spin_lock+0x4c/0x60
>> [ 0.381843] [<fffffc0008299fcc>] get_partial_node.isra.23+0x4c/0x440
>> [ 0.383559] [<fffffc000829aa68>] ___slab_alloc+0x438/0x710
>> [ 0.385031] [<fffffc000829ad94>] __slab_alloc+0x54/0xa0
>> [ 0.386441] [<fffffc000829b13c>] kmem_cache_alloc+0x35c/0x428
>> [ 0.387983] [<fffffc0008266284>] ptlock_alloc+0x2c/0x58
>> [ 0.389394] [<fffffc0008098e24>] pgd_pgtable_alloc+0x54/0xd8
>> [ 0.390912] [<fffffc00080991d0>] __create_pgd_mapping+0x158/0x2a8
>> [ 0.392556] [<fffffc0008d361cc>] create_pgd_mapping+0x30/0x38
>> [ 0.394100] [<fffffc0008d35020>] efi_create_mapping+0xfc/0x110
>> [ 0.395682] [<fffffc0008d88284>] arm_enable_runtime_services+0x130/0x204
>> [ 0.397501] [<fffffc0008083594>] do_one_initcall+0x44/0x138
>> [ 0.399001] [<fffffc0008d30d28>] kernel_init_freeable+0x178/0x2dc
>> [ 0.400646] [<fffffc0008956f48>] kernel_init+0x18/0x110
>> [ 0.402053] [<fffffc0008083330>] ret_from_fork+0x10/0x20
>> [ 0.403488] Code: aa1e03e0 aa0103e8 d503201f f9800111 (885f7d00)
>> [ 0.405145] ---[ end trace f6be31446b0a9526 ]---
>> [ 0.406286] note: swapper/0[1] exited with preempt_count 1
>> [ 0.407687] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>> [ 0.407687]
>> [ 0.410047] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>> [ 0.410047]
>>
>
> This log contains two call traces. The first is a WARNING in
> wq_numa_init(). The second is the unhandled page fault.
>
> Note the warning message (from wq_numa_init()):
>
> workqueue: NUMA node mapping not available for cpu0, disabling NUMA support
>
> Something looks genuinely broken with the cpu <-> numa-node
> associations in the ACPI case -- it even seems to fail when the SRAT
> does exist.
>
> So, perhaps, commit 7ba5f605f3a0 may not have introduced the bug, only
> exposed one in the ACPI code?...

Okay, so let me repeat,

smp_init_cpus()                          [arch/arm64/kernel/smp.c]
  acpi_table_parse_madt()                [drivers/acpi/tables.c]
    acpi_parse_gic_cpu_interface()       [arch/arm64/kernel/smp.c]
      acpi_map_gic_cpu_interface()       [arch/arm64/kernel/smp.c]
        early_map_cpu_to_node()          [arch/arm64/mm/numa.c]

We have acpi_map_gic_cpu_interface() being called for each GICC
structure in the MADT (signature "APIC"). This function is supposed to
set up a number of things for the CPU found, including its association
with a NUMA node. This should happen even if we have only one node (no
SRAT), and it should happen for CPU#0 as well.

acpi_map_gic_cpu_interface() uses the global variable "cpu_count" like
this:
(a) on input, it is the number of CPUs found previously, that is, the
logical identifier of the CPU being added presently,
(b) on output, it is bumped by one, if the CPU got added / parsed
correctly,
(c) in-between, we have expressions like:

> if (is_mpidr_duplicate(cpu_count, hwid)) {
> 	pr_err("duplicate CPU MPIDR 0x%llx in MADT\n", hwid);
> 	return;
> }

and

> if (cpu_count >= NR_CPUS)
> 	return;

(note: this implies that NR_CPUS is an exclusive limit)

and -- importantly --

> /* map the logical cpu id to cpu MPIDR */
> cpu_logical_map(cpu_count) = hwid;

and -- even more importantly --

> early_map_cpu_to_node(cpu_count, acpi_numa_get_nid(cpu_count, hwid));

A whole bunch of stuff seems to be wrong with this, when we try to
interpret it for CPU#0. Such as:

(1) the global variable "cpu_count" is initialized to one, not zero.
This dates back to the following commit:

> commit 0f0783365cbb7ec13a8f02198f6e1a146d94a5a9
> Author: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>
> Date: Wed May 13 14:12:47 2015 +0100
>
> ARM64: kernel: unify ACPI and DT cpus initialization

It means that none of the above checks and assignments will be performed
for CPU#0.

It also means that should we actually find NR_CPUS CPUs, the last one
will be rejected, because at that point, cpu_count will equal NR_CPUS
*on input*.

(2) On arm64, cpu_logical_map() is implemented like this
[arch/arm64/include/asm/smp_plat.h]:

> /*
> * Logical CPU mapping.
> */
> extern u64 __cpu_logical_map[NR_CPUS];
> #define cpu_logical_map(cpu) __cpu_logical_map[cpu]

So this is the declaration. The definition is back in
"arch/arm64/kernel/setup.c":

> u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };

where INVALID_HWID is ULONG_MAX.

This implies that

> /* map the logical cpu id to cpu MPIDR */
> cpu_logical_map(cpu_count) = hwid;

will never store a hwid different from INVALID_HWID to
__cpu_logical_map[0], because "cpu_count" -- the offset into that array,
for the assignment -- is never zero.
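
Points (1) and (2) can be reproduced in a small standalone model. This
is only a sketch, not the kernel code: NR_CPUS is shrunk to 4 for the
demo, and logical_map, model_reset() and model_add_cpu() are
hypothetical stand-ins for __cpu_logical_map, the initializer of
"cpu_count", and the tail of acpi_map_gic_cpu_interface(). All the
other checks of the real function are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS      4            /* assumption: tiny value for the demo */
#define INVALID_HWID UINT64_MAX   /* stands in for the kernel's ULONG_MAX */

/* Hypothetical stand-ins for the kernel globals discussed above. */
static uint64_t logical_map[NR_CPUS];
static unsigned int cpu_count;

/* Reset the model; initial_count plays the role of cpu_count's initializer. */
static void model_reset(unsigned int initial_count)
{
	for (size_t i = 0; i < NR_CPUS; i++)
		logical_map[i] = INVALID_HWID;
	cpu_count = initial_count;
}

/* Record one CPU. Returns 1 if it was stored, 0 if it was rejected. */
static int model_add_cpu(uint64_t hwid)
{
	if (cpu_count >= NR_CPUS)   /* NR_CPUS is an exclusive limit */
		return 0;

	logical_map[cpu_count] = hwid;
	cpu_count++;
	return 1;
}
```

With model_reset(1), feeding four hwids stores the first three in slots
1..3, rejects the fourth, and leaves logical_map[0] at INVALID_HWID.
With model_reset(0), all four are stored and slot 0 is populated.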

(3) early_map_cpu_to_node() will never set cpu_to_node_map[0] to any
NUMA node ID.

(If early_map_cpu_to_node() was called with cpu_count==0 (correctly), it
would call set_cpu_numa_node(), due to the change implemented by
7ba5f605f3a0:

> /*
>  * We should set the numa node of cpu0 as soon as possible, because it
>  * has already been set up online before. cpu_to_node(0) will soon be
>  * called.
>  */
> if (!cpu)
> 	set_cpu_numa_node(cpu, nid);

but I don't know what that would suffice for.)

(4) The acpi_numa_get_nid() function deserves separate treatment:

> int acpi_numa_get_nid(unsigned int cpu, u64 hwid)
> {
> 	int i;
>
> 	for (i = 0; i < cpus_in_srat; i++) {
> 		if (hwid == early_node_cpu_hwid[i].cpu_hwid)
> 			return early_node_cpu_hwid[i].node_id;
> 	}
>
> 	return NUMA_NO_NODE;
> }

So,

(4a) if we have no SRAT (because there's only one NUMA node), then this
function will invariably return NUMA_NO_NODE (value -1), which means
that *even if* early_map_cpu_to_node() was called with cpu_count==0
(which it is not, see (3) above), the assigned NUMA node ID would still
be NUMA_NO_NODE. That's wrong, it should be zero.
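
The lookup in (4a) can be tried out in isolation. The following is a
standalone sketch, not the kernel implementation: srat_cpus,
model_srat_add(), model_get_nid() and model_clamp_nid() are
hypothetical names, and MAX_NUMNODES / MAX_SRAT_CPUS are arbitrary demo
values. The second helper mirrors the range check from the diff quoted
at the top of this message (without the numa_off term). If that check
runs, a -1 from the lookup becomes 0; whether the ACPI path ever
reaches it for CPU#0 is the open question here.

```c
#include <stdint.h>

#define NUMA_NO_NODE  (-1)
#define MAX_NUMNODES  4   /* assumption: demo value */
#define MAX_SRAT_CPUS 8   /* hypothetical table size */

struct srat_cpu_entry {
	int node_id;
	uint64_t cpu_hwid;
};

/* Model of the table filled from SRAT; empty when there is no SRAT. */
static struct srat_cpu_entry srat_cpus[MAX_SRAT_CPUS];
static int cpus_in_srat;

/* Append one (hwid, node) pair. Returns 0 on success, -1 if full. */
static int model_srat_add(uint64_t hwid, int node_id)
{
	if (cpus_in_srat >= MAX_SRAT_CPUS)
		return -1;
	srat_cpus[cpus_in_srat].cpu_hwid = hwid;
	srat_cpus[cpus_in_srat].node_id = node_id;
	cpus_in_srat++;
	return 0;
}

/* Same shape as the quoted lookup: linear search keyed on hwid only. */
static int model_get_nid(uint64_t hwid)
{
	for (int i = 0; i < cpus_in_srat; i++) {
		if (srat_cpus[i].cpu_hwid == hwid)
			return srat_cpus[i].node_id;
	}
	return NUMA_NO_NODE;
}

/* Range check as in the quoted diff: out-of-range ids fall back to 0. */
static int model_clamp_nid(int nid)
{
	if (nid < 0 || nid >= MAX_NUMNODES)
		nid = 0;
	return nid;
}
```

With an empty table, model_get_nid() returns NUMA_NO_NODE for every
hwid. Once it is filled with the four PXM/MPIDR pairs from the log
above, MPIDR 0x2 resolves to node 1.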

(4b) The acpi_numa_get_nid() function completely ignores its first
parameter, called "cpu" (set from "cpu_count" at the call site). This
has been the case since the birth of that function, namely

> commit d8b47fca8c233642d1a20fa4025579ebc8be6f1e
> Author: Hanjun Guo <hanjun.guo@xxxxxxxxxx>
> Date: Tue May 24 15:35:44 2016 -0700
>
> arm64, ACPI, NUMA: NUMA support based on SRAT and SLIT

I guess if that parameter is unnecessary, it should be removed.


I'm sorry but I can't even begin to untangle this mess. Maybe the code I
tried to analyze in this email was never *meant* to associate CPU#0 with
any NUMA node at all (not even node 0); instead, other code -- for
example code removed by 7ba5f605f3a0 -- was meant to perform that
association.

If that's the case, then the code I listed here might even be correct,
for CPUs with logical IDs >= 1. The initialization of "cpu_count" to 1
does suggest that CPU#0 was never meant to be handled by
acpi_map_gic_cpu_interface(). I can't tell.

What I can tell is that 7ba5f605f3a0 breaks the ACPI boot. So
- either (parts of) it should be reverted please,
- or the ACPI boot path should be extended please, so that it handles
CPU#0 as well (associating it with NUMA node #0 if there is no SRAT,
and NUMA node #whatever, if there's an SRAT saying so).
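
To illustrate the second option, here is a rough standalone model of
what "handling CPU#0 as well" could look like. It is a sketch under my
own assumptions, not a proposed patch: model_map_all(), nid_or_zero()
and cpu_to_node_model are hypothetical names, NR_CPUS is shrunk to 4,
and the MADT and SRAT are reduced to plain arrays.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS      4      /* assumption: tiny value for the demo */
#define NUMA_NO_NODE (-1)

struct srat_entry {
	uint64_t hwid;
	int nid;
};

/* Model of the per-CPU node table; index is the logical CPU id. */
static int cpu_to_node_model[NR_CPUS];

/* Look up hwid in an SRAT-like table; use node 0 when it is absent. */
static int nid_or_zero(const struct srat_entry *srat, int n, uint64_t hwid)
{
	for (int i = 0; i < n; i++) {
		if (srat[i].hwid == hwid)
			return srat[i].nid;
	}
	return 0;
}

/*
 * Walk a MADT-like list of hwids. The entry matching boot_hwid becomes
 * logical CPU 0; the others get ids 1, 2, ... in table order. Every
 * mapped CPU, including CPU 0, is given a node. Returns the number of
 * entries that were mapped.
 */
static int model_map_all(const uint64_t *madt, int madt_len,
			 uint64_t boot_hwid,
			 const struct srat_entry *srat, int srat_len)
{
	int next = 1, mapped = 0;

	for (int i = 0; i < NR_CPUS; i++)
		cpu_to_node_model[i] = NUMA_NO_NODE;

	for (int i = 0; i < madt_len; i++) {
		int cpu;

		if (madt[i] == boot_hwid)
			cpu = 0;
		else if (next < NR_CPUS)
			cpu = next++;
		else
			continue;   /* no logical id left for this entry */

		cpu_to_node_model[cpu] = nid_or_zero(srat, srat_len, madt[i]);
		mapped++;
	}
	return mapped;
}
```

Called with no SRAT (srat == NULL, srat_len == 0), every CPU including
CPU#0 ends up on node 0. Called with the four SRAT entries from the log
above, CPUs 0-1 land on node 0 and CPUs 2-3 on node 1.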

Thanks,
Laszlo