Re: [Patch V2 2/2] x86/mm/numa: remove the numa_nodemask_from_meminfo()

From: Wei Yang
Date: Mon Apr 10 2017 - 12:42:51 EST


On Mon, Apr 10, 2017 at 02:43:20PM +0200, Borislav Petkov wrote:
>On Sun, Apr 09, 2017 at 11:12:14AM +0800, Wei Yang wrote:
>> Oops, sorry to bring in the regression with my cleanup.
>> I haven't noticed there is a kernel command line "numa=fake", which
>> is the cause of the crash I think.
>
>Of course it is, didn't you see my debugging upthread?
>
>> So from my understanding, I am going to do these tests:
>>
>> 1. all fake numa scenarios with Kirill's qemu command line
>
>It is enough if you boot the kernel with "numa=fake..."
>
>> 2. Real numa scenarios with following qemu command option
>
>Not qemu command option but a kernel cmdline option.
>
>> 3. Baremetal
>>
>> One more question, on the baremetal machine, I can't change the
>> numa configuration, so there would be only one case. Do you have
>> some specific requirement?
>
>numa=fake on baremetal too.
>
>> Well, if I missed something, just let me know :-)
>>
>> > Qemu can emulate real numa too, for example you can boot with:
>> >
>> > -smp 64 \
>> > -numa node,nodeid=0,cpus=1-8 \
>> > -numa node,nodeid=1,cpus=9-16 \
>> > -numa node,nodeid=2,cpus=17-24 \
>> > -numa node,nodeid=3,cpus=25-32 \
>> > -numa node,nodeid=4,cpus=0 \
>> > -numa node,nodeid=4,cpus=33-39 \
>> > -numa node,nodeid=5,cpus=40-47 \
>> > -numa node,nodeid=6,cpus=48-55 \
>> > -numa node,nodeid=7,cpus=56-63
>
>Also, do this in kvm. kvm can emulate a lot of numa configurations, do
>experiment with those too.
>
>Basically, try to break your "cleanup". Stuff one should do for every
>patch one sends anyway.

Hi, Borislav

I have tried several test combinations with fake numa. The results look good.

A test result marked as P (Passed) means the system boots up and a simple
kernel build test succeeds.

# test matrix and result

## Qemu

With qemu, I have tried every combination of phys_node in (1, 4) and
emu_node in (0, 2, 4, 8):

+----------------+--------+--------+
|      phys_node |   1    |   4    |
| emu_node       |        |        |
+----------------+--------+--------+
|       0        |   P    |   P    |
+----------------+--------+--------+
|       2        |   P    |   P    |
+----------------+--------+--------+
|       4        |   P    |   P    |
+----------------+--------+--------+
|       8        |   P    |   P    |
+----------------+--------+--------+

phys_node is emulated with the qemu command line:

"-numa node,nodeid=0,cpus=1-2 -numa node,nodeid=1,cpus=3-4 -numa
node,nodeid=2,cpus=0 -numa node,nodeid=2,cpus=5 -numa
node,nodeid=3,cpus=6-7"

emu_node is emulated with the kernel command line:

"numa=fake=N"
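
As a rough illustration, the matrix above could be enumerated with a small
script like the one below. It is only a sketch: the helper names are made up
for this example, and it assumes that the 1-node guest is started without any
-numa option and that emu_node = 0 means booting without numa=fake. Neither
assumption is spelled out in this mail.

```python
from itertools import product

# qemu -numa options for the 4-node guest, copied from this mail.
QEMU_NUMA_4 = [
    "-numa node,nodeid=0,cpus=1-2",
    "-numa node,nodeid=1,cpus=3-4",
    "-numa node,nodeid=2,cpus=0",
    "-numa node,nodeid=2,cpus=5",
    "-numa node,nodeid=3,cpus=6-7",
]

PHYS_NODES = (1, 4)
EMU_NODES = (0, 2, 4, 8)


def qemu_numa_args(phys_node):
    """Return the qemu -numa options for a guest with phys_node nodes."""
    # Assumption: the 1-node guest is started without any -numa option.
    return " ".join(QEMU_NUMA_4) if phys_node == 4 else ""


def kernel_cmdline(emu_node):
    """Return the kernel parameter that selects emu_node fake nodes."""
    # Assumption: emu_node == 0 means booting without numa=fake.
    return f"numa=fake={emu_node}" if emu_node else ""


def test_matrix():
    """Enumerate every (phys_node, emu_node) cell of the table."""
    return [
        (phys, emu, qemu_numa_args(phys), kernel_cmdline(emu))
        for phys, emu in product(PHYS_NODES, EMU_NODES)
    ]


for phys, emu, qemu, cmdline in test_matrix():
    print(f"phys_node={phys} emu_node={emu} qemu='{qemu}' append='{cmdline}'")
```

Each printed line corresponds to one cell of the table, so a failing
combination can be traced back to the exact options that were used.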

## Baremetal

My machine has only one numa node, so I could only verify phys_node = 1.

+----------------+--------+
|      phys_node |   1    |
| emu_node       |        |
+----------------+--------+
|       0        |   P    |
+----------------+--------+
|       2        |   P    |
+----------------+--------+
|       4        |   P    |
+----------------+--------+
|       8        |   P    |
+----------------+--------+


emu_node is emulated with the kernel command line:

"numa=fake=N"

# Other things I observed

Generally, in the qemu guest everything looks good, but I saw two things on
the baremetal machine.

First, I want to emphasize that I saw the same behavior with and without my
"cleanup".

## only 3 nodes when fake=4

[ 0.000000] Faking a node at [mem 0x0000000000000000-0x000000022f5fffff]
[ 0.000000] Faking node 0 at [mem 0x0000000000000000-0x000000007fffffff] (2048MB)
[ 0.000000] Faking node 1 at [mem 0x0000000080000000-0x0000000133ffffff] (2880MB)
[ 0.000000] Faking node 2 at [mem 0x0000000134000000-0x000000022f5fffff] (4022MB)
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009cfff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x000000007fffffff]
[ 0.000000] node 1: [mem 0x0000000080000000-0x00000000ba5b1fff]
[ 0.000000] node 1: [mem 0x00000000ba5b9000-0x00000000bad8dfff]
[ 0.000000] node 1: [mem 0x00000000bafb6000-0x00000000ca8a1fff]
[ 0.000000] node 1: [mem 0x00000000ca93a000-0x00000000ca977fff]
[ 0.000000] node 1: [mem 0x00000000cafff000-0x00000000caffffff]
[ 0.000000] node 1: [mem 0x0000000100000000-0x0000000133ffffff]
[ 0.000000] node 2: [mem 0x0000000134000000-0x000000022f5fffff]
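
To make the numbers in this log easier to read, the sketch below recomputes
the span of each fake node and the memory actually present in it, using only
the addresses from the log above. The helper names are invented for this
example, and the script is just arithmetic on the printed ranges, not a model
of how the kernel splits the nodes.

```python
MB = 1 << 20

# "Faking node N at [mem start-end]" ranges from the log (inclusive ends).
FAKE_NODES = {
    0: (0x0000000000000000, 0x000000007FFFFFFF),
    1: (0x0000000080000000, 0x0000000133FFFFFF),
    2: (0x0000000134000000, 0x000000022F5FFFFF),
}

# "Early memory node ranges" from the same log, grouped by node.
EARLY_RANGES = {
    0: [(0x0000000000001000, 0x000000000009CFFF),
        (0x0000000000100000, 0x000000007FFFFFFF)],
    1: [(0x0000000080000000, 0x00000000BA5B1FFF),
        (0x00000000BA5B9000, 0x00000000BAD8DFFF),
        (0x00000000BAFB6000, 0x00000000CA8A1FFF),
        (0x00000000CA93A000, 0x00000000CA977FFF),
        (0x00000000CAFFF000, 0x00000000CAFFFFFF),
        (0x0000000100000000, 0x0000000133FFFFFF)],
    2: [(0x0000000134000000, 0x000000022F5FFFFF)],
}


def span_mb(start, end):
    """Size in MB of an inclusive [start, end] address range."""
    return (end - start + 1) // MB


def present_mb(ranges):
    """Total size in MB (rounded down) of a list of inclusive ranges."""
    return sum(end - start + 1 for start, end in ranges) // MB


for nid, (start, end) in FAKE_NODES.items():
    print(f"node {nid}: span {span_mb(start, end)}MB, "
          f"present {present_mb(EARLY_RANGES[nid])}MB")
```

The spans match the sizes printed by the kernel (2048MB, 2880MB, 4022MB).
The span of node 1 includes the hole below 0x100000000, so nodes 0 and 1
each hold roughly 2GB of present memory, while node 2 holds roughly 4GB.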

## some warnings

I don't see these two warnings without "numa=fake=N".

[ 0.004000] sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.004000] ------------[ cut here ]------------
[ 0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:424 topology_sane.isra.5+0x6c/0x70

[ 8.594469] sysfs: cannot create duplicate filename '/devices/platform/coretemp.0/hwmon/hwmon2/temp2_label'
[ 8.594478] ------------[ cut here ]------------
[ 8.594482] WARNING: CPU: 4 PID: 34 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x56/0x70

# Some thoughts on the code

After going through numa_emulation(), I suggest rebuilding
numa_nodes_parsed from the emulated nodes, instead of setting
numa_nodes_parsed directly in emu_setup_memblk().

Two cases come to mind that are not handled well:
1. split_nodes_size_interleave()/split_nodes_interleave() may fail, or a
later step may fail.
2. There may be fewer fake nodes than physical nodes.

Both of them may lead to an inaccurate numa_nodes_parsed, so I have a patch
to rebuild it from the emulated node info.

Will send it soon.
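
The second case can be shown with a small user-space model. This is not
kernel code and not the patch itself: the block layouts are made up, the
helper nodemask_from_meminfo() is a hypothetical stand-in for collecting
node ids from a meminfo-like list, and the "incremental" variant assumes
the mask still holds the physical nodes when the fake nodes are added.

```python
NUMA_NO_NODE = -1  # placeholder value for "no node" in this model


def nodemask_from_meminfo(blocks):
    """Collect the node ids of all non-empty (start, end, nid) blocks."""
    return {
        nid
        for start, end, nid in blocks
        if start != end and nid != NUMA_NO_NODE
    }


# Hypothetical layout: 4 physical nodes of 1GB each.
phys_blocks = [
    (0x00000000, 0x40000000, 0),
    (0x40000000, 0x80000000, 1),
    (0x80000000, 0xC0000000, 2),
    (0xC0000000, 0x100000000, 3),
]

# Hypothetical emulation result: only 2 fake nodes of 2GB each.
emu_blocks = [
    (0x00000000, 0x80000000, 0),
    (0x80000000, 0x100000000, 1),
]

# Setting bits one by one on top of the physical mask (assumed here)
# leaves the stale physical nodes 2 and 3 in the result.
incremental = nodemask_from_meminfo(phys_blocks)
for _start, _end, nid in emu_blocks:
    incremental.add(nid)

# Rebuilding the mask from the emulated blocks reflects only the nodes
# that actually exist after emulation.
rebuilt = nodemask_from_meminfo(emu_blocks)

print("incremental:", sorted(incremental))
print("rebuilt:    ", sorted(rebuilt))
```

In this model the incremental mask still reports four nodes, while the
rebuilt mask reports only the two emulated ones.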

>
>--
>Regards/Gruss,
> Boris.
>
>Good mailing practices for 400: avoid top-posting and trim the reply.

--
Wei Yang
Help you, Help me
