Re: [PATCH v2 1/2] mm/page_alloc: use ac->high_zoneidx for classzone_idx

From: Joonsoo Kim
Date: Thu Mar 19 2020 - 04:58:13 EST


On Thu, Mar 19, 2020 at 6:29 AM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Wed, 18 Mar 2020, js1304@xxxxxxxxx wrote:
>
> > From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >
> > Currently, we use the zone index of preferred_zone, which represents
> > the best matching zone for allocation, as classzone_idx. This has a problem
> > on NUMA systems with ZONE_MOVABLE.
> >
>
> Hi Joonsoo,

Hello, David.

> More specifically, it has a problem on NUMA systems when the lowmem
> reserve protection exists for some zones on a node that do not exist on
> other nodes, right?

Right.

> In other words, to make sure I understand correctly, if your node 1 had a
> ZONE_MOVABLE then this would not have happened. If that's true, it might
> be helpful to call out that ZONE_MOVABLE itself is not necessarily a
> problem, but a system where one node has ZONE_NORMAL and ZONE_MOVABLE and
> another only has ZONE_NORMAL is the problem.

Okay. I will try to re-write the commit message as you suggested.

> > On a NUMA system, each node can have different populated zones. For example,
> > node 0 could have DMA/DMA32/NORMAL/MOVABLE zones and node 1 could have only
> > a NORMAL zone. In this setup, an allocation request initiated on node 0 and
> > one initiated on node 1 would have different classzone_idx, 3 and 2,
> > respectively, since their preferred_zones are
> > different. If they are handled by only their own node, there is no problem.
>
> I'd say "If the allocation is local" rather than "If they are handled by
> only their own node".

I will replace it with yours. Thanks for correcting.
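
(For reference, the difference between the two ways of deriving classzone_idx
can be sketched roughly as below. This is only an illustration of the idea,
not the actual patch; struct alloc_context, zonelist_zone_idx() and
ac->high_zoneidx are the real mm/page_alloc names as far as I know, but the
wrapper function itself is made up for this example.)

	/* Illustrative only: how classzone_idx is derived for an allocation. */
	static enum zone_type classzone_idx_for(struct alloc_context *ac,
						bool use_high_zoneidx)
	{
		if (use_high_zoneidx)
			/* After the patch: gfp_zone(gfp_mask), identical on every node. */
			return ac->high_zoneidx;

		/*
		 * Before the patch: index of the best populated zone on the
		 * preferred node, so node 0 (with ZONE_MOVABLE) gets 3 while
		 * node 1 (NORMAL only) gets 2 for the same GFP flags.
		 */
		return zonelist_zone_idx(ac->preferred_zoneref);
	}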

> > However, if they are sometimes handled by a remote node due to memory
> > shortage, the problem can happen.
> >
> > In the following setup, an allocation initiated on node 1 will take some
> > precedence over an allocation initiated on node 0 when the former is
> > processed on node 0 due to insufficient memory on node 1. They will have
> > different lowmem reserves due to their different classzone_idx, so their
> > watermark bars are also different.
> >
> > root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
> > Node 0, zone DMA
> > per-node stats
> > ...
> > pages free 3965
> > min 5
> > low 8
> > high 11
> > spanned 4095
> > present 3998
> > managed 3977
> > protection: (0, 2961, 4928, 5440)
> > ...
> > Node 0, zone DMA32
> > pages free 757955
> > min 1129
> > low 1887
> > high 2645
> > spanned 1044480
> > present 782303
> > managed 758116
> > protection: (0, 0, 1967, 2479)
> > ...
> > Node 0, zone Normal
> > pages free 459806
> > min 750
> > low 1253
> > high 1756
> > spanned 524288
> > present 524288
> > managed 503620
> > protection: (0, 0, 0, 4096)
> > ...
> > Node 0, zone Movable
> > pages free 130759
> > min 195
> > low 326
> > high 457
> > spanned 1966079
> > present 131072
> > managed 131072
> > protection: (0, 0, 0, 0)
> > ...
> > Node 1, zone DMA
> > pages free 0
> > min 0
> > low 0
> > high 0
> > spanned 0
> > present 0
> > managed 0
> > protection: (0, 0, 1006, 1006)
> > Node 1, zone DMA32
> > pages free 0
> > min 0
> > low 0
> > high 0
> > spanned 0
> > present 0
> > managed 0
> > protection: (0, 0, 1006, 1006)
> > Node 1, zone Normal
> > per-node stats
> > ...
> > pages free 233277
> > min 383
> > low 640
> > high 897
> > spanned 262144
> > present 262144
> > managed 257744
> > protection: (0, 0, 0, 0)
> > ...
> > Node 1, zone Movable
> > pages free 0
> > min 0
> > low 0
> > high 0
> > spanned 262144
> > present 0
> > managed 0
> > protection: (0, 0, 0, 0)
> >
> > min watermark for NORMAL zone on node 0
> > allocation initiated on node 0: 750 + 4096 = 4846
> > allocation initiated on node 1: 750 + 0 = 750
> >
> > This watermark difference could cause too many numa_miss allocations
> > in some situations and performance could then degrade.
> >
> > Recently, there was a regression report about this problem with the CMA
> > patches, since CMA memory is placed in ZONE_MOVABLE by those patches. I
> > checked that the problem disappears with this fix that uses high_zoneidx
> > for classzone_idx.
> >
> > http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> >
> > Using high_zoneidx for classzone_idx is a more consistent approach than the
> > previous one because the system's memory layout does not affect it. With
> > this patch, both classzone_idx values in the above example will be 3, so
> > both allocations will have the same min watermark.
> >
> > allocation initiated on node 0: 750 + 4096 = 4846
> > allocation initiated on node 1: 750 + 4096 = 4846
> >
>
> Alternatively, I assume that this could also be fixed by changing the
> value of the lowmem protection on the node without managed pages in the
> upper zone to be the max protection from the lowest zones? In your
> example, node 1 ZONE_NORMAL would then be (0, 0, 0, 4096).

No. If lowmem_reserve of node 0 ZONE_NORMAL were (0, 0, 4096, 4096), the min
watermark of an allocation initiated on node 1 (classzone_idx 2) would be
750 + 4096 when the allocation is tried on node 0 ZONE_NORMAL, and the issue
would be gone. So, I think that it cannot be fixed by your alternative.
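
To make the arithmetic concrete, the watermark check in __zone_watermark_ok()
is essentially of the following shape (a heavily simplified sketch, ignoring
order > 0, alloc_flags and reserve handling; only the lowmem_reserve indexing
matters for this discussion):

	/* Simplified: does this zone pass the watermark for the request? */
	static bool watermark_ok_sketch(struct zone *z, unsigned long mark,
					int classzone_idx, long free_pages)
	{
		/*
		 * The effective bar is the watermark plus the lowmem reserve
		 * indexed by the request's classzone_idx.  For node 0
		 * ZONE_NORMAL above this is 750 + 4096 with classzone_idx 3
		 * but 750 + 0 with classzone_idx 2.
		 */
		return free_pages > mark + z->lowmem_reserve[classzone_idx];
	}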

> > One could wonder if there is a side effect that an allocation initiated on
> > node 1 will use a higher bar when the allocation is handled on node 1, since
> > classzone_idx could be higher than before. It will not happen because a zone
> > without managed pages doesn't contribute to lowmem_reserve at all.
> >
> > Reported-by: Ye Xiaolong <xiaolong.ye@xxxxxxxxx>
> > Tested-by: Ye Xiaolong <xiaolong.ye@xxxxxxxxx>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
>
> Curious: is this only an issue when vm.numa_zonelist_order is set to Node?

Do you mean "/proc/sys/vm/numa_zonelist_order"? It looks like it's gone now.

Thanks.