Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory

From: Mike Kravetz
Date: Tue Jul 14 2020 - 19:21:50 EST


On 7/10/20 5:09 AM, Barry Song wrote:
> Online nodes are not necessarily memory containing nodes. Splitting
> huge_cma in online nodes can lead to inconsistent hugetlb_cma size
> with user setting. For example, for one system with 4 numa nodes and
> only one of them has memory, if users set hugetlb_cma to 4GB, it will
> split into four 1GB. So only the node with memory will get 1GB CMA.
> All other three nodes get nothing. That means the whole system gets
> only 1GB CMA while users ask for 4GB.
>
> Thus, it is more sensible to split hugetlb_cma in nodes with memory.
> For the above case, the only node with memory will reserve 4GB cma
> which is same with user setting in bootargs. In order to split cma
> in nodes with memory, hugetlb_cma_reserve() should scan over those
> nodes with N_MEMORY state rather than N_ONLINE state. That means
> the function should be called only after arch code has finished
> setting the N_MEMORY state of nodes.
>
> The problem is always there if N_ONLINE != N_MEMORY. It is a general
> problem to all platforms. But there is some trivial difference among
> different architectures.
> For example, for ARM64, before hugetlb_cma_reserve() is called, all
> nodes have got N_ONLINE state. So hugetlb will get inconsistent cma
> size when some online nodes have no memory. For x86 case, the problem
> is hidden because X86 happens to just set N_ONLINE on the nodes with
> memory when hugetlb_cma_reserve() is called.
>
> Anyway, this patch moves to scan N_MEMORY in hugetlb_cma_reserve()
> and lets both x86 and ARM64 call the function after N_MEMORY state
> is ready. It also documents the requirement in the definition of
> hugetlb_cma_reserve().
>
> Cc: Roman Gushchin <guro@xxxxxx>
> Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
> Cc: H. Peter Anvin <hpa@xxxxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> Cc: Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>
> Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx>

I agree we should only be concerned with N_MEMORY nodes for the CMA
reservations. However, this patch got me thinking:
- Do we really have to initiate the CMA reservations from arch specific code?
- Can we move the call to reserve CMA a little later into hugetlb arch
independent code?

I know the cma_declare_contiguous_nid() routine says it should be called
from arch specific code. However, unless I am missing something that seems
mostly about timing.

What about a change like this on top of this patch?