Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.

From: David Hildenbrand
Date: Mon Aug 09 2021 - 03:41:52 EST

On 05.08.21 21:02, Zi Yan wrote:
From: Zi Yan <ziy@xxxxxxxxxx>

Hi all,

This patchset add support for kernel boot time adjustable MAX_ORDER, so that
user can change the largest size of pages obtained from buddy allocator. It also
removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that
buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.


This enables kernel to allocate 1GB pages and is necessary for my ongoing work
on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with
after some discussion with David Hildenbrand on what methods should be used for
allocating gigantic pages[2], since other approaches like using CMA allocator or
alloc_contig_pages() are regarded as suboptimal.

This also prevents increasing SECTION_SIZE_BITS when increasing MAX_ORDER, since
increasing SECTION_SIZE_BITS is not desirable as memory hotadd/hotremove chunk
size will be increased as well, causing memory management difficulty for VMs.

In addition, make MAX_ORDER a kernel boot time parameter can enable user to
adjust buddy allocator without recompiling the kernel for their own needs, so
that one can still have a small MAX_ORDER if he/she does not need to allocate
gigantic pages like 1GB PUD THPs.


At the moment, kernel imposes MAX_ORDER - 1 + PAGE_SHFIT < SECTION_SIZE_BITS
restriction. This prevents buddy allocator merging pages across memory sections,
as PFNs might not be contiguous and code like page++ would fail. But this would
not be an issue when SPARSEMEM_VMEMMAP is set, since all struct page are
virtually contiguous. In addition, as long as buddy allocator checks the PFN
validity during buddy page merging (done in Patch 3), pages allocated from
buddy allocator can be manipulated by code like page++.


I tested the patchset on both x86_64 and ARM64 at 4KB, 16KB, and 64KB base
pages. The systems boot and ltp mm test suite finished without issue. Also
memory hotplug worked on x86_64 when I tested. It definitely needs more tests
and reviews for other architectures.

In terms of the concerns on performance degradation if MAX_ORDER is increased,
I did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on
x86_64 machines and saw no performance difference[3].

Patch 1 excludes MAX_ORDER check from 32bit vdso compilation. The check uses
irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation. The
exclusion does not break the check in 32bit kernel, since the check will still
be performed during other kernel component compilation.

Patch 2 gives FORCE_MAX_ZONEORDER a better name.

Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
pages across memory sections. The check was removed when ARM64 gets rid of holes
in zones, but holes can appear in zones again after this patchset.

Patch 4-11 convert the use of MAX_ORDER to SECTION_SIZE_BITS or its derivative
constants, since these code places use MAX_ORDER as boundary check for
physically contiguous pages, where SECTION_SIZE_BITS should be used. After this
patchset, MAX_ORDER can go beyond SECTION_SIZE_BITS, the code can break.
I separate changes to different patches for easy review and can merge them into
a single one if that works better.

Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER
when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.

Patch 13 converts statically allocated arrays with MAX_ORDER length to dynamic
ones if possible and prepares for making MAX_ORDER a boot time parameter.

Patch 14 adds a new MIN_MAX_ORDER constant to replace soon-to-be-dynamic
MAX_ORDER for places where converting static array to dynamic one is causing
hassle and not necessary, i.e., ARM64 hypervisor page allocation and SLAB.

Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter.

Any suggestion and/or comment is welcome. Thanks.


1. Redo the performance comparison tests using this patchset to understand the
performance implication of changing MAX_ORDER.

2. Make alloc_contig_range() cleanly deal with pageblock_order instead of MAX_ORDER - 1 to not force the minimal CMA area size/alignment to be e.g., 1 GiB instead of 4 MiB and to keep virtio-mem working as expected.

virtio-mem short term would mean disallowing initialization when an incompatible setup (MAX_ORDER_NR_PAGE > SECTION_NR_PAGES) is detected and bailing out loud that the admin has to fix that on the command line. I have optimizing alloc_contig_range() on my todo list, to get rid of the MAX_ORDER -1 dependency in virtio-mem; but I have no idea when I'll have time to work on that.


David / dhildenb