Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
From: David Hildenbrand (Arm)
Date: Thu Feb 19 2026 - 10:56:19 EST
On 2/19/26 16:50, Kiryl Shutsemau wrote:
> On Thu, Feb 19, 2026 at 03:33:47PM +0000, Pedro Falcato wrote:
> > On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > > No, there's no new hardware (that I know of). I want to explore what
> > > page size means.
> > > The kernel uses the same value - PAGE_SIZE - for two things:
> > > - the order-0 buddy allocation size;
> > > - the granularity of virtual address space mapping.
> > > I think we can benefit from separating these two meanings and allowing
> > > order-0 allocations to be larger than the virtual address space covered
> > > by a PTE entry.
> > Doesn't this idea make less sense these days, with mTHP? You can get
> > larger allocations simply by toggling one of the entries in
> > /sys/kernel/mm/transparent_hugepage.
> mTHP is still best effort. This way you don't need to care about
> fragmentation: you will get your 64k page as long as you have free
> memory.
> > > The main motivation is scalability. Managing memory on multi-terabyte
> > > machines in 4k pages is suboptimal, to say the least.
> > > Potential benefits of the approach (assuming 64k pages):
> > > - The larger order-0 page size cuts struct page overhead by a factor
> > >   of 16, from ~1.6% of RAM to ~0.1%;
> > > - TLB wins on machines with TLB coalescing, as long as the mapping is
> > >   naturally aligned;
> > > - An order-5 allocation is 2M, resulting in less pressure on the zone
> > >   lock;
> > > - 1G pages are within reach for the buddy allocator as an order-14
> > >   allocation, which can open the road to 1G THPs;
> > > - As with THP, fewer pages mean less pressure on the LRU lock.
> > We could perhaps add a way to enforce a min_order globally on the page
> > cache, as a way to address it.
> Raising min_order is not free. It puts more pressure on the page
> allocator.
> > There are some points there which aren't addressed by mTHP work in any
> > way (1G THPs, for one), and others which are being addressed separately
> > (the memdesc work trying to cut down on struct page overhead).
> > (I also don't understand your point about order-5 allocations; AFAIK the
> > pcp lists will cache up to COSTLY_ORDER (3) and PMD order, but I'm
> > probably not seeing the full picture.)
> With a higher base page size, the page allocator doesn't need to do as
> much work to merge/split buddy pages. So serving the same 2M as an
> order-5 allocation is cheaper than as order-9.
I think the idea is that if most of your allocations (anon + pagecache) are
64k instead of 4k then, on average, you'll just naturally do less
merging/splitting.
--
Cheers,
David