Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86

From: Kalesh Singh

Date: Fri Feb 20 2026 - 14:33:42 EST


On Fri, Feb 20, 2026 at 8:30 AM David Hildenbrand (Arm)
<david@xxxxxxxxxx> wrote:
>
> On 2/20/26 13:07, Kiryl Shutsemau wrote:
> > On Fri, Feb 20, 2026 at 11:24:37AM +0100, David Hildenbrand (Arm) wrote:
> >>>
> >>> Just to clarify, do you want it to be enforced in the userspace ABI?
> >>> Like, all mappings being 64k aligned?
> >>
> >> Right, see the proposal from Dev on the list.
> >>
> >> From user-space POV, the pagesize would be 64K for these emulated processes.
> >> That is, VMAs must be suitably aligned, etc.
> >
> > Well, it will drastically limit the adoption. We have too much legacy
> > stuff on x86.
>
> I'd assume that many applications nowadays can deal with differing page
> sizes (thanks to some other architectures paving the way).
>
> But yes, some real legacy stuff, or stuff that only ever cared about
> Intel, still hardcodes pagesize=4k.

I think most issues will stem from linkers setting the default ELF
segment alignment (max-page-size) for x86 to 4096. Those ELFs will load
incorrectly, or not at all, at the larger emulated granularity.
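The loader-side constraint can be sketched as below. This is an
illustrative helper (the name `elf_segment_loadable` and the check are my
own, not loader code): a PT_LOAD segment's p_align has to be a multiple of
the system page size for mmap to honor the file-offset congruence, which is
exactly what breaks for p_align=4096 binaries on a 64K-page system.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check, not actual loader code: can a PT_LOAD segment with
 * alignment p_align be mapped on a system with the given page size?
 * Binaries linked with the x86 default -z max-page-size=0x1000 carry
 * p_align == 4096, which fails this check for a 64K page size.
 */
bool elf_segment_loadable(uint64_t p_align, uint64_t page_size)
{
	/* p_align of 0 or 1 means "no alignment constraint". */
	if (p_align <= 1)
		return true;
	/* Otherwise the segment alignment must cover the page size. */
	return p_align % page_size == 0;
}
```

Relinking with `-z max-page-size=0x10000` would make p_align pass the check
for a 64K granularity while remaining loadable on 4K systems.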

-- Kalesh

>
> In Meta's fleet, I'd be quite interested in how much conversion would
> have to be done.
>
> For legacy apps, you could still run them as 4k pagesize on the same
> system, of course.
>
> >
> >>>
> >>> Waste of memory for page tables is solvable and pretty straightforward.
> >>> Most such cases can be solved mechanically by switching to slab.
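[The slab idea above can be illustrated with a userspace toy (aligned_alloc
standing in for the page allocator; `carve_table` is an invented helper, not
kernel code): a page table only needs 4K of 4K-aligned storage, so sixteen
of them can be carved out of one 64K base page instead of each burning a
whole base page.]

```c
#include <stdint.h>
#include <stdlib.h>

#define BASE_PAGE 0x10000u	/* 64K base page */
#define PTE_TABLE 0x1000u	/* a page table still only needs 4K */

/*
 * Toy model: return the idx-th 4K, 4K-aligned page-table slot inside a
 * 64K base page, or NULL when the page is exhausted (16 slots).
 */
void *carve_table(void *base_page, unsigned int idx)
{
	if (idx >= BASE_PAGE / PTE_TABLE)
		return NULL;
	return (char *)base_page + (uintptr_t)idx * PTE_TABLE;
}
```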
> >>
> >> Well, yes, like Willy says, there are already similar custom solutions for
> >> s390x and ppc.
> >>
> >> Pasha talked recently about the memory waste of 16k kernel stacks and how we
> >> would want to reduce that to 4k. In your proposal, it would be 64k, unless
> >> you somehow manage to allocate multiple kernel stacks from the same 64k
> >> page. My head hurts thinking about whether that could work, maybe it could
> >> (no idea about guard pages in there, though).
> >
> > Kernel stacks are allocated from vmalloc. I think mapping them with
> > sub-page granularity should be doable.
>
> I still have to wrap my head around the sub-page mapping here as well.
> It's scary.
>
> Re mapcount: I think if any part of the page is mapped, it would be
> considered mapped -> mapcount += 1.
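[A toy sketch of that mapcount rule (illustrative only; `struct toy_page`
and the helpers are invented, and this models a single mapping rather than
real rmap accounting): the page-level mapcount changes only on the
first-sub-page-mapped and last-sub-page-unmapped transitions.]

```c
#include <stdbool.h>

#define SUBPAGES_PER_PAGE 16	/* e.g. 4K sub-pages in a 64K base page */

/* Toy per-page state for one mapping; not kernel code. */
struct toy_page {
	unsigned int subpage_mapped;	/* bitmask of mapped 4K sub-pages */
	int mapcount;			/* page-level mapcount */
};

/* Mapping any sub-page of a fully unmapped page bumps mapcount once. */
void toy_map_subpage(struct toy_page *p, int idx)
{
	if (!p->subpage_mapped)
		p->mapcount++;
	p->subpage_mapped |= 1u << idx;
}

/* Unmapping the last mapped sub-page drops the page-level mapcount. */
void toy_unmap_subpage(struct toy_page *p, int idx)
{
	p->subpage_mapped &= ~(1u << idx);
	if (!p->subpage_mapped)
		p->mapcount--;
}
```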
>
> >
> > BTW, do you see any reason why slab-allocated stacks wouldn't work for
> > large base page sizes? There's no requirement for them to be aligned to
> > a page or PTE, right?
>
> I'd assume that would work. The devil is in the details with these things
> before we have memdescs.
>
> E.g., page tables have a dedicated type (PGTY_table) and store separate
> metadata in the ptdesc. For kernel stacks there was once a proposal to
> have a type, but it is not upstream.
>
> >
> >> Let's take a look at the history of page size usage on Arm (people can feel
> >> free to correct me):
> >>
> >> (1) Most distros were using 64k on Arm.
> >>
> >> (2) People realized that 64k was suboptimal for many use cases (memory
> >> waste for stacks, pagecache, etc.) and started to switch to 4k. I
> >> remember that mostly HPC-centric users stuck to 64k, but there was
> >> also demand from others to be able to stay on 64k.
> >>
> >> (3) Arm improved performance on a 4k kernel by adding cont-pte support,
> >> trying to get closer to 64k native performance.
> >>
> >> (4) Achieving 64k native performance is hard, which is why per-process
> >> page sizes are being explored to get the best out of both worlds
> >> (use 64k page size only where it really matters for performance).
> >>
> >> Arm clearly has the added benefit of actually benefiting from hardware
> >> support for 64k.
> >>
> >> IIUC, what you are proposing feels a bit like traveling back in time when it
> >> comes to the memory waste problem that Arm users encountered.
> >>
> >> Where do you see the big difference to 64k on Arm in your proposal? Would
> >> you also run 64k Arm in production today, and would the memory waste etc.
> >> be acceptable?
> >
> > That's the point. I don't see a big difference to 64k Arm. I want to
> > bring this option to x86: at some machine size it makes sense to trade
> > memory consumption for scalability. I am targeting machines with
> > over 2TiB of RAM.
> >
> > BTW, we do run 64k Arm in our fleet. There are some growing pains, but it
> > looks good in general. We have no plans to switch to 4k (or 16k) at the
> > moment. 512M THPs also look good on some workloads.
>
> Okay, that's valuable information, thanks!
>
> Being able to remove the sub-page mapping part (or being able to just
> hide it somewhere deep down in arch code) would make this a lot easier
> to digest.
>
> --
> Cheers,
>
> David
>