Re: [LSF/MM/BPF TOPIC] Per-process page size

From: Dev Jain

Date: Wed Feb 18 2026 - 03:42:07 EST



On 17/02/26 8:52 pm, Matthew Wilcox wrote:
> On Tue, Feb 17, 2026 at 08:20:26PM +0530, Dev Jain wrote:
>> 2. Generic Linux MM enlightenment
>> ---------------------------------
>> We enlighten the Linux MM code to always hand out memory in the granularity
> Please don't use the term "enlighten". That's used to describe
> something or other with hypervisors. Come up with a new term or use one
> that already exists.

Sure.

>
>> File memory
>> -----------
>> For a growing list of compliant file systems, large folios can already be
>> stored in the page cache. There is even a mechanism, introduced to support
>> filesystems with block sizes larger than the system page size, to set a
>> hard-minimum size for folios on a per-address-space basis. This mechanism
>> will be reused and extended to service the per-process page size requirements.
>>
>> One key reason that the 64K kernel currently consumes considerably more memory
>> than the 4K kernel is that Linux systems often have lots of small
>> configuration files which each require a page in the page cache. But these
>> small files are (likely) only used by certain processes. So, we prefer to
>> continue to cache those using a 4K page.
>> Therefore, if a process with a larger page size maps a file whose pagecache
>> contains smaller folios, we drop them and re-read the range with a folio
>> order at least that of the process order.
> That's going to be messy. I don't have a good idea for solving this
> problem, but the page cache really isn't set up to change minimum folio
> order while the inode is in use.

Holding mapping->invalidate_lock, bumping mapping->min_folio_order and
dropping-and-rereading the range suffers from a race: filemap_fault, operating
on some other, partially populated 64K range, may observe in filemap_get_folio
that its index is not in the pagecache. It then reads the updated min_order
in __filemap_get_folio and calls filemap_add_folio to add a 64K folio, but since
the 64K range is partially populated, filemap_add_folio keeps failing with
-EEXIST and we get stuck in an infinite loop.

So I figured that deleting the entire pagecache is simpler. We will also bail
out early in __filemap_add_folio if the folio order the caller asked us to
create is less than mapping_min_folio_order; eventually the caller will
re-read the correct min order. This algorithm avoids the race above, however...

my assumption here was that we are synchronized on mapping->invalidate_lock.
The kerneldoc above read_cache_folio() and some other comments convinced me
of that, but I just checked with a
VM_WARN_ON(!rwsem_is_locked(&mapping->invalidate_lock)) in
__filemap_add_folio and this doesn't seem to hold on all code paths...
If the algorithm sounds reasonable, I wonder what the correct synchronization
mechanism here would be.

>
>> - Are there other arches which could benefit from this?
> Some architectures walk the page tables entirely in software, but on the
> other hand, those tend to be, er, "legacy" architectures these days and
> it's doubtful that anybody would invest in adding support.
>
> Sounds like a good question for Arnd ;-)
>
>> - What level of compatibility we can achieve - is it even possible to
>> contain userspace within the emulated ABI?
>> - Rough edges of compatibility layer - pfnmaps, ksm, procfs, etc. For
>> example, what happens when a 64K process opens a procfs file of
>> a 4K process?
>> - native pgtable implementation - perhaps inspiration can be taken
>> from other arches with an involved pgtable logic (ppc, s390)?
> I question who decides what page size a particular process will use.
> The programmer? The sysadmin? It seems too disruptive for the kernel
> to monitor and decide for the app what page size it will use.

It's the sysadmin. Having the kernel monitor and decide, as you mention, is
similar to the problem of the kernel choosing the correct mTHP order, for
which we don't yet have an elegant solution either.