Re: [PATCH v3 00/22] Add support for shared PTEs across processes

From: Pedro Falcato

Date: Thu Feb 26 2026 - 16:22:49 EST


On Wed, Feb 25, 2026 at 03:06:10PM -0800, Kalesh Singh wrote:
> On Tue, Feb 24, 2026 at 1:40 AM David Hildenbrand (Arm)
> <david@xxxxxxxxxx> wrote:
> >
> > > I believe that managing a pseudo-filesystem (msharefs) and mapping via
> > > ioctl during process creation could introduce overhead that impacts
> > > app startup latency. Ideally, child apps shouldn't be aware of this
> > > sharing or need to manage the pseudo-filesystem on their end.
> > All processes must be aware of these special semantics.
> >
> > I'd assume that fork() would simply replicate mshare region into the
> > fork'ed child process. So from that point of view, it's "transparent" as
> > in "no special mshare() handling required after fork".
>
> Hi David,
>
> That's a good point. If fork() simply replicates the mshare region, it
> does achieve transparency in terms of setup.
>
> I am still concerned about transparency in terms of observability.
> Applications sometimes inspect their own mappings (from
> /proc/self/maps) to locate specific code or data regions for various
> anti-tamper and obfuscation techniques. [2] If those mappings suddenly
> point to an msharefs pseudo-file instead of the expected shared
> library backing, it may break user-space assumptions and cause
> compatibility issues.
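
For context, the kind of self-inspection described above usually boils down
to parsing /proc/self/maps lines and checking the backing path. A minimal
userspace sketch follows, assuming the usual "start-end perms offset dev
inode path" line layout. The names maps_entry, parse_maps_line(),
entry_contains() and backing_matches() are made up for illustration and come
from no real library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: one parsed line of /proc/<pid>/maps. */
struct maps_entry {
    unsigned long long start;
    unsigned long long end;
    unsigned long long offset;
    char perms[5];      /* e.g. "r-xp" */
    char path[256];     /* empty for anonymous mappings */
};

/* Parse one maps line. Returns 0 on success, -1 if the line is malformed. */
static int parse_maps_line(const char *line, struct maps_entry *out)
{
    int n;

    assert(line != NULL && out != NULL);
    out->path[0] = '\0';
    /* dev and inode are skipped; the path field is optional. */
    n = sscanf(line, "%llx-%llx %4s %llx %*s %*s %255[^\n]",
               &out->start, &out->end, out->perms, &out->offset, out->path);
    return n >= 4 ? 0 : -1;
}

/* Does the mapping cover addr? The end address is exclusive. */
static int entry_contains(const struct maps_entry *e, unsigned long long addr)
{
    return addr >= e->start && addr < e->end;
}

/* Does the backing path start with the prefix the application expects? */
static int backing_matches(const struct maps_entry *e, const char *prefix)
{
    return strncmp(e->path, prefix, strlen(prefix)) == 0;
}
```

A checker built this way would feed each line from fopen("/proc/self/maps")
through parse_maps_line() and fail as soon as backing_matches() returned 0
for the region it cares about, which is the breakage being described.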

I'm not worried about transparency because this is not supposed to be
transparent. This is not supposed to be used by most core system software.
This is supposed to help replace hugetlb page table sharing.

Transparent page table sharing is a different problem. I like the idea in
theory, but a number of constraints make it infeasible for now. There are
several problems we would need to solve first:

1) Every spot where we modify PTEs needs to be assessed and switched to
different helpers (ones that can un-COW page tables). Every
pte_offset_map_lock() can now feasibly fail for OOM reasons (and that also
needs to be assessed).

2) Various bits of PTE modification/unmapping now need special care wrt TLB
invalidation. The kernel needs to be aware of how the page tables are shared.
I don't think the current rmap data structures are well suited to this kind
of stuff (perhaps with Lorenzo's WIP anon rmap rework we'll get something
better). Basically every spot that goes "modify PTE, flush TLB for mm" now
needs to go "modify PTE, for every mm that maps this page table, flush $mm"
(if you're thinking that COW will save us, it technically won't, or shouldn't,
because of stuff like try_to_unmap_one() that is used in reclaim).

3) Reclaim loses even more information as now N processes share the same A
bits. I don't know what effects this can cause. It would require
experimentation. Perhaps something like "if page table is shared, value
pte_young more". I don't know if this can work as a bandaid, but it's not
ideal.

4) It's not known whether page table COW fork() is a real win in most cases,
or all cases. That would need measurement.

5) It becomes even harder to estimate RSS and PSS for each process.
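
To make point 2 concrete, here is a toy userspace model of the "modify PTE,
for every mm that maps this page table, flush $mm" pattern. It is only a
sketch: mm_model, shared_pt_model, pt_add_sharer(), flush_tlb_model() and
pt_set_pte() are hypothetical names, not kernel APIs, and a real
implementation would also need locking and a scalable reverse mapping from
page table to mms.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SHARERS 8
#define NR_PTES     4

/* Hypothetical stand-in for an address space; not the kernel's mm_struct. */
struct mm_model {
    unsigned long tlb_flushes;  /* flushes issued against this mm */
};

/* Hypothetical shared page table that tracks every mm mapping it. */
struct shared_pt_model {
    unsigned long pte[NR_PTES];
    struct mm_model *sharers[MAX_SHARERS];
    size_t nr_sharers;
};

/* Record that mm maps this page table. Returns -1 when the list is full. */
static int pt_add_sharer(struct shared_pt_model *pt, struct mm_model *mm)
{
    if (pt->nr_sharers == MAX_SHARERS)
        return -1;
    pt->sharers[pt->nr_sharers++] = mm;
    return 0;
}

/* Model of a per-mm TLB flush: only counts how often it is called. */
static void flush_tlb_model(struct mm_model *mm)
{
    mm->tlb_flushes++;
}

/*
 * Modify one PTE, then flush every mm that maps the table, not just the
 * mm that triggered the change. Returns the number of mms flushed.
 */
static size_t pt_set_pte(struct shared_pt_model *pt, size_t idx,
                         unsigned long val)
{
    size_t i;

    assert(idx < NR_PTES);
    pt->pte[idx] = val;
    for (i = 0; i < pt->nr_sharers; i++)
        flush_tlb_model(pt->sharers[i]);
    return pt->nr_sharers;
}
```

The cost of a single PTE update here grows with the number of sharers, and
the sharer list itself is exactly the information I don't think the current
rmap structures give us cheaply.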

For these reasons (and more, certainly), I don't think working mshare() into
a transparent, all-great thing that fits the zygote model can work. It has been
discussed at length how to pull off certain hard bits like TLB invalidation and
locking for mshare, and with mshare we have the advantage of not needing to
support every feature ever (tailoring it more to the big database users of
hugetlb). And we'll still need to adapt certain bits of arch code just to get
it to work efficiently.

That said, if you want to discuss pulling this off, I'm all ears and it could
perhaps be a fun discussion (too late for LSF, I guess), but I don't think
it's workable into the current mshare efforts. And, believe me, I would love
a unified feature here :)

--
Pedro