Re: [PATCH v3 00/22] Add support for shared PTEs across processes

From: Kalesh Singh

Date: Fri Feb 27 2026 - 01:35:52 EST


On Thu, Feb 26, 2026 at 1:22 PM Pedro Falcato <pfalcato@xxxxxxx> wrote:
>
> On Wed, Feb 25, 2026 at 03:06:10PM -0800, Kalesh Singh wrote:
> > On Tue, Feb 24, 2026 at 1:40 AM David Hildenbrand (Arm)
> > <david@xxxxxxxxxx> wrote:
> > >
> > > > I believe that managing a pseudo-filesystem (msharefs) and mapping via
> > > > ioctl during process creation could introduce overhead that impacts
> > > > app startup latency. Ideally, child apps shouldn't be aware of this
> > > > sharing or need to manage the pseudo-filesystem on their end.
> > > All process must be aware of these special semantics.
> > >
> > > I'd assume that fork() would simply replicate mshare region into the
> > > fork'ed child process. So from that point of view, it's "transparent" as
> > > in "no special mshare() handling required after fork".
> >
> > Hi David,
> >
> > That's a good point. If fork() simply replicates the mshare region, it
> > does achieve transparency in terms of setup.
> >
> > I am still concerned about transparency in terms of observability.
> > Applications sometimes inspect their own mappings (from
> > /proc/self/maps) to locate specific code or data regions for various
> > anti-tamper and obfuscation techniques. [2] If those mappings suddenly
> > point to an msharefs pseudo-file instead of the expected shared
> > library backing, it may break user-space assumptions and cause
> > compatibility issues.
>
> I'm not worried about transparency because this is not supposed to be
> transparent. This is not supposed to be used by most core system software.
> This is supposed to help replace hugetlb page table sharing.
>

Hi Pedro,

Thanks for the detailed breakdown.

Firstly, let me state that my goal definitely isn't to derail or block
the current mshare efforts. I'm mostly trying to gather feedback on
what a "transparent" approach might actually look like.

> Transparent page table sharing has other constraints. I like the idea, in
> theory, but there are a number of constraints that make the idea unfeasible
> for now. There are a couple of problems we need to solve first:
>
> 1) Every spot where we modify PTEs needs to be assessed and use different
> helpers (that can un-cow page tables). Every pte_offset_map_lock() can now
> feasibly fail for OOM reasons (and that also needs to be assessed).
>

What if we strictly limit the scope to just read-only mappings being
shared? Would un-COWing still be necessary?

> 2) Various bits of PTE modification/unmapping now needs special care wrt TLB
> invalidation. The kernel needs to be aware of how the page tables are shared.
> I don't think the current rmap data structures are well suited to this kind
> of stuff (perhaps with Lorenzo's WIP anon rmap rework we'll get something
> better). Basically every spot that goes "modify PTE, flush TLB for mm" now
> needs to go "modify PTE, for every mm that maps this page table, flush $mm"
> (if you're thinking that COW will save us, it technically won't, or shouldn't,
> because of stuff like try_to_unmap_one() that is used in reclaim).

I think this bit might need to be architecture dependent. With shared
TLB partitioning on certain hardware, this becomes much less of an
issue. We could potentially gate this behind something like
CONFIG_ARCH_HAVE_SHARED_TLB_SUPPORT (or a similarly fitting name) so
only architectures that can handle the invalidation efficiently opt
in.
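A minimal sketch of what that opt-in could look like; the symbol name
and help text are my assumptions, not existing kernel Kconfig:

```
config ARCH_HAVE_SHARED_TLB_SUPPORT
	bool
	help
	  The architecture can efficiently invalidate TLB entries for
	  page tables shared across multiple mm_structs (e.g. via
	  shared TLB partitioning), avoiding a per-mm flush fan-out on
	  every PTE modification.
```

An architecture that qualifies would select this from its own Kconfig,
and the generic sharing paths would be compiled only when it is set.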

>
> 3) Reclaim loses even more information as now N processes share the same A
> bits. I don't know what effects this can cause. It would require
> experimentation. Perhaps something like "if page table is shared, value
> pte_young more". I don't know if this can work as a bandaid, but it's not
> ideal.

I agree this will require some experimentation. Intuitively, I would
expect these shared pages to naturally stay "hotter" since multiple
processes are accessing them concurrently, but we will definitely need
to experiment with the reclaim logic to see how it does in practice.

>
> 4) It's not known whether page table COW fork() is a real win in most cases,
> or all cases. Would want measurement.

Our preliminary data on Android shows this can save ~200MB or more on
mobile devices right after boot. On memory-constrained client devices,
that is a significant win.

>
> 5) It becomes even harder to estimate RSS and PSS for each process.

For PSS (PAGE_SIZE / mapcount), I can see that a single mapcount for
all the processes mapping the page through the shared page table would
skew the result. That said, PSS is already imperfect; I believe
processes can artificially lower their PSS by mapping the same file
multiple times.

For RSS, I'm not sure I see what blocks aggregating the counts across
the private and shared mm_structs.

>
> For these reasons (and more, certainly), I don't think working mshare() into
> a transparent, all-great thing that fits the zygote model can work. It has been
> discussed at length how to pull off certain hard bits like TLB invalidation and
> locking for mshare, and with mshare we have the advantage of not needing to
> support every feature ever (tailoring it more to the big database users of
> hugetlb). And we'll still need to adapt certain bits of arch code just to get
> it to work efficiently.
>
> This said, if you want to discuss pulling this off, I'm all ears and it could
> be perhaps a fun discussion (too late for LSF, I guess), but I don't think
> it's workeable into the current mshare efforts. And, believe me, I would love
> a unified feature here :)

I saw Anthony proposed an mshare topic for LSF/MM; I hope to be there
as well, and it would be great to chat about this in person.

Thanks,
Kalesh

>
> --
> Pedro