Re: [PATCH v3 00/22] Add support for shared PTEs across processes

From: Kalesh Singh

Date: Mon Feb 23 2026 - 12:43:25 EST


On Sat, Feb 21, 2026 at 4:40 AM Pedro Falcato <pfalcato@xxxxxxx> wrote:
>
> On Fri, Feb 20, 2026 at 01:35:58PM -0800, Kalesh Singh wrote:
> > On Tue, Aug 19, 2025 at 6:57 PM Anthony Yznaga
> > <anthony.yznaga@xxxxxxxxxx> wrote:
> > >
> > > Memory pages shared between processes require page table entries
> > > (PTEs) in each process. Each of these PTEs consumes some memory,
> > > and as long as the number of mappings being maintained is small
> > > enough, the space consumed by page tables is not objectionable.
> > > When very few memory pages are shared between processes, the
> > > number of PTEs to maintain is mostly constrained by the number of
> > > pages of memory on the system. As the number of shared pages and
> > > the number of times pages are shared go up, the amount of memory
> > > consumed by page tables starts to become significant. This issue
> > > does not apply to threads: any number of threads can share the
> > > same pages inside a process while sharing the same PTEs. Extending
> > > this same model to sharing pages across processes can eliminate
> > > the issue for cross-process sharing as well.
> > >
> > > <snip>
> > Hi Anthony,
> >
> > Thanks for continuing to push this forward, and apologies for joining
> > this discussion late. I am likely missing some context from the
> > various previous iterations of this feature, but I'd like to throw
> > another use case into the mix to be considered around the design of
> > the sharing API.
> >
> > We are exploring a similar optimization for Android to reduce page
> > table overhead. In Android, we preload many ELF mappings in the Zygote
> > process to help application launch times. Since the Zygote model is
> > fork-but-no-exec, all applications inherit these mappings, which can
> > result in upwards of 200 MB of redundant page table overhead per
> > device.
>
> This can be solved by simply not using the Zygote model :p Or perhaps by
> using MADV_DONTNEED on, or straight up unmapping, libraries you don't need
> on the child's side.

I think that's a separate topic, but that model is used on billions of
client devices :) The common runtime for apps and other core system
code is preloaded to significantly reduce app startup latencies.

>
> >
> > I believe that managing a pseudo-filesystem (msharefs) and mapping via
> > ioctl during process creation could introduce overhead that impacts
> > app startup latency. Ideally, child apps shouldn't be aware of this
> > sharing or need to manage the pseudo-filesystem on their end. To
> > achieve this "transparent" sharing, I would prefer Khalid's previous
> > API from his 2022 RFC [1]. By attaching the shared mm directly to the
> > file's address_space and exposing a MAP_SHARED_PT flag, child apps
> > could transparently inherit the shared page tables during fork().
>
> So, we've discussed this before. I initially liked this idea a lot more.
> However, there are a couple of problems here:
>
> 1) mshare (as in the mshare feature) isn't really aiming for transparent here.
> There is e.g. a specific need to set up an mshare region, with a few files/anon
> mappings there, and then later mprotect/munmap parts of the region - and have it
> apply to every process that has it mapped. This is why we're aiming for
> dedicated system calls (not ioctls anymore): doing munmap(mshare_reg, 4096) is
> ambiguous as to whether you want to unmap the mshare VMA, or a VMA inside the
> mshare mm.

Since we are interested in sharing text here, how does this play with
things like symbolization for call stacks? I believe this is another
reason why we might want to avoid mapping the pseudo mshare file
wrapper?

>
> 2) Sharing the page table at all (worse still, Transparently(tm)) is a huge
> pain. TLB shootdown becomes much harder, and rmap as-is isn't suited to deal
> with this case. The way things are going with mshare, the container mm will
> have a single entry in rmap, and then actually doing the shootdown is a
> huuuuge pain (which, fwiw, will probably need a per-mshare TLB workaround),
> because you need to find and shoot down _every_ mm that has these tables

I agree the TLB shootdowns would be a pain. Perhaps if there were a
concept of a shared ASID/PCID in the hardware, that would make things
less so ...

> mapped. And then, naturally, since you're sharing page tables, doing A/D bit
> collection on these becomes extremely useless - and that will naturally pose
> problems to the reclaim process if you abuse it.

I think in the use case I described, it would mostly be sharing
MAP_PRIVATE mappings, and the access bit should still apply for global
reclaim. However, I agree it becomes difficult to reason about,
especially if you throw memcgs into the mix.

Thanks,
Kalesh

>
> 3) other misc problems that make it hard to work transparently (VMA alignment,
> levels which you may or may not want to share, you need to revisit most page
> table walkers in the kernel to get a completely transparent feature, etc)
>
> >
> > Regarding David's and Matthew's discussion on VMA-modifying functions,
> > I would lean towards preferring the standard VMA-manipulating APIs
> > over custom ioctls, to preserve transparency for user-space.
> > Perhaps whether or not these modifications persist across all sharing
> > processes needs to be configurable? It seems that for database
> > workloads, having the updates reflected everywhere would be the
> > desired behavior. In the use case described for Android, we don't want
> > apps to be able to modify these shared ELF mappings. To handle this,
> > it's likely we would do something like mseal() the VMAs in the dynamic
> > loader before forking.
>
> mshare_mseal!
>
> >
> > Perhaps we could decouple the core sharing logic from the sharing API
> > itself? Since the sharing interface seems one of the main areas where
> > we don't have a good consensus yet, perhaps we could land the core
> > sharing logic first. Keeping the core infrastructure generic would
>
> I think the core infrastructure is relatively generic (at least the
> small core mm modifications to get this to even work) already, but
> perhaps Anthony can comment on that.
>
> --
> Pedro