Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes

From: Anthony Yznaga
Date: Tue Oct 15 2024 - 21:02:42 EST



On 10/14/24 1:07 PM, Jann Horn wrote:
> On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@xxxxxxxxxx> wrote:
>> One major issue to address for this series to function correctly
>> is how to ensure proper TLB flushing when a page in a shared
>> region is unmapped. For example, since the rmaps for pages in a
>> shared region map back to host vmas which point to a host mm, TLB
>> flushes won't be directed to the CPUs the sharing processes have
>> run on. I am by no means an expert in this area. One idea is to
>> install a mmu_notifier on the host mm that can gather the necessary
>> data and do flushes similar to the batch flushing.
> The mmu_notifier API has two ways you can use it:
>
> First, there is the classic mode, where before you start modifying
> PTEs in some range, you remove mirrored PTEs from some other context,
> and until you're done with your PTE modification, you don't allow
> creation of new mirrored PTEs. This is intended for cases where
> individual PTE entries are copied over to some other context (such as
> EPT tables for virtualization). When I last looked at that code, it
> looked fine, and this is what KVM uses. But it probably doesn't match
> your usecase, since you wouldn't want removal of a single page to
> cause the entire page table containing it to be temporarily unmapped
> from the processes that use it?

No, definitely don't want to do that. :-)
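For reference, the classic pattern looks roughly like the sketch below. This is only an illustration of why that mode doesn't fit mshare, not working code; the mshare_unmap_mirrored()/mshare_allow_faults() helpers are hypothetical, and only the mmu_notifier_ops callbacks are real kernel API:

```c
#include <linux/mmu_notifier.h>

/*
 * Classic mmu_notifier pattern: mirrored PTEs are torn down in
 * invalidate_range_start() and must not be recreated until
 * invalidate_range_end(). Applied to mshare, a single-page
 * invalidation would temporarily unmap the shared page tables from
 * every sharing process, which is exactly what we want to avoid.
 */
static int mshare_invalidate_range_start(struct mmu_notifier *mn,
		const struct mmu_notifier_range *range)
{
	/* Drop everything mirroring [range->start, range->end) and
	 * block new mirror faults until the matching ..._end(). */
	mshare_unmap_mirrored(mn, range->start, range->end);
	return 0;
}

static void mshare_invalidate_range_end(struct mmu_notifier *mn,
		const struct mmu_notifier_range *range)
{
	/* Allow mirror faults to repopulate the range again. */
	mshare_allow_faults(mn);
}

static const struct mmu_notifier_ops mshare_classic_ops = {
	.invalidate_range_start	= mshare_invalidate_range_start,
	.invalidate_range_end	= mshare_invalidate_range_end,
};
```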



> Second, there is a newer mode for IOMMUv2 stuff (using the
> mmu_notifier_ops::invalidate_range callback), where the idea is that
> you have secondary MMUs that share the normal page tables, and so you
> basically send them invalidations at the same time you invalidate the
> primary MMU for the process. I think that's the right fit for this
> usecase; however, last I looked, this code was extremely broken (see
> https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@xxxxxxxxxxxxxx/
> for context). Unless that's changed in the meantime, I think someone
> would have to fix that code before it can be relied on for new
> usecases.

Thank you for this background! Looks like there have since been some
changes to the mmu notifiers, and the invalidate_range callback became
arch_invalidate_secondary_tlbs. I'm currently looking into using it to
flush all TLBs.
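A rough sketch of what I have in mind follows. The names are placeholders, and flush_tlb_all() is the bluntest possible hammer; a real version would instead gather a cpumask of the CPUs the sharing processes have run on and do a ranged flush over that mask:

```c
#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>

/*
 * Sketch: hook the msharefs host mm so that whenever its secondary
 * TLBs must be invalidated, the TLB entries installed by processes
 * sharing the page tables are flushed as well. flush_tlb_all() stands
 * in for a ranged flush over a gathered cpumask.
 */
static void mshare_invalidate_secondary_tlbs(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start, unsigned long end)
{
	flush_tlb_all();
}

static const struct mmu_notifier_ops mshare_tlb_ops = {
	.arch_invalidate_secondary_tlbs = mshare_invalidate_secondary_tlbs,
};

/* Registered once against the host mm, e.g. at region creation: */
/*	mmu_notifier_register(&mshare_notifier, host_mm);	*/
```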


Anthony