Re: [PATCH v10 00/12] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%

From: David Hildenbrand
Date: Tue May 28 2024 - 04:42:15 EST


On 10.05.24 at 08:51, Byungchul Park wrote:
> Hi everyone,
>
> While working with a tiered memory system, e.g. CXL memory, I have
> been facing migration overhead, esp. tlb shootdown on promotion or
> demotion between different tiers. Yeah.. most tlb shootdowns on
> migration through hinting fault can be avoided thanks to Huang Ying's
> work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> is inaccessible"). See the following link for more information:
>
> https://lore.kernel.org/lkml/20231115025755.GA29979@xxxxxxxxxxxxxxxxxxx/
>
> However, it's only for migration through hinting fault. I thought it'd
> be much better if we had a general mechanism to reduce the number of
> tlb flushes, one that we can apply to any unmap code that we normally
> believe must be followed by a tlb flush.
>
> I'm suggesting a new mechanism, LUF (Lazy Unmap Flush), which defers
> tlb flush until folios that have been unmapped and freed eventually get
> allocated again. It's safe for folios that had been mapped read-only and
> were unmapped, since the contents of the folios don't change while staying
> in pcp or buddy, so we can still read the data through the stale tlb
> entries.
>
> tlb flush can be deferred when folios get unmapped, as long as the
> needed tlb flush is guaranteed to be performed before the folios
> actually become used again and, of course, only if none of the
> corresponding ptes have write permission. Otherwise, the system will
> get messed up.
>
> To achieve that:
>
> 1. For the folios that map only to non-writable tlb entries, prevent
>    tlb flush during unmapping but perform it just before the folios
>    actually become used, out of buddy or pcp.
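
To make the deferral described above concrete, here is a minimal user-space sketch in C. It is a toy model, not kernel code: `toy_page`, `toy_tlb_flush()`, `toy_unmap_and_free()`, `toy_alloc()` and `toy_demo()` are hypothetical names, and a single global flush stands in for whatever batching the series actually implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_page {
    bool free;            /* page sits in the (simulated) free pool */
    bool flush_deferred;  /* unmapped read-only, TLB flush not yet done */
};

static int toy_flush_count;   /* number of simulated TLB flushes */

/* A simulated full flush removes every stale entry, so it clears all deferrals. */
static void toy_tlb_flush(struct toy_page *pool, size_t n)
{
    toy_flush_count++;
    for (size_t i = 0; i < n; i++)
        pool[i].flush_deferred = false;
}

/* Unmap and free a page; defer the flush only if no PTE was writable. */
void toy_unmap_and_free(struct toy_page *pool, size_t n, size_t idx,
                        bool was_writable)
{
    if (was_writable)
        toy_tlb_flush(pool, n);            /* stale writable entry: flush now */
    else
        pool[idx].flush_deferred = true;   /* stale read-only entry: defer */
    pool[idx].free = true;
}

/* Allocate the first free page, flushing first if its flush was deferred. */
struct toy_page *toy_alloc(struct toy_page *pool, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].free)
            continue;
        if (pool[i].flush_deferred)
            toy_tlb_flush(pool, n);        /* must happen before reuse */
        pool[i].free = false;
        return &pool[i];
    }
    return NULL;
}

/* Unmap eight read-only pages, then allocate two; returns flushes performed. */
int toy_demo(void)
{
    struct toy_page pool[8] = {0};

    toy_flush_count = 0;
    for (size_t i = 0; i < 8; i++)
        toy_unmap_and_free(pool, 8, i, false);
    assert(toy_flush_count == 0);          /* nothing flushed at unmap time */

    toy_alloc(pool, 8);                    /* first reuse triggers the flush */
    toy_alloc(pool, 8);                    /* already covered by that flush */
    return toy_flush_count;
}
```

In this model `toy_demo()` performs a single flush for eight read-only unmaps, while a writable unmap still flushes immediately.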

Trying to understand the impact: Effectively, a CPU could still read data from a page that has already been freed, until that page gets reallocated.

The important part I can see is

1) PCP/buddy must not change page content (e.g., poison, init_on_free), otherwise an app might read wrong content.

2) If we mess up the flush-before-realloc, an app might observe data written by whoever allocated the page.

3) We must reliably detect+handle any read-only PTEs for which we haven't flushed the TLB yet, otherwise an app could see its memory writes getting lost. I recall that at least uffd-wp might defer TLB flushes (see comment in do_wp_page()). Not sure about other pte_wrprotect() callers that flush the TLB after processing multiple page tables, whereby rmap code might succeed in unmapping a page before the TLB flush happened.
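
Point 1) can be illustrated with a small user-space sketch in C. It shows one conceivable way to honor that condition, namely flushing before the free path modifies page content; this is an assumption for illustration, not what the series does. `toy_frame`, `toy_free_frame()` and `toy_free_demo()` are hypothetical names, and the `init_on_free` parameter merely stands in for the kernel option of that name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { TOY_FRAME_SIZE = 16 };

/* Toy model only; none of these names exist in the kernel. */
struct toy_frame {
    unsigned char data[TOY_FRAME_SIZE];
    bool flush_deferred;   /* a stale read-only TLB entry may still exist */
};

static int toy_frame_flushes;   /* number of simulated TLB flushes */

/*
 * Free a frame. If the free path is going to change the content
 * (modelled here as zeroing), the deferred flush has to happen first,
 * otherwise a reader going through the stale entry would see zeroes.
 */
void toy_free_frame(struct toy_frame *f, bool init_on_free)
{
    if (!init_on_free)
        return;                         /* content untouched: deferral stays valid */

    if (f->flush_deferred) {
        toy_frame_flushes++;            /* flush before modifying the page */
        f->flush_deferred = false;
    }
    memset(f->data, 0, sizeof(f->data));
}

/* Exercise both cases; returns the number of flushes that were needed. */
int toy_free_demo(void)
{
    struct toy_frame f;

    memset(f.data, 0xAB, sizeof(f.data));
    f.flush_deferred = true;
    toy_frame_flushes = 0;

    toy_free_frame(&f, false);
    assert(f.data[0] == 0xAB && f.flush_deferred && toy_frame_flushes == 0);

    toy_free_frame(&f, true);
    assert(f.data[0] == 0 && !f.flush_deferred);
    return toy_frame_flushes;
}
```

In this model the deferral only survives a free that leaves the content alone; as soon as the free path writes to the page, the flush can no longer be postponed.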

Any other possible issues you stumbled over that are worth mentioning?

--
Thanks,

David / dhildenb