Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

From: David Hildenbrand
Date: Wed Jan 31 2024 - 05:43:24 EST

In reply to: Michal Hocko: "Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP"

On 31.01.24 03:20, Yin Fengwei wrote:

On 1/29/24 22:32, David Hildenbrand wrote:

This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.
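
The batch-collection rule described above can be modelled in user space. The following is only a sketch, not the kernel's implementation: the Pte tuple, the single "flags" field standing in for "the other PTE bits", and the function name pte_batch_len() are all hypothetical, and a non-present PTE is modelled as None.

```python
from typing import NamedTuple, Optional, Sequence


class Pte(NamedTuple):
    """Hypothetical model of a present PTE: a page frame number plus flag bits."""
    pfn: int
    flags: int


def pte_batch_len(ptes: Sequence[Optional[Pte]], start: int,
                  folio_first_pfn: int, folio_nr_pages: int) -> int:
    """Count the PTEs from `start` that can be handled as one batch.

    A PTE extends the batch only if it is present, maps the next consecutive
    page, still lies inside the same large folio, and carries the same flag
    bits as the first PTE of the batch.
    """
    first = ptes[start]
    folio_end_pfn = folio_first_pfn + folio_nr_pages
    if first is None or not (folio_first_pfn <= first.pfn < folio_end_pfn):
        return 0

    nr = 1
    while start + nr < len(ptes):
        pte = ptes[start + nr]
        if pte is None:                   # hole in the mapping ends the batch
            break
        if pte.pfn != first.pfn + nr:     # pages must be consecutive
            break
        if pte.pfn >= folio_end_pfn:      # must stay within the same folio
            break
        if pte.flags != first.flags:      # other PTE bits must be compatible
            break
        nr += 1
    return nr
```

With a batch length in hand, the refcount adjustment, rmap call, PTE update and TLB entry removal can each be done once for `nr` pages instead of `nr` times.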

Ryan was previously working on this in the context of cont-pte for
arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

One possible scenario:
If every folio is a 2M folio, then one full batch could hold 510M of memory.
Is that too much, given that one full batch could previously hold only
(2M - 4096 * 2) of memory?
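
For reference, the two figures in the question can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions that are not spelled out in the thread: 4 KiB pages, 8-byte pointers, a 16-byte header at the start of each batch page, and a large-folio entry occupying two slots (the page plus its page count). All constant and function names are made up for illustration.

```python
# Assumed values, not taken from the thread.
PAGE_SIZE = 4096            # bytes per base page
POINTER_SIZE = 8            # bytes per slot in a batch page
BATCH_HEADER_SIZE = 16      # assumed bookkeeping header per batch page
SLOTS_PER_LARGE_ENTRY = 2   # assumed: page pointer + number of pages
FOLIO_SIZE = 2 * 1024 * 1024


def slots_per_batch() -> int:
    """Number of pointer-sized slots that fit in one batch page."""
    return (PAGE_SIZE - BATCH_HEADER_SIZE) // POINTER_SIZE


def order0_capacity() -> int:
    """Bytes covered by one full batch when every slot is an order-0 page."""
    return slots_per_batch() * PAGE_SIZE


def large_folio_capacity() -> int:
    """Bytes covered by one full batch when every entry is a whole 2M folio."""
    return (slots_per_batch() // SLOTS_PER_LARGE_ENTRY) * FOLIO_SIZE
```

Under these assumptions a batch has 510 slots, so it covers 510 * 4096 bytes = (2M - 4096 * 2) with order-0 pages, and 255 * 2M = 510M when every entry is a 2M folio.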

Good point, we do have CONFIG_INIT_ON_FREE_DEFAULT_ON. I don't remember if init_on_free or init_on_alloc was used in production systems. In tlb_batch_pages_flush(), there is a cond_resched() to limit the number of entries we process.

So if that is actually problematic, we'd run into a soft-lockup and need another cond_resched() [I have some faint recollection that people are working on removing cond_resched() completely].

One could do some counting in free_pages_and_swap_cache() (where we iterate all entries already) and insert cond_resched() + release_pages() after every N (e.g., 512) pages.
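
The counting idea could look roughly like the following user-space model. It is only a sketch of the control flow, not kernel code: entries are modelled as hypothetical (folio id, number of pages) tuples, and the `release` and `resched` callbacks merely stand in for release_pages() and cond_resched().

```python
from typing import Callable, List, Sequence, Tuple

Entry = Tuple[int, int]  # (hypothetical folio id, number of pages it covers)


def free_in_chunks(entries: Sequence[Entry],
                   max_pages: int = 512,
                   release: Callable[[List[Entry]], None] = lambda chunk: None,
                   resched: Callable[[], None] = lambda: None) -> List[int]:
    """Walk all entries once, releasing them whenever the running page count
    reaches `max_pages`, and offering a reschedule point after each release.

    Returns the number of pages handed to each release call.
    """
    released: List[int] = []
    chunk: List[Entry] = []
    pages_in_chunk = 0

    for entry in entries:
        chunk.append(entry)
        pages_in_chunk += entry[1]
        if pages_in_chunk >= max_pages:
            release(chunk)        # stand-in for release_pages()
            released.append(pages_in_chunk)
            resched()             # stand-in for cond_resched()
            chunk = []
            pages_in_chunk = 0

    if chunk:                     # release whatever is left over
        release(chunk)
        released.append(pages_in_chunk)
    return released
```

In this model a batch of 255 entries of 512 pages each is released in 255 steps of 512 pages with a reschedule point between them, while a batch of 510 order-0 entries stays below the limit and is still released in one go, so the order-0 path sees no additional reschedule points.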

--
Cheers,

David / dhildenb
