Hi David,
On 26/02/2024 17:41, Ryan Roberts wrote:
On 22/02/2024 10:20, David Hildenbrand wrote:
On 22.02.24 11:19, David Hildenbrand wrote:
On 25.10.23 16:45, Ryan Roberts wrote:
As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, remove the CLUSTER_FLAG_HUGE
flag, which, when present, always implies PMD-sized THP (the same size
as the cluster).
The only use of the flag was in swap_page_trans_huge_swapped(), to
determine whether a swap entry refers to a single page or to a
PMD-sized THP. Instead of relying on the flag, we now pass in nr_pages,
which is derived from the folio's page count. This allows the logic to
work for folios of any order.
The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio available. But that call was only a
shortcut to avoid calling __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be equivalent.
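To illustrate the shape of the change, here is a simplified sketch of the
swapped check once it is parameterized by nr_pages (the helpers are the ones
in mm/swapfile.c, but this is illustrative only and may not match the actual
diff exactly):

/*
 * Sketch: return true if any of the nr_pages swap entries backing the
 * folio still have a swap count, i.e. a reference other than the swap
 * cache. nr_pages comes from the folio, so any order works, not just
 * PMD-size.
 */
static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
                                         swp_entry_t entry,
                                         unsigned int nr_pages)
{
        unsigned long roffset = swp_offset(entry);
        unsigned long offset = round_down(roffset, nr_pages);
        unsigned char *map = si->swap_map;
        struct swap_cluster_info *ci;
        bool ret = false;
        int i;

        ci = lock_cluster_or_swap_info(si, offset);
        if (!ci || nr_pages == 1) {
                if (swap_count(map[roffset]))
                        ret = true;
                goto unlock_out;
        }
        for (i = 0; i < nr_pages; i++) {
                if (swap_count(map[offset + i])) {
                        ret = true;
                        break;
                }
        }
unlock_out:
        unlock_cluster_or_swap_info(si, ci);
        return ret;
}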
That is the __try_to_reclaim_swap() -> folio_free_swap() ->
folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
The "difference" is that you will now (1) get another temporary
reference on the folio and (2) (try)lock the folio every time you
discard a single PTE of a (possibly) large THP.
Thinking about it, your change will not only affect THP, but any call to
free_swap_and_cache().
Likely that's not what we want. :/
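Concretely, the call site in question is the one in free_swap_and_cache().
A simplified sketch of the pre-change logic (details may differ slightly from
the actual mm/swapfile.c):

/*
 * Sketch of the pre-change call site. With the
 * swap_page_trans_huge_swapped() check removed,
 * __try_to_reclaim_swap() - and hence the folio lookup and trylock -
 * runs for every PTE whose last swap-map reference is dropped while
 * the entry still sits in the swap cache.
 */
int free_swap_and_cache(swp_entry_t entry)
{
        struct swap_info_struct *p;
        unsigned char count;

        if (non_swap_entry(entry))
                return 1;

        p = _swap_info_get(entry);
        if (p) {
                count = __swap_entry_free(p, entry);
                if (count == SWAP_HAS_CACHE &&
                    !swap_page_trans_huge_swapped(p, entry))
                        __try_to_reclaim_swap(p, swp_offset(entry),
                                              TRY_RECLAIM_UNMAP);
        }
        return p != NULL;
}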
Is folio_trylock() really that expensive, given that the code path is already
taking multiple spinlocks and that we wouldn't expect the folio lock to be very
contended?
I guess filemap_get_folio() could be a bit more expensive, but again, is this
really a deal-breaker?
I'm just trying to re-familiarize myself with this series, but I think I ended
up allocating a cluster per CPU, per order. So one potential solution would be
to turn the flag into a size and store it in the cluster info. (In fact, I
think I was doing that in an early version of this series - I'll have to look
at why I got rid of it.) Then we could avoid needing to figure out nr_pages
from the folio.
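Just to sketch the idea (hypothetical only, not a real patch), the order could
live alongside the existing bits in struct swap_cluster_info, e.g.:

/*
 * Hypothetical sketch: store the allocation order of the entries in
 * the cluster instead of a single CLUSTER_FLAG_HUGE bit. The real
 * struct packs data:24/flags:8; one option is to carve a few bits out
 * of flags for the order.
 */
struct swap_cluster_info {
        spinlock_t lock;        /* protects data and flags below */
        unsigned int data:24;   /* free count or next-cluster index */
        unsigned int flags:4;   /* remaining CLUSTER_FLAG_* bits */
        unsigned int order:4;   /* order of entries allocated from this cluster */
};

Then the swapped check could read the order from the cluster instead of
deriving nr_pages from the folio.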
I ran some microbenchmarks to see if these extra operations cause a performance
issue - it all looks OK to me.
I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
mode. It prepares a 1G memory mapping by first paging it out with
MADV_PAGEOUT, then forks a child (and keeps that child alive) so that the swap
slots have 2 references, then measures the duration of munmap() on the entire
range in the parent. The idea is that free_swap_and_cache() is called for each
PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped() would
return true due to the child's references, so __try_to_reclaim_swap() was not
called. After my change, we no longer have this shortcut.
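For reference, the new mode boils down to something like the sketch below
(simplified: error handling and the repeated-run loop are omitted):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define SIZE    (1UL << 30)     /* 1G */

int main(void)
{
        struct timespec start, end;
        double secs;
        pid_t pid;
        char *mem;

        /* Populate a 1G anonymous mapping, then page it all out to swap. */
        mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(mem, 1, SIZE);
        madvise(mem, SIZE, MADV_PAGEOUT);

        /* Fork a child and keep it alive so every swap slot has 2 refs. */
        pid = fork();
        if (pid == 0) {
                pause();
                _exit(0);
        }

        /* Time munmap() in the parent; free_swap_and_cache() runs per PTE. */
        clock_gettime(CLOCK_MONOTONIC, &start);
        munmap(mem, SIZE);
        clock_gettime(CLOCK_MONOTONIC, &end);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("munmap: %f seconds\n", secs);

        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return 0;
}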
In both cases the results are within 1% (confirmed across multiple runs of 20
seconds each):
mm-stable: Average: 0.004997
+ change: Average: 0.005037
(These numbers are from an Ampere Altra. I also tested on an M2 VM - no
regression there either.)
Do you still have a concern about this change?