On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 11.04.24 17:32, Zi Yan wrote:
From: Zi Yan <ziy@xxxxxxxxxx>
In __folio_remove_rmap(), a large folio is added to deferred split list
if any page in a folio loses its final mapping. It is possible that
the folio is unmapped fully, but it is unnecessary to add the folio
to deferred split list at all. Fix it by checking folio mapcount before
adding a folio to deferred split list.
Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
---
mm/rmap.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..d599a772e282 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
enum rmap_level level)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
- int last, nr = 0, nr_pmdmapped = 0;
+ int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
enum node_stat_item idx;
__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
break;
}
- atomic_sub(nr_pages, &folio->_large_mapcount);
+ mapcount = atomic_sub_return(nr_pages,
+ &folio->_large_mapcount) + 1;
That becomes a new memory barrier on some archs. Rather just re-read it
below. Re-reading should be fine here.
do {
last = atomic_add_negative(-1, &page->_mapcount);
if (last) {
@@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
* is still mapped.
*/
if (folio_test_large(folio) && folio_test_anon(folio))
- if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
+ if ((level == RMAP_LEVEL_PTE &&
+ mapcount != 0) ||
+ (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
deferred_split_folio(folio);
}
But I do wonder if we really care? Usually the folio will simply get
freed afterwards, where we simply remove it from the list.
If it's pinned, we won't be able to free or reclaim, but it's rather a
corner case ...
Is it really worth the added code? Not convinced.
It is actually not only an optimization, but also fixed the broken
thp_deferred_split_page counter in /proc/vmstat.
The counter actually counted the partially unmapped huge pages (so
they are on deferred split queue), but it counts the fully unmapped
mTHP as well now. For example, when a 64K THP is fully unmapped, the
thp_deferred_split_page is not supposed to get inc'ed, but it does
now.
The counter is also useful for performance analysis, for example,
whether a workload did a lot of partial unmap or not. So fixing the
counter seems worthy. Zi Yan should have mentioned this in the commit
log.