Re: [PATCH 0/2] don't use mapcount() to check large folio sharing

From: David Hildenbrand
Date: Fri Aug 04 2023 - 03:34:29 EST


On 04.08.23 02:17, Yin, Fengwei wrote:


On 8/4/2023 7:38 AM, Yu Zhao wrote:
On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:



On 8/4/2023 4:46 AM, Yu Zhao wrote:
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:

"

On 8/2/2023 8:49 PM, Ryan Roberts wrote:
On 02/08/2023 13:42, Yin, Fengwei wrote:


On 8/2/2023 8:40 PM, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote:


On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.

Use folio_estimated_sharers() here as the estimated number is enough.

Yin Fengwei (2):
madvise: don't use mapcount() against large folio for sharing check
madvise: don't use mapcount() against large folio for sharing check

mm/huge_memory.c | 2 +-
mm/madvise.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)


As a set of fixes, I agree this is definitely an improvement, so:

Reviewed-By: Ryan Roberts
Thanks.



But I have a couple of comments around further improvements;

Once we have the scheme that David is working on to be able to provide precise
exclusive vs shared info, we will probably want to move to that. Although that
scheme will need access to the mm_struct of a process known to be mapping the
folio. We have that info, but its not passed to folio_estimated_sharers() so we
can't just reimplement folio_estimated_sharers() - we will need to rework these
call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.

What you have is definitely an improvement over what was there before. And is
probably the best we can do without David's scheme. So I wouldn't delay this.
Just pointing out that we will be able to make it even better later on (if
David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.

I was suggesting the opposite - not waiting. Then we can do separate improvement
later.
Let's wait for David's work ready.

Waiting is fine as long as we don't miss the next merge window -- we
don't want these two bugs to get into another release. Also I think we
should cc stable, since as David mentioned, they have been causing
selftest failures.

Stable was CCed.

Need to add the "Cc: stable@xxxxxxxxxxxxxxx" tag:
Documentation/process/stable-kernel-rules.rst
OK. Thanks for clarification. I totally mis-understanded this. :).

I'd like to wait for answer from Andrew whether these patches are suitable
for stable (I suppose you think so) branch.

Note that the COW test does not fail -- it skips -- but the behavir changed:

$ ./cow
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
# [INFO] huge zeropage is enabled
TAP version 13
1..190
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
ok 2 No leak from parent into child
# [RUN] Basic COW after fork() ... with THP
ok 3 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out THP
ok 4 No leak from parent into child
# [RUN] Basic COW after fork() ... with PTE-mapped THP
ok 5 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with single PTE of THP
ok 7 No leak from parent into child
# [RUN] Basic COW after fork() ... with single PTE of swapped-out THP
ok 8 No leak from parent into child
# [RUN] Basic COW after fork() ... with partially mremap()'ed THP
ok 9 No leak from parent into child
# [RUN] Basic COW after fork() ... with partially shared THP
ok 10 No leak from parent into child
...

Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).

The code that broke it is

commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
Author: Vishal Moola (Oracle) <vishal.moola@xxxxxxxxx>
Date: Wed Dec 21 10:08:46 2022 -0800

madvise: convert madvise_cold_or_pageout_pte_range() to use folios
This change removes a number of calls to compound_head(), and saves
1729 bytes of kernel text.
Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@xxxxxxxxx
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@xxxxxxxxx>
Reviewed-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Cc: SeongJae Park <sj@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>


Ever since v6.3.

The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(),
conversion.


Probably all that is information worth having in the patch description.

--
Cheers,

David / dhildenb