[RFC 5/6] mm: decouple PG_dirty from MADV_FREE

From: Minchan Kim
Date: Wed Jun 03 2015 - 02:16:14 EST


Basically, MADV_FREE relies on the dirty bit in the page table entry
to decide whether the VM is allowed to discard the page. IOW, if the
page table entry has the dirty bit set, the VM must not discard the
page.
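
For reference, here is a minimal userspace sketch of how the hint is
meant to be used (a hypothetical example, not part of this patch; it
assumes a kernel and libc that expose MADV_FREE):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;
	long pagesz = sysconf(_SC_PAGESIZE);
	char *buf = aligned_alloc(pagesz, len);

	memset(buf, 'a', len);		/* writes set the pte dirty bits */
	madvise(buf, len, MADV_FREE);	/* hint: contents may be discarded */

	/*
	 * Under memory pressure the kernel may now discard these pages
	 * instead of swapping them out.  A later write marks the pte
	 * dirty again, so that page is kept.
	 */
	buf[0] = 'b';
	return 0;
}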

However, if a page is swapped in by a read fault, for example, its
page table entry does not have the dirty bit set, so MADV_FREE could
wrongly discard the page.

To avoid that problem, MADV_FREE made additional checks on PageDirty
and PageSwapCache. That worked because a swapped-in page lives in the
swap cache, and once it has been removed from the swap cache it
carries the PG_dirty flag. So checking both page flags effectively
prevented wrong discarding by MADV_FREE.

However, the problem with the above logic is that a swapped-in page
keeps PG_dirty once it has been removed from the swap cache, so the VM
can never again consider those pages freeable, even if madvise(MADV_FREE)
is called on them later. See the example below for details.

ptr = malloc(len);
memset(ptr, 'a', len);
..
..
.. heavy memory pressure, so all of the pages are swapped out
..
..
var = *ptr; -> a page is swapped in and removed from the swap cache.
               The page table entry is not marked dirty, but the page
               descriptor has PG_dirty set.
..
..
madvise(ptr, len, MADV_FREE);
..
..
..
.. heavy memory pressure again.
.. This time, the VM cannot discard the page because the page
.. has *PG_dirty* set.

So, rather than relying on PG_dirty in the page descriptor to decide
whether a page may be discarded, checking the dirty bit in the page
table entry is more straightforward and simple.

Now, every anonymous page handling path (e.g. anon/swap/CoW fault
handling, KSM, THP, migration) takes care to preserve the pte dirty
bit, so we no longer need to check PG_dirty to identify MADV_FREE
hinted pages. This patch therefore removes the PageDirty check.
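
As a rough illustration, the idea in those paths is something like the
following (a simplified, hypothetical sketch, not the actual diff from
earlier patches in this series; "was_dirty" is a made-up placeholder):

/*
 * When installing a new pte for an anonymous page (swap-in, CoW copy,
 * migration target, ...), carry the dirty state forward in the pte
 * itself, so reclaim only has to look at the pte dirty bit.
 */
pte_t pte = mk_pte(page, vma->vm_page_prot);

if (was_dirty)		/* the old pte or the page was dirty */
	pte = pte_mkdirty(pte);
set_pte_at(mm, address, ptep, pte);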

This removes the complicated logic, simplifies the freeable-page
check, and solves the problem described above.

Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
mm/rmap.c | 2 +-
mm/vmscan.c | 3 +--
2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 9c045940ed10..a2e4f64c392e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1280,7 +1280,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

if (flags & TTU_FREE) {
VM_BUG_ON_PAGE(PageSwapCache(page), page);
- if (!dirty && !PageDirty(page)) {
+ if (!dirty) {
/* It's a freeable page by MADV_FREE */
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 37e90db1520b..c5fbb7c64deb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_KEEP;
}

- if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
- !PageDirty(page))
+ if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
*freeable = true;

/* Reclaim if clean, defer dirty pages to writeback */
--
1.9.1
