Re: [PATCH -mm -v8 1/3] mm, THP, swap: Delay splitting THP during swap out

From: Johannes Weiner
Date: Mon Apr 17 2017 - 14:24:29 EST


On Sat, Apr 15, 2017 at 09:17:04AM +0800, Huang, Ying wrote:
> Hi, Johannes,
>
> Johannes Weiner <hannes@xxxxxxxxxxx> writes:
>
> > Hi Huang,
> >
> > I reviewed this patch based on the feedback I already provided, but
> > eventually gave up and rewrote it. Please take review feedback more
> > seriously in the future.
>
> Thanks a lot for your help! I do respect all your review and effort.
> The -v8 patch doesn't take all your comments, just because I thought we
> have not reach consensus for some points and I want to use -v8 patch to
> discuss them.
>
> One concern I have before is whether to split THP firstly when swap
> space or memcg swap is used up. Now I think your solution is
> acceptable. And if we receive any regression report for that in the
> future, it's not very hard to deal with.

If you look at get_scan_count(), we'll stop scanning anonymous pages
altogether when swap space runs out. So it might happen to a few THP,
but it shouldn't be a big deal.

And yes, in that case I'd really rather wait for any real problems to
materialize before we complicate things.

> > Attached below is the reworked patch. Most changes are to the layering
> > (page functions, cluster functions, range functions) so that we don't
> > make the lowest swap range code require a notion of huge pages, or
> > make the memcg page functions take size information that can be
> > gathered from the page itself. I turned the config symbol into a
> > generic THP_SWAP that can later be extended when we add 2MB IO. The
> > rest is function naming, #ifdef removal etc.
>
> For some #ifdef in swapfile.c, it is to avoid unnecessary code size
> increase for !CONFIG_TRANSPARENT_HUGEPAGE or platform with THP swap
> optimization disabled. Is it an issue?

It saves some code size, but it looks like the biggest cost comes from
bloating PageSwapCache(). This is mm/builtin.o with !CONFIG_THP:

add/remove: 1/0 grow/shrink: 34/5 up/down: 920/-311 (609)
function old new delta
__free_cluster - 106 +106
get_swap_pages 465 555 +90
migrate_page_move_mapping 1404 1479 +75
shrink_page_list 3573 3632 +59
__delete_from_swap_cache 235 293 +58
__test_set_page_writeback 626 678 +52
__swap_writepage 766 812 +46
try_to_unuse 1763 1795 +32
madvise_free_pte_range 882 912 +30
__set_page_dirty_nobuffers 245 268 +23
migrate_page_copy 565 587 +22
swap_slot_free_notify 133 151 +18
shmem_replace_page 616 633 +17
try_to_free_swap 135 151 +16
test_clear_page_writeback 512 528 +16
swap_set_page_dirty 109 125 +16
swap_readpage 384 400 +16
shmem_unuse 1535 1551 +16
reuse_swap_page 340 356 +16
page_mapping 144 160 +16
migrate_huge_page_move_mapping 483 499 +16
free_swap_and_cache 409 425 +16
free_pages_and_swap_cache 161 177 +16
free_page_and_swap_cache 145 161 +16
do_swap_page 1216 1232 +16
__remove_mapping 408 424 +16
__page_file_mapping 82 98 +16
__page_file_index 70 85 +15
try_to_unmap_one 1324 1337 +13
shmem_getpage_gfp.isra 2358 2371 +13
add_to_swap_cache 47 60 +13
inc_cluster_info_page 204 210 +6
get_swap_page 411 415 +4
shmem_writepage 922 925 +3
sys_swapon 4210 4211 +1
swapcache_free_entries 786 768 -18
__add_to_swap_cache 445 406 -39
delete_from_swap_cache 149 104 -45
scan_swap_map_slots 1953 1889 -64
swap_do_scheduled_discard 713 568 -145
Total: Before=454535, After=455144, chg +0.13%

If I make the compound_head() in there conditional, this patch
actually ends up shrinking the code due to the refactoring of the
cluster functions:

add/remove: 1/0 grow/shrink: 10/5 up/down: 302/-327 (-25)
function old new delta
__free_cluster - 106 +106
get_swap_pages 465 555 +90
__delete_from_swap_cache 235 277 +42
shmem_replace_page 616 633 +17
migrate_page_move_mapping 1404 1418 +14
add_to_swap_cache 47 60 +13
migrate_page_copy 565 571 +6
inc_cluster_info_page 204 210 +6
get_swap_page 411 415 +4
shmem_writepage 922 925 +3
sys_swapon 4210 4211 +1
swapcache_free_entries 786 768 -18
delete_from_swap_cache 149 104 -45
__add_to_swap_cache 445 390 -55
scan_swap_map_slots 1953 1889 -64
swap_do_scheduled_discard 713 568 -145
Total: Before=454535, After=454510, chg -0.01%

But PageSwapCache() is somewhat ugly either way. Even with THP_SWAP
compiled in, it seems like most callsites wouldn't test tailpages?
Can we get rid of the compound_head() and annotate any callsites
working on potential tail pages?

> > Please review whether this is an acceptable version for you.
>
> Yes. It is good for me. I will give it more test on next Monday.

Thanks

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4acd6c4f808..d33e3280c8ad 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,7 +326,9 @@ PAGEFLAG_FALSE(HighMem)
#ifdef CONFIG_SWAP
static __always_inline int PageSwapCache(struct page *page)
{
+#ifdef CONFIG_THP_SWAP
page = compound_head(page);
+#endif
return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);

}
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 34adac6e9457..a4dba6975e7b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -401,7 +401,6 @@ extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
extern void swapcache_free(swp_entry_t);
-extern void swapcache_free_cluster(swp_entry_t);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -467,10 +466,6 @@ static inline void swapcache_free(swp_entry_t swp)
{
}

-static inline void swapcache_free_cluster(swp_entry_t swp)
-{
-}
-
static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
@@ -592,5 +587,13 @@ static inline bool mem_cgroup_swap_full(struct page *page)
}
#endif

+#ifdef CONFIG_THP_SWAP
+extern void swapcache_free_cluster(swp_entry_t);
+#else
+static inline void swapcache_free_cluster(swp_entry_t swp)
+{
+}
+#endif
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f597cabcaab7..eeaf145b2a20 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -825,6 +825,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
return n_ret;
}

+#ifdef CONFIG_THP_SWAP
static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
{
unsigned long idx;
@@ -862,6 +863,13 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
}
+#else
+static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+{
+ VM_WARN_ON_ONCE(1);
+ return 0;
+}
+#endif /* CONFIG_THP_SWAP */

static unsigned long scan_swap_map(struct swap_info_struct *si,
unsigned char usage)
@@ -1145,6 +1153,7 @@ void swapcache_free(swp_entry_t entry)
}
}

+#ifdef CONFIG_THP_SWAP
void swapcache_free_cluster(swp_entry_t entry)
{
unsigned long offset = swp_offset(entry);
@@ -1170,6 +1179,7 @@ void swapcache_free_cluster(swp_entry_t entry)
swap_free_cluster(si, idx);
spin_unlock(&si->lock);
}
+#endif /* CONFIG_THP_SWAP */

void swapcache_free_entries(swp_entry_t *entries, int n)
{