[RFC] mm, swap: VMA based swap readahead

From: Huang, Ying
Date: Mon Mar 06 2017 - 00:55:25 EST


From: Huang Ying <ying.huang@xxxxxxxxx>

The swap readahead is a important mechanism to reduce the swap in
latency. But the original swap readahead algorithm has some issues.

a) The original swap readahead algorithm does readahead based on the
consecutive blocks in swap device. But the consecutive blocks in
swap device just reflect the order of page reclaiming, but don't
necessarily reflect the access sequence in RAM. Although for some
workloads (such as whole system pure sequential workload), it
works, but in general, it doesn't work well for more complex access
pattern, such as combined workloads (sequential and random workload
runs together) or changing accessing patterns in workload.

b) The original swap readahead algorithm just turns random read into
sequential read, but doesnât parallel CPU and disk operations.
This can reduce the average latency and the lowest latency, but
doesnât help much for high percentile latency (like 90%, 95%,
99%).

In this patch, a VMA based swap readahead algorithm is implemented.
When the page fault occurs, the pages along the access direction will
be checked and the swapped out pages will be readahead if they fall
inside the readahead window. There is one readahead window state for
each VMA, to reflect the different access patterns for different VMAs.
The readahead window is scaled based on whether the accessed page is
consecutive in virtual address with the previous accessed page. If
so, the readahead window will be extended, otherwise, the readahead
window will be shrunk rapidly.

The test and comparison result is as below.

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Pure sequential access pattern workload
=======================================

Test case
---------

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory write until 300
seconds, for each page accessed, 24us is delayed to emulate CPU
operations on the page. The first round writing will trigger swap
out, the following rounds will trigger sequential swap in and out.

Result
------

Base kernel:

samples: 100952, average: 40.08us, min: 0us, max: 16.2ms
50th: 1us, 60th: 1us, 70th: 1us, 80th: 3us, 90th: 158us, 95th: 181us, 99th: 254us

Optimized kernel:

samples: 118509, average: 12.04us, min: 0us, max: 17.42ms
50th: 1us, 60th: 2us, 70th: 2us, 80th: 2us, 90th: 8us, 95th: 11us, 99th: 42us

The VMA based swap readahead has much lower latency. Mainly because
VMA based swap readahead algorithm parallels CPU operations and disk
IO, while the original swap readahead algorithm doesnât do that.

Pure random access pattern workload
===================================

Test Case
---------

vm-scalability, random swap test case, 4 processes to eat 50G virtual
memory space, repeat the memory random write until 300 seconds, for
each page accessed, 24us is delayed to emulate CPU operations on the
page. The first round writing will trigger swap out, the following
rounds will trigger random swap in.

Result
------

Base kernel:

samples: 70757, average: 70.71us, min: 0us, max: 15.19ms
50th: 78us, 60th: 83us, 70th: 91us, 80th: 93us, 90th: 97us, 95th: 103us, 99th: 120us

Optimized kernel:

samples: 70842, average: 72.55us, min: 0us, max: 15.47ms
50th: 79us, 60th: 83us, 70th: 92us, 80th: 94us, 90th: 97us, 95th: 104us, 99th: 122us

Almost exact same for all results. Both are good at avoiding readahead.

Combined access pattern workload
================================

Test Case
---------

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory write until 300
seconds, for each page accessed, 24us is delayed to emulate CPU
operations on the page. The first round writing will trigger swap
out, the following round will trigger sequential swap in and out.

As background noise, vm-scalability, random swap test case, 8
processes to eat 30G virtual memory space, repeat the random memory
write until 300 seconds, for each page accessed, 24us is delayed to
emulate CPU operations on the page. It will trigger random swap in in
the background.

All in all, after reaching relative stable state, there are sequential
and random swap in.

Result (sequential workload)
----------------------------

Base kernel:

samples: 228658, average: 68.52us, min: 0us, max: 23.67ms
50th: 2us, 60th: 42us, 70th: 74us, 80th: 106us, 90th: 150us, 95th: 181us, 99th: 263us

Optimized kernel:

samples: 271851, average: 26.8us, min: 0us, max: 21.24ms
50th: 1us, 60th: 1us, 70th: 2us, 80th: 3us, 90th: 58us, 95th: 115us, 99th: 225us

The original swap readahead algorithm is confused by the background
random access workload, so readahead is not enough for the sequential
access workload. The VMA-base readahead algorithm works much better.

Signed-off-by: Huang Ying <ying.huang@xxxxxxxxx>
---
include/linux/mm_types.h | 1 +
include/linux/swap.h | 42 +++++++
mm/memory.c | 14 ++-
mm/swap_state.c | 281 ++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 329 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f6d440ad785..23ca6f9804dc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -351,6 +351,7 @@ struct vm_area_struct {
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */

+ atomic_long_t swap_readahead_info;
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd6716d..44bb8fdfbafe 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,6 +250,26 @@ struct swap_info_struct {
struct swap_cluster_list discard_clusters; /* discard clusters list */
};

+#ifdef CONFIG_64BIT
+#define SWAP_RA_ORDER_CEILING 9
+#else
+/* Avoid stack overflow, because we need to save part of page table */
+#define SWAP_RA_ORDER_CEILING 2
+/* 2 * window - 1 will be checked for readahead */
+#define SWAP_RA_PTE_CACHE_SIZE ((1 << (1 + SWAP_RA_ORDER_CEILING)) - 1)
+#endif
+
+struct vma_swap_readahead {
+ int direction;
+ int win;
+ int nr_pte;
+#ifdef CONFIG_64BIT
+ pte_t *ptes;
+#else
+ pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
+#endif
+};
+
/* linux/mm/workingset.c */
void *workingset_eviction(struct address_space *mapping, struct page *page);
bool workingset_refault(void *shadow);
@@ -348,6 +368,7 @@ int generic_swapfile_activate(struct swap_info_struct *, struct file *,
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
extern struct address_space *swapper_spaces[];
+extern bool swap_vma_readahead;
#define swap_address_space(entry) \
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])
@@ -369,6 +390,13 @@ extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
extern struct page *swapin_readahead(swp_entry_t, gfp_t,
struct vm_area_struct *vma, unsigned long addr);

+extern void swap_readahead_detect(struct vm_fault *vmf,
+ struct vma_swap_readahead *swap_ra);
+extern struct page *do_swap_page_readahead(struct vm_fault *vmf,
+ struct vma_swap_readahead *swap_ra,
+ swp_entry_t fentry,
+ struct page *fpage);
+
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
@@ -421,6 +449,7 @@ extern void swapcache_free_batch(swp_entry_t *entries, int n);
#define total_swap_pages 0L
#define total_swapcache_pages() 0UL
#define vm_swap_full() 0
+#define swap_vma_readahead 0

#define si_swapinfo(val) \
do { (val)->freeswap = (val)->totalswap = 0; } while (0)
@@ -466,6 +495,19 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}

+static inline void swap_readahead_detect(struct vm_fault *vmf,
+ struct vma_swap_readahead *swap_ra)
+{
+}
+
+static inline struct page *do_swap_page_readahead(struct vm_fault *vmf,
+ struct vma_swap_readahead *swap_ra,
+ swp_entry_t fentry,
+ struct page *fpage)
+{
+ return NULL;
+}
+
static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
{
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 7f26782de4d5..670bb0c864a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2558,12 +2558,15 @@ int do_swap_page(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct page *page, *swapcache;
struct mem_cgroup *memcg;
+ struct vma_swap_readahead swap_ra;
swp_entry_t entry;
pte_t pte;
int locked;
int exclusive = 0;
int ret = 0;

+ if (swap_vma_readahead)
+ swap_readahead_detect(vmf, &swap_ra);
if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
goto out;

@@ -2583,8 +2586,12 @@ int do_swap_page(struct vm_fault *vmf)
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
- page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma,
- vmf->address);
+ if (swap_vma_readahead)
+ page = do_swap_page_readahead(vmf, &swap_ra,
+ entry, NULL);
+ else
+ page = swapin_readahead(entry,
+ GFP_HIGHUSER_MOVABLE, vma, vmf->address);
if (!page) {
/*
* Back out if somebody else faulted in this pte
@@ -2611,7 +2618,8 @@ int do_swap_page(struct vm_fault *vmf)
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
swapcache = page;
goto out_release;
- }
+ } else if (swap_vma_readahead)
+ do_swap_page_readahead(vmf, &swap_ra, entry, page);

swapcache = page;
locked = swap_lock_page_or_retry(page, vma->vm_mm, vmf->flags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2126e9ba23b2..5aaae823b8ae 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -36,6 +36,7 @@ static const struct address_space_operations swap_aops = {

struct address_space *swapper_spaces[MAX_SWAPFILES];
static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
+bool swap_vma_readahead = true;

#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)

@@ -295,13 +296,15 @@ struct page * lookup_swap_cache(swp_entry_t entry)

page = find_get_page(swap_address_space(entry), swp_offset(entry));

- if (page) {
- INC_CACHE_INFO(find_success);
- if (TestClearPageReadahead(page))
- atomic_inc(&swapin_readahead_hits);
- }
+ if (!swap_vma_readahead) {
+ if (page) {
+ INC_CACHE_INFO(find_success);
+ if (TestClearPageReadahead(page))
+ atomic_inc(&swapin_readahead_hits);
+ }

- INC_CACHE_INFO(find_total);
+ INC_CACHE_INFO(find_total);
+ }
return page;
}

@@ -561,3 +564,269 @@ void exit_swap_address_space(unsigned int type)
synchronize_rcu();
kvfree(spaces);
}
+
+#ifdef CONFIG_64BIT
+#define SWAP_RA_ORDER_DEFAULT 3
+#else
+#define SWAP_RA_ORDER_DEFAULT 2
+#endif
+
+static int swap_ra_order_max = SWAP_RA_ORDER_DEFAULT;
+
+#define SWAP_RA_ADDR(v) ((v) & PAGE_MASK)
+#define SWAP_RA_HITS(v) ((v) & (PAGE_SIZE - 1))
+
+#define SWAP_RA_VAL(addr, hits) ((addr) | (hits))
+
+static inline int swap_ra_hits_max(int max_order)
+{
+ return 2 * max_order;
+}
+
+static inline int swap_ra_hits_to_order(int hits)
+{
+ return hits / 2;
+}
+
+static inline int swap_ra_detect_window(int max_order)
+{
+ return 1 << (max_order + PAGE_SHIFT);
+}
+
+static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
+ unsigned long faddr,
+ unsigned long addr,
+ unsigned long *start,
+ unsigned long *end)
+{
+ *start = max3(min(addr, faddr) + PAGE_SIZE,
+ vma->vm_start, faddr & PMD_MASK) >> PAGE_SHIFT;
+ *end = min3(max(addr, faddr),
+ vma->vm_end, (faddr & PMD_MASK) + PMD_SIZE) >> PAGE_SHIFT;
+}
+
+void swap_readahead_detect(struct vm_fault *vmf,
+ struct vma_swap_readahead *swap_ra)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long swap_ra_info;
+ unsigned long addr, faddr;
+ unsigned long start, end, pfn;
+ pte_t *pte;
+ int max_order, hits, hit;
+#ifndef CONFIG_64BIT
+ pte_t *tpte;
+#endif
+
+ swap_ra_info = atomic_long_read(&vma->swap_readahead_info);
+ max_order = READ_ONCE(swap_ra_order_max);
+
+ addr = SWAP_RA_ADDR(swap_ra_info);
+ hits = SWAP_RA_HITS(swap_ra_info);
+ faddr = vmf->address & PAGE_MASK;
+ swap_ra->direction = faddr >= addr ? 1 : -1;
+
+ if (faddr == addr)
+ hit = 0;
+ else if (addr < vma->vm_start || addr >= vma->vm_end)
+ hit = -1;
+ else if (abs((long)(faddr - addr)) < swap_ra_detect_window(max_order)) {
+ hit = (addr & PMD_MASK) == (faddr & PMD_MASK);
+ swap_ra_clamp_pfn(vma, faddr, addr, &start, &end);
+ pte = vmf->pte - ((faddr >> PAGE_SHIFT) - start);
+ for (pfn = start; pfn != end; pfn++, pte++)
+ if (!pte_present(*pte)) {
+ hit = -1;
+ break;
+ }
+ } else
+ hit = -1;
+
+ /*
+ * If failed in the sequential access detection, shrink the
+ * readahead window rapidly. Otherwise keep or enlarge
+ * readahead window.
+ */
+ if (hit < 0)
+ hits /= 2;
+ else
+ hits = min(hits + hit, swap_ra_hits_max(max_order));
+
+ /* Update readahead information */
+ atomic_long_set(&vma->swap_readahead_info, SWAP_RA_VAL(faddr, hits));
+ swap_ra->win = 1 << swap_ra_hits_to_order(hits);
+
+ /* No readahead */
+ if (swap_ra->win == 1) {
+ swap_ra->nr_pte = 0;
+ return;
+ }
+
+ addr = faddr + (swap_ra->win << (1 + PAGE_SHIFT)) * swap_ra->direction;
+ swap_ra_clamp_pfn(vma, faddr, addr, &start, &end);
+
+ swap_ra->nr_pte = end - start;
+ pte = vmf->pte - ((faddr >> PAGE_SHIFT) - start);
+#ifdef CONFIG_64BIT
+ swap_ra->ptes = pte;
+#else
+ tpte = swap_ra->ptes;
+ for (pfn = start; pfn != end; pfn++)
+ *tpte++ = *pte++;
+#endif
+}
+
+struct page *do_swap_page_readahead(struct vm_fault *vmf,
+ struct vma_swap_readahead *swap_ra,
+ swp_entry_t fentry,
+ struct page *fpage)
+{
+ struct blk_plug plug;
+ struct vm_area_struct *vma = vmf->vma;
+ struct page *page;
+ unsigned long addr;
+ pte_t *pte, pentry;
+ gfp_t gfp_mask;
+ swp_entry_t entry;
+ int i, alloc = 0, count;
+ bool page_allocated;
+
+ addr = vmf->address & PAGE_MASK;
+ blk_start_plug(&plug);
+ if (!fpage) {
+ fpage = __read_swap_cache_async(fentry, GFP_HIGHUSER_MOVABLE,
+ vma, addr, &page_allocated);
+ if (!fpage) {
+ blk_finish_plug(&plug);
+ return NULL;
+ }
+ if (page_allocated) {
+ alloc++;
+ swap_readpage(fpage);
+ }
+ }
+ /* fault page has been checked */
+ count = 1;
+ addr += PAGE_SIZE * swap_ra->direction;
+ pte = swap_ra->ptes;
+ if (swap_ra->direction < 0)
+ pte += swap_ra->nr_pte - 1;
+ /* check 2 * window size, but readahead window size at most */
+ for (i = 0; i < swap_ra->nr_pte && alloc < swap_ra->win;
+ i++, count++, pte += swap_ra->direction,
+ addr += PAGE_SIZE * swap_ra->direction) {
+ /* Page available for window size, skip readahead */
+ if (!alloc && count >= swap_ra->win)
+ break;
+ pentry = *pte;
+ if (pte_none(pentry))
+ break;
+ if (pte_present(pentry))
+ continue;
+ entry = pte_to_swp_entry(pentry);
+ if (unlikely(non_swap_entry(entry)))
+ break;
+ /* Avoid page allocation latency from readahead */
+ gfp_mask = __GFP_KSWAPD_RECLAIM | __GFP_HIGHMEM |
+ __GFP_MOVABLE | __GFP_HARDWALL | __GFP_NOWARN;
+ page = __read_swap_cache_async(entry, gfp_mask, vma, addr,
+ &page_allocated);
+ /*
+ * Memory allocation failure, avoid to put too much overhead
+ * on memory allocator because of readahead
+ */
+ if (!page)
+ break;
+ if (page_allocated) {
+ alloc++;
+ swap_readpage(page);
+ }
+ put_page(page);
+ }
+ blk_finish_plug(&plug);
+ /* Push any new pages onto the LRU now */
+ if (alloc)
+ lru_add_drain();
+
+ return fpage;
+}
+
+#ifdef CONFIG_SYSFS
+static ssize_t vma_ra_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%s\n", swap_vma_readahead ? "true" : "false");
+}
+static ssize_t vma_ra_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
+ swap_vma_readahead = true;
+ else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
+ swap_vma_readahead = false;
+ else
+ return -EINVAL;
+
+ return count;
+}
+static struct kobj_attribute vma_ra_enabled_attr =
+ __ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show,
+ vma_ra_enabled_store);
+
+static ssize_t vma_ra_max_order_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", swap_ra_order_max);
+}
+static ssize_t vma_ra_max_order_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err, v;
+
+ err = kstrtoint(buf, 10, &v);
+ if (err || v > UINT_MAX)
+ return -EINVAL;
+
+ swap_ra_order_max = max(min(v, SWAP_RA_ORDER_CEILING), 0);
+
+ return count;
+}
+static struct kobj_attribute vma_ra_max_order_attr =
+ __ATTR(vma_ra_max_order, 0644, vma_ra_max_order_show,
+ vma_ra_max_order_store);
+
+static struct attribute *swap_attrs[] = {
+ &vma_ra_enabled_attr.attr,
+ &vma_ra_max_order_attr.attr,
+ NULL,
+};
+
+static struct attribute_group swap_attr_group = {
+ .attrs = swap_attrs,
+};
+
+static int __init swap_init_sysfs(void)
+{
+ int err;
+ struct kobject *swap_kobj;
+
+ swap_kobj = kobject_create_and_add("swap", mm_kobj);
+ if (unlikely(!swap_kobj)) {
+ pr_err("failed to create swap kobject\n");
+ return -ENOMEM;
+ }
+ err = sysfs_create_group(swap_kobj, &swap_attr_group);
+ if (err) {
+ pr_err("failed to register swap group\n");
+ goto delete_obj;
+ }
+ return 0;
+
+delete_obj:
+ kobject_put(swap_kobj);
+ return err;
+}
+subsys_initcall(swap_init_sysfs);
+#endif
--
2.11.0