Re: [RFC 0/6]mm: add new LRU list for MADV_FREE pages

From: Minchan Kim
Date: Wed Feb 01 2017 - 00:53:31 EST


Hi Shaohua,

On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> Hi,
>
> We are trying to use MADV_FREE in jemalloc. Several issues are found. Without
> solving the issues, jemalloc can't use the MADV_FREE feature.
> - Doesn't support system without swap enabled. Because if swap is off, we can't
> or can't efficiently age anonymous pages. And since MADV_FREE pages are mixed
> with other anonymous pages, we can't reclaim MADV_FREE pages. In current
> implementation, MADV_FREE will fallback to MADV_DONTNEED without swap enabled.
> But in our environment, a lot of machines don't enable swap. This will prevent
> our setup using MADV_FREE.
> - Increases memory pressure. page reclaim bias file pages reclaim against
> anonymous pages. This doesn't make sense for MADV_FREE pages, because those
> pages could be freed easily and refilled with very slight penality. Even page
> reclaim doesn't bias file pages, there is still an issue, because MADV_FREE
> pages and other anonymous pages are mixed together. To reclaim a MADV_FREE
> page, we probably must scan a lot of other anonymous pages, which is
> inefficient. In our test, we usually see oom with MADV_FREE enabled and nothing
> without it.
> - RSS accounting. MADV_FREE pages are accounted as normal anon pages and
> reclaimed lazily, so application's RSS becomes bigger. This confuses our
> workloads. We have monitoring daemon running and if it finds applications' RSS
> becomes abnormal, the daemon will kill the applications even kernel can reclaim
> the memory easily. Currently we don't export separate RSS accounting for
> MADV_FREE pages. This will prevent our setup using MADV_FREE too.
>
> For the first two issues, introducing a new LRU list for MADV_FREE pages could
> solve the issues. We can directly reclaim MADV_FREE pages without writting them
> out to swap, so the first issue could be fixed. If only MADV_FREE pages are in
> the new list, page reclaim can easily reclaim such pages without interference
> of file or anonymous pages. The memory pressure issue will disappear.
>
> Actually Minchan posted patches to add the LRU list before, but he didn't
> pursue. So I picked up them and the patches are based on Minchan's previous
> patches. The main difference between my patches and Minchan previous patches is
> page reclaim policy. Minchan's patches introduces a knob to balance the reclaim
> of MADV_FREE pages and anon/file pages, while the patches always reclaim
> MADV_FREE pages first if there are. I described the reason in patch 5.

First of all, thanks for th effort to support MADV_FREE for swapless system,
Shaohua!

CCing Daniel,

The reason I have postponed is due to controverial part about balancing
used-once vs. madv_freed apges. I thought it doesn't make sense to reclaim
madv_freed pages first even if there are lots of used-once pages.

Recently, Johannes posted patches for balancing file/anon and it was based
on the cost model, IIRC. I wanted to base on it.

The idea is VM reclaims file-based pages and if refault happens, we can measure
refault distance and sizeof(LRU_LAZYFREE list). If refault distance is smaller
than lazyfree LRU list's size, it means the file-backed page have been kept
in memory if we have discarded lazyfree pages so it adds more cost to reclaim
lazyfree LRU list more agressively.

I tested your patch with simple MADV_FREE workload(just alloc and then repeated
touch/madv_free) with background stream-read process. In that case, the
MADV_FREE workload regressed in half without any gain for stream-read process.

I tested hacky code to simulate feedback loop I suggested idea and it restores
the performance regression. I'm not saying below hacky patch should merge in
but I think we should have used-once reclaim feedback logic to prevent
unnecessary purging for madv_freed pages.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 589a165..39d4bba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -703,6 +703,7 @@ typedef struct pglist_data {
/* Per-node vmstats */
struct per_cpu_nodestat __percpu *per_cpu_nodestats;
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
+ bool lazyfree;
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f809f04..cf54b81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2364,22 +2364,25 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
struct blk_plug plug;
bool scan_adjusted;

- /* reclaim all lazyfree pages so don't apply priority */
- nr[LRU_LAZYFREE] = lruvec_lru_size(lruvec, LRU_LAZYFREE, sc->reclaim_idx);
- while (nr[LRU_LAZYFREE]) {
- nr_to_scan = min(nr[LRU_LAZYFREE], SWAP_CLUSTER_MAX);
- nr[LRU_LAZYFREE] -= nr_to_scan;
- nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec, sc,
- LRU_LAZYFREE);
-
- if (nr_reclaimed >= nr_to_reclaim)
- break;
- cond_resched();
- }
+ if (pgdat->lazyfree) {
+ /* reclaim all lazyfree pages so don't apply priority */
+ nr[LRU_LAZYFREE] = lruvec_lru_size(lruvec, LRU_LAZYFREE, sc->reclaim_idx);
+ while (nr[LRU_LAZYFREE]) {
+ nr_to_scan = min(nr[LRU_LAZYFREE], SWAP_CLUSTER_MAX);
+ nr[LRU_LAZYFREE] -= nr_to_scan;
+ nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec, sc,
+ LRU_LAZYFREE);
+
+ if (nr_reclaimed >= nr_to_reclaim)
+ break;
+ cond_resched();
+ }

- if (nr_reclaimed >= nr_to_reclaim) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
+ if (nr_reclaimed >= nr_to_reclaim) {
+ sc->nr_reclaimed += nr_reclaimed;
+ pgdat->lazyfree = false;
+ return;
+ }
}

get_scan_count(lruvec, memcg, sc, nr, lru_pages);
diff --git a/mm/workingset.c b/mm/workingset.c
index c573cb2..2a01b91 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -233,7 +233,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
bool workingset_refault(void *shadow)
{
unsigned long refault_distance;
- unsigned long active_file;
+ unsigned long active_file, lazyfree;
struct mem_cgroup *memcg;
unsigned long eviction;
struct lruvec *lruvec;
@@ -268,6 +268,7 @@ bool workingset_refault(void *shadow)
lruvec = mem_cgroup_lruvec(pgdat, memcg);
refault = atomic_long_read(&lruvec->inactive_age);
active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
+ lazyfree = lruvec_lru_size(lruvec, LRU_LAZYFREE, MAX_NR_ZONES);
rcu_read_unlock();

/*
@@ -290,6 +291,9 @@ bool workingset_refault(void *shadow)

inc_node_state(pgdat, WORKINGSET_REFAULT);

+ if (refault_distance <= lazyfree)
+ pgdat->lazyfree = true;
+
if (refault_distance <= active_file) {
inc_node_state(pgdat, WORKINGSET_ACTIVATE);
return true;

>
> For the third issue, we can add a separate RSS count for MADV_FREE pages. The
> count will be increased in madvise syscall and decreased in page reclaim (eg,
> unmap). One issue is activate_page(). A MADV_FREE page can be promoted to
> active page there. But there isn't mm_struct context at that place. Iterating
> vma there sounds too silly. The patchset don't fix this issue yet. Hopefully
> somebody can share a hint how to fix this issue.
>
> Thanks,
> Shaohua
>
> Minchan previous patches:
> http://marc.info/?l=linux-mm&m=144800657002763&w=2
>
> Shaohua Li (6):
> mm: add wrap for page accouting index
> mm: add lazyfree page flag
> mm: add LRU_LAZYFREE lru list
> mm: move MADV_FREE pages into LRU_LAZYFREE list
> mm: reclaim lazyfree pages
> mm: enable MADV_FREE for swapless system
>
> drivers/base/node.c | 2 +
> drivers/staging/android/lowmemorykiller.c | 3 +-
> fs/proc/meminfo.c | 1 +
> fs/proc/task_mmu.c | 8 ++-
> include/linux/mm_inline.h | 41 +++++++++++++
> include/linux/mmzone.h | 9 +++
> include/linux/page-flags.h | 6 ++
> include/linux/swap.h | 2 +-
> include/linux/vm_event_item.h | 2 +-
> include/trace/events/mmflags.h | 1 +
> include/trace/events/vmscan.h | 31 +++++-----
> kernel/power/snapshot.c | 1 +
> mm/compaction.c | 11 ++--
> mm/huge_memory.c | 6 +-
> mm/khugepaged.c | 6 +-
> mm/madvise.c | 11 +---
> mm/memcontrol.c | 4 ++
> mm/memory-failure.c | 3 +-
> mm/memory_hotplug.c | 3 +-
> mm/mempolicy.c | 3 +-
> mm/migrate.c | 29 ++++------
> mm/page_alloc.c | 10 ++++
> mm/rmap.c | 7 ++-
> mm/swap.c | 51 +++++++++-------
> mm/vmscan.c | 96 +++++++++++++++++++++++--------
> mm/vmstat.c | 4 ++
> 26 files changed, 242 insertions(+), 109 deletions(-)
>
> --
> 2.9.3
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>