Re: [PATCH] mm: fix page cache convergence regression

From: Johannes Weiner
Date: Thu May 30 2019 - 12:19:30 EST


Are there any objections or feedback on the proposed fix below? This
is kind of a serious regression.

On Fri, May 24, 2019 at 01:39:00PM -0400, Johannes Weiner wrote:
> From 63a0dbc571ff38f7c072c62d6bc28192debe37ac Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Fri, 24 May 2019 10:12:46 -0400
> Subject: [PATCH] mm: fix page cache convergence regression
>
> Since a28334862993 ("page cache: Finish XArray conversion"), on most
> major Linux distributions, the page cache doesn't correctly transition
> when the hot data set is changing, and leaves the new pages thrashing
> indefinitely instead of kicking out the cold ones.
>
> On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
> running stock Arch Linux:
>
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120086/153600 workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120268/153600 workingset-b
>
> workingset-b is a 600M file on a 1G host that is otherwise entirely
> idle. No matter how often it's being accessed, it won't get cached.
>
> While investigating, I noticed that the non-resident information gets
> aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
> a problem because a workingset transition like this relies on the
> non-resident information tracked in the page cache tree of evicted
> file ranges: when the cache faults are refaults of recently evicted
> cache, we challenge the existing active set, and that allows a new
> workingset to establish itself.
>
> Tracing the shrinker that maintains this memory revealed that all page
> cache tree nodes were allocated to the root cgroup. This is a problem,
> because 1) the shrinker sizes the amount of non-resident information
> it keeps to the size of the cgroup's other memory and 2) on most major
> Linux distributions, only kernel threads live in the root cgroup and
> everything else gets put into services or session groups:
>
> [root@ham ~]# cat /proc/self/cgroup
> 0::/user.slice/user-0.slice/session-c1.scope
>
> As a result, we basically maintain no non-resident information for the
> workloads running on the system, thus breaking the caching algorithm.
>
> Looking through the code, I found the culprit in the above-mentioned
> patch: when switching from the radix tree to xarray, it dropped the
> __GFP_ACCOUNT flag from the tree node allocations - the flag that
> makes sure the allocated memory gets charged to and tracked by the
> cgroup of the calling process - in this case, the one doing the fault.
>
> To fix this, allow xarray users to specify per-tree flag that makes
> xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
> tree annotation to request such cgroup tracking for the cache nodes.
>
> With this patch applied, the page cache correctly converges on new
> workingsets again after just a few iterations:
>
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 124607/153600 workingset-a
> 87876/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 81313/153600 workingset-a
> 133321/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 63036/153600 workingset-a
> 153600/153600 workingset-b
>
> Cc: stable@xxxxxxxxxxxxxxx # 4.20+
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> fs/inode.c | 2 +-
> include/linux/xarray.h | 1 +
> lib/xarray.c | 12 ++++++++++--
> 3 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index e9d18b2c3f91..cd67859dbaf1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -361,7 +361,7 @@ EXPORT_SYMBOL(inc_nlink);
>
> static void __address_space_init_once(struct address_space *mapping)
> {
> - xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
> + xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> init_rwsem(&mapping->i_mmap_rwsem);
> INIT_LIST_HEAD(&mapping->private_list);
> spin_lock_init(&mapping->private_lock);
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 0e01e6129145..5921599b6dc4 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -265,6 +265,7 @@ enum xa_lock_type {
> #define XA_FLAGS_TRACK_FREE ((__force gfp_t)4U)
> #define XA_FLAGS_ZERO_BUSY ((__force gfp_t)8U)
> #define XA_FLAGS_ALLOC_WRAPPED ((__force gfp_t)16U)
> +#define XA_FLAGS_ACCOUNT ((__force gfp_t)32U)
> #define XA_FLAGS_MARK(mark) ((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
> (__force unsigned)(mark)))
>
> diff --git a/lib/xarray.c b/lib/xarray.c
> index 6be3acbb861f..446b956c9188 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -298,6 +298,8 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
> xas_destroy(xas);
> return false;
> }
> + if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> + gfp |= __GFP_ACCOUNT;
> xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> if (!xas->xa_alloc)
> return false;
> @@ -325,6 +327,8 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
> xas_destroy(xas);
> return false;
> }
> + if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> + gfp |= __GFP_ACCOUNT;
> if (gfpflags_allow_blocking(gfp)) {
> xas_unlock_type(xas, lock_type);
> xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> @@ -358,8 +362,12 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
> if (node) {
> xas->xa_alloc = NULL;
> } else {
> - node = kmem_cache_alloc(radix_tree_node_cachep,
> - GFP_NOWAIT | __GFP_NOWARN);
> + gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
> +
> + if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> + gfp |= __GFP_ACCOUNT;
> +
> + node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> if (!node) {
> xas_set_err(xas, -ENOMEM);
> return NULL;
> --
> 2.21.0
>