Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3

From: Mel Gorman
Date: Wed Dec 18 2013 - 10:00:49 EST


On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote:
> > Hi Mel,
> >
> > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> > > This series is currently untested and is being posted to sync up discussions
> > > on the treatment of page cache pages, particularly the sysv part. I have
> > > not thought it through in detail but postings patches is the easiest way
> > > to highlight where I think a problem might be.
> > >
> > > Changelog since v2
> > > o Drop an accounting patch, behaviour is deliberate
> > > o Special case tmpfs and shmem pages for discussion
> > >
> > > Changelog since v1
> > > o Fix lot of brain damage in the configurable policy patch
> > > o Yoink a page cache annotation patch
> > > o Only account batch pages against allocations eligible for the fair policy
> > > o Add patch that default distributes file pages on remote nodes
> > >
> > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > bug whereby new pages could be reclaimed before old pages because of how
> > > the page allocator and kswapd interacted on the per-zone LRU lists.
> >
> > Not just that, it was about ensuring predictable cache replacement and
> > maximizing the cache's effectiveness. This implicitely fixed the
> > kswapd interaction bug, but that was not the sole reason (I realize
> > that the original changelog is incomplete and I apologize for that).
> >
> > I have had offline discussions with Andrea back then and his first
> > suggestion was too to make this a zone fairness placement that is
> > exclusive to the local node, but eventually he agreed that the problem
> > applies just as much on the global level and that we should apply
> > fairness throughout the system as long as we honor zone_reclaim_mode
> > and hard bindings. During our discussions now, it turned out that
> > zone_reclaim_mode is a terrible predictor for preferred locality, but
> > we also more or less agreed that the locality issues in the first
> > place are not really applicable to cache loads dominated by IO cost.
> >
> > So I think the main discrepancy between the original patch and what we
> > truly want is that aging fairness is really only relevant for actual
> > cache backed by secondary storage, because cache replacement is an
> > ongoing operation that involves IO. As opposed to memory types that
> > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
> > (slab, kernel allocations), in which case we prefer NUMA locality.
> >
> > > Unfortunately a side-effect missed during review was that it's now very
> > > easy to allocate remote memory on NUMA machines. The problem is that
> > > it is not a simple case of just restoring local allocation policies as
> > > there are genuine reasons why global page aging may be prefereable. It's
> > > still a major change to default behaviour so this patch makes the policy
> > > configurable and sets what I think is a sensible default.
> > >
> > > The patches are on top of some NUMA balancing patches currently in -mm.
> > > It's untested and posted to discuss patches 4 and 6.
> >
> > It might be easier in dealing with -stable if we start with the
> > critical fix(es) to restore sane functionality as much and as compact
> > as possible and then place the cleanups on top?
> >
> > In my local tree, I have the following as the first patch:
>
> Updated version with your tmpfs __GFP_PAGECACHE parts added and
> documentation, changelog updated as necessary. I remain unconvinced
> that tmpfs pages should be round-robined, but I agree with you that it
> is the conservative change to do for 3.12 and 3.12 and we can figure
> out the rest later.

Assume you with 3.12 and 3.13 here.

> I sure hope that this doesn't drive most people
> on NUMA to disable pagecache interleaving right away as I expect most
> tmpfs workloads to see little to no reclaim and prefer locality... :/
>

I hope you're right but I expect the experience will be like
zone_reclaim_mode. We're going to be looking out for bug reports that
are "fixed" by disabling pagecache locality and pushing back on them by
fixing the real problem.

This was the experience with zone_reclaim_mode when it started going
wrong. It was also the experience with THP for a very long time.
Disabling THP was a workaround for all sorts of problems and it was very
important to fix them and push back on anyone documenting disabling THP
as a standard workaround.

> ---
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache
>

Monolithic patch with multiple changes but meh. I'm not pushed because I
know what the breakout looks like. FWIW, I had intended the entire of my
broken-out series for 3.12 and 3.13 once it got ironed out. I find the
series easier to understand but of course I would.

> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
> in order to ensure predictable pagecache replacement and to maximize
> the cache's effectiveness of reducing IO regardless of zone or node
> topology.
>
> However, it was overzealous in round-robin placing every type of
> allocation over all allowable nodes, instead of preferring locality,
> which resulted in severe regressions on certain NUMA workloads that
> have nothing to do with pagecache.
>
> This patch drastically reduces the impact of the original change by
> having the round-robin placement policy only apply to pagecache
> allocations and no longer to anonymous memory, shmem, slab and other
> types of kernel allocations.
>
> This still changes the long-standing behavior of pagecache adhering to
> the configured memory policy and preferring local allocations per
> default, so make it configurable in case somebody relies on it.
> However, we also expect the majority of users to prefer maximium cache
> effectiveness and a predictable replacement behavior over memory
> locality, so reflect this in the default setting of the sysctl.
>
> No-signoff-without-Mel's
> Cc: <stable@xxxxxxxxxx> # 3.12
> ---
> Documentation/sysctl/vm.txt | 20 ++++++++++++++++
> Documentation/vm/numa_memory_policy.txt | 7 ++++++
> include/linux/gfp.h | 4 +++-
> include/linux/pagemap.h | 2 +-
> include/linux/swap.h | 2 ++
> kernel/sysctl.c | 8 +++++++
> mm/filemap.c | 2 ++
> mm/page_alloc.c | 41 +++++++++++++++++++++++++--------
> mm/shmem.c | 14 +++++++++++
> 9 files changed, 88 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 1fbd4eb7b64a..308c342f62ad 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
> - memory_failure_early_kill
> - memory_failure_recovery
> - min_free_kbytes
> +- pagecache_mempolicy_mode
> - min_slab_ratio
> - min_unmapped_ratio
> - mmap_min_addr

Sure about the name?

This is a boolean and "mode" implies it might be a bitmask. That said, I
recognise that my own naming also sucked because complaining about yours
I can see that mine also sucks.

> @@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly.
>
> =============================================================
>
> +pagecache_mempolicy_mode:
> +
> +This is available only on NUMA kernels.
> +
> +Per default, pagecache is allocated in an interleaving fashion over
> +all allowed nodes (hardbindings and zone_reclaim_mode excluded),
> +regardless of the selected memory policy.
> +
> +The assumption is that, when it comes to pagecache, users generally
> +prefer predictable replacement behavior regardless of NUMA topology
> +and maximizing the cache's effectiveness in reducing IO over memory
> +locality.
> +
> +This behavior can be changed by enabling pagecache_mempolicy_mode, in
> +which case page cache allocations will be placed according to the
> +configured memory policy (Documentation/vm/numa_memory_policy.txt).
> +

Ok this indicates that pagecache will still be interleaved on zones local
to the node the process is allocating on. Good because that preserves a
very important aspect of your original patch.

The current description feels a little backwards though -- "Enable this
to *not* interleave pagecache". This documented behaviour says to me
that pagecache_obey_mempolicy might be a better name if enabling it uses
the system default memory policy. However, even that might put us in a
corner. Ultimately we want this to be controllable on a per-process basis
using memory policies.

Merging what I have in v3, unreleased v4 and this thing I ended up with
this. The observation about cpusets was raised by Michal Hocko on IRC.

---8<---
mpol_interleave_files

This is available only on NUMA kernels.

Historically, the default behaviour of the system is to allocate memory
local to the process. The behaviour was usually modified through the use
of memory policies while zone_reclaim_mode controls how strict the local
memory allocation policy is.

Issues arise when the allocating process is frequently running on the same
node. The kernels memory reclaim daemon runs one instance per NUMA node.
A consequence is that relatively new memory may be reclaimed by kswapd when
the allocating process is running on a specific node. The user-visible
impact is that the system appears to do more IO than necessary when a
workload is accessing files that are larger than a given NUMA node.

To address this problem, the default system memory policy is modified by
this tunable.

When this tunable is enabled, the system default memory policy will
interleave batches of file-backed pages over all allowed zones and nodes.
The assumption is that, when it comes to file pages that users generally
prefer predictable replacement behavior regardless of NUMA topology and
maximizing the page cache's effectiveness in reducing IO over memory
locality.

The tunable zone_reclaim_mode overrides this and enabling zone_reclaim_mode
functionally disables mpol_interleave_pagecache.

A process running within a memory cpuset will obey the cpuset policy and
ignore mpol_interleave_files.

At the time of writing, this parameter cannot be overridden by a process
using set_mempolicy to set the task memory policy. Similarly, numactl
setting the task memory policy will not override this setting. This may
change in the future.

The tunable is default enabled and has two recognised parameters;

0: Use the MPOL_LOCAL policy as the system-wide default
1: Batch interleave file-backed allocations over all allowed nodes

One enabled, the downside is that some file accesses will now be to remote
memory even though the local node had available resources. This will hurt
workloads with small or short lived files that fit easily within one node.
The upside is that workloads working on files larger than a NUMA node will
not reclaim active pages prematurely.
---8<---

> +=============================================================
> +
> min_slab_ratio:
>
> This is available only on NUMA kernels.
> diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
> index 4e7da6543424..72247e565908 100644
> --- a/Documentation/vm/numa_memory_policy.txt
> +++ b/Documentation/vm/numa_memory_policy.txt
> @@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of. When
> both cpusets and policies are applied to a task, the restrictions of the cpuset
> takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details.
>
> +Note that, per default, the memory policies do not apply to pagecache. Instead
> +it will be interleaved fairly over all allowable nodes (respecting hardbindings
> +and zone_reclaim_mode) in order to maximize the cache's effectiveness in
> +reducing IO and to ensure predictable cache replacement. Special setups that
> +require pagecache to adhere to the configured memory policy can change this
> +behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt).
> +

Manual pages should also be updated.

> MEMORY POLICY CONCEPTS
>
> Scope of Memory Policies
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 9b4dd491f7e8..f69e4cb78ccf 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u
> +#define ___GFP_PAGECACHE 0x2000000u
> /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
> /*
> @@ -92,6 +93,7 @@ struct vm_area_struct;
> #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
> #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
> +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */
>
> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> @@ -99,7 +101,7 @@ struct vm_area_struct;
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75a078b..bda48453af8e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
> #else
> static inline struct page *__page_cache_alloc(gfp_t gfp)
> {
> - return alloc_pages(gfp, 0);
> + return alloc_pages(gfp | __GFP_PAGECACHE, 0);
> }
> #endif
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..3458994b0881 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages;
>
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> +extern int pagecache_mempolicy_mode;
> extern int sysctl_min_unmapped_ratio;
> extern int sysctl_min_slab_ratio;
> extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> #else
> #define zone_reclaim_mode 0
> +#define pagecache_mempolicy_mode 0
> static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
> {
> return 0;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 34a604726d0b..a8c56c1dc98e 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = {
> .extra1 = &zero,
> },
> {
> + .procname = "pagecache_mempolicy_mode",
> + .data = &pagecache_mempolicy_mode,
> + .maxlen = sizeof(pagecache_mempolicy_mode),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + .extra1 = &zero,
> + },
> + {
> .procname = "min_unmapped_ratio",
> .data = &sysctl_min_unmapped_ratio,
> .maxlen = sizeof(sysctl_min_unmapped_ratio),
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b7749a92021c..5bb922506906 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
> int n;
> struct page *page;
>
> + gfp |= __GFP_PAGECACHE;
> +
> if (cpuset_do_page_mem_spread()) {
> unsigned int cpuset_mems_cookie;
> do {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 580a5f075ed0..f7c0ecb5bb8b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1547,7 +1547,15 @@ again:
> get_pageblock_migratetype(page));
> }
>
> + /*
> + * All allocations eat into the round-robin batch, even
> + * allocations that are not subject to round-robin placement
> + * themselves. This makes sure that allocations that ARE
> + * subject to round-robin placement compensate for the
> + * allocations that aren't, to have equal placement overall.
> + */
> __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
> +
> __count_zone_vm_events(PGALLOC, zone, 1 << order);
> zone_statistics(preferred_zone, zone, gfp_flags);
> local_irq_restore(flags);

Thanks.

> @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>
> #ifdef CONFIG_NUMA
> /*
> + * pagecache_mempolicy_mode - whether pagecache allocations should
> + * honor the configured memory policy and allocate from the zonelist
> + * in order of preference, or whether they should interleave fairly
> + * over all allowed zones in the given zonelist to maximize cache
> + * effects and ensure predictable cache replacement.
> + */
> +int pagecache_mempolicy_mode __read_mostly;
> +
> +/*
> * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
> * skip over zones that are not allowed by the cpuset, or that have
> * been recently (in last second) found to be nearly full. See further
> @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>
> static bool zone_local(struct zone *local_zone, struct zone *zone)
> {
> - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> + return local_zone->node == zone->node;
> }

Does that not break on !CONFIG_NUMA?

It's why I used zone_to_nid

>
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> @@ -1908,22 +1925,25 @@ zonelist_scan:
> if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
> goto try_this_zone;
> /*
> - * Distribute pages in proportion to the individual
> - * zone size to ensure fair page aging. The zone a
> - * page was allocated in should have no effect on the
> - * time the page has in memory before being reclaimed.
> + * Distribute pagecache pages in proportion to the
> + * individual zone size to ensure fair page aging.
> + * The zone a page was allocated in should have no
> + * effect on the time the page has in memory before
> + * being reclaimed.
> *
> - * When zone_reclaim_mode is enabled, try to stay in
> - * local zones in the fastpath. If that fails, the
> + * When pagecache_mempolicy_mode or zone_reclaim_mode
> + * is enabled, try to allocate from zones within the
> + * preferred node in the fastpath. If that fails, the
> * slowpath is entered, which will do another pass
> * starting with the local zones, but ultimately fall
> * back to remote zones that do not partake in the
> * fairness round-robin cycle of this zonelist.
> */
> - if (alloc_flags & ALLOC_WMARK_LOW) {
> + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> + (gfp_mask & __GFP_PAGECACHE)) {
> if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> continue;

NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or
pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in
some cases. Looks a bit fishy even though I can't quite put my finger on it.

I also got details wrong here in the v3 of the series. In an unreleased
v4 of the series I had corrected the treatment of slab pages in line
with your wishes and reused the broken out helper in prepare_slowpath to
keep the decision in sync.

It's still in development but even if it gets rejected it'll act as a
comparison point to yours.

> - if (zone_reclaim_mode &&
> + if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> !zone_local(preferred_zone, zone))
> continue;
> }

Documention says "enabling pagecache_mempolicy_mode, in which case page cache
allocations will be placed according to the configured memory policy". Should
that be !pagecache_mempolicy_mode? I'm getting confused with the double nots.

Breaking this out would be more comprehensible.

On a semi-related note, we might encounter a problem later where the
interleaving causes us to skip over usable zones and zones with available
batches are !zone_dirty_ok. We'd fall back to the slowpatch resetting the
batches so it will not be particularly visible but there might be some
interactions there.

> @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
> * thrash fairness information for zones that are not
> * actually part of this zonelist's round-robin cycle.
> */
> - if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
> + if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> + !zone_local(preferred_zone, zone))
> continue;
> mod_zone_page_state(zone, NR_ALLOC_BATCH,
> high_wmark_pages(zone) -
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8297623fcaed..02d7a9c03463 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
> return page;
> }
>
> +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
> +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
> +{
> + /* If no internal shm_mount then it must be tmpfs */
> + if (IS_ERR(shm_mnt))
> + return true;
> +
> + /* Consider it to be tmpfs if the superblock is not the internal mount */
> + return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
> +}
> +
> static struct page *shmem_alloc_page(gfp_t gfp,
> struct shmem_inode_info *info, pgoff_t index)
> {
> @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
> pvma.vm_ops = NULL;
> pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
>
> + if (shmem_inode_on_tmpfs(info))
> + gfp |= __GFP_PAGECACHE;
> +
> page = alloc_page_vma(gfp, &pvma, 0);
>
> /* Drop reference taken by mpol_shared_policy_lookup() */

For what it's worth, this is what I've currently kicked off testes for

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/