Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page agingpolicy configurable

From: Mel Gorman
Date: Tue Dec 17 2013 - 10:30:07 EST


On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > bug whereby new pages could be reclaimed before old pages because of
> > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > Unfortunately it was missed during review that a consequence is that
> > we also round-robin between NUMA nodes. This is bad for two reasons
> >
> > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > 2. It incurs an immediate remote memory performance hit in exchange
> > for a potential performance gain when memory needs to be reclaimed
> > later
> >
> > No cookies for the reviewers on this one.
> >
> > This patch makes the behaviour of the fair zone allocator policy
> > configurable. By default it will only distribute pages that are going
> > to exist on the LRU between zones local to the allocating process. This
> > preserves the historical semantics of MPOL_LOCAL.
> >
> > By default, slab pages are not distributed between zones after this patch is
> > applied. It can be argued that they should get similar treatment but they
> > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > and the interaction between the page allocator and kswapd is different
> > for slabs. If it turns out to be an almost universal win, we can change
> > the default.
> >
> > Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
> > ---
> > Documentation/sysctl/vm.txt | 32 ++++++++++++++
> > include/linux/mmzone.h | 2 +
> > include/linux/swap.h | 2 +
> > kernel/sysctl.c | 8 ++++
> > mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
> > 5 files changed, 134 insertions(+), 12 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 1fbd4eb..8eaa562 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > - swappiness
> > - user_reserve_kbytes
> > - vfs_cache_pressure
> > +- zone_distribute_mode
> > - zone_reclaim_mode
> >
> > ==============================================================
> > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> >
> > ==============================================================
> >
> > +zone_distribute_mode
> > +
> > +Pages allocation and reclaim are managed on a per-zone basis. When the
> > +system needs to reclaim memory, candidate pages are selected from these
> > +per-zone lists. Historically, a potential consequence was that recently
> > +allocated pages were considered reclaim candidates. From a zone-local
> > +perspective, page aging was preserved but from a system-wide perspective
> > +there was an age inversion problem.
> > +
> > +A similar problem occurs on a node level where young pages may be reclaimed
> > +from the local node instead of allocating remote memory. Unforuntately, the
> > +cost of accessing remote nodes is higher so the system must choose by default
> > +between favouring page aging or node locality. zone_distribute_mode controls
> > +how the system will distribute page ages between zones.
> > +
> > +0 = Never round-robin based on age
>
> I think we should be very conservative with the userspace interface we
> export on a mechanism we are obviously just figuring out.
>

And we have a proposal on how to limit this. I'll be layering another
patch on top and removes this interface again. That will allows us to
rollback one patch and still have a usable interface if necessary.

> > +Otherwise the values are ORed together
> > +
> > +1 = Distribute anon pages between zones local to the allocating node
> > +2 = Distribute file pages between zones local to the allocating node
> > +4 = Distribute slab pages between zones local to the allocating node
>
> Zone fairness within a node does not affect mempolicy or remote
> reference costs. Is there a reason to have this configurable?
>

Symmetry

> > +The following three flags effectively alter MPOL_DEFAULT, be careful.
> > +
> > +8 = Distribute anon pages between zones remote to the allocating node
> > +16 = Distribute file pages between zones remote to the allocating node
> > +32 = Distribute slab pages between zones remote to the allocating node
>
> Yes, it's conceivable that somebody might want to disable remote
> distribution because of the extra references.
>
> But at this point, I'd much rather back out anon and slab distribution
> entirely, it was a mistake to include them.
>
> That would leave us with a single knob to disable remote page cache
> placement.
>

When looking at this closer I found that sysv is a weird exception. It's
file-backed as far as most of the VM is concerned but looks anonymous to
most applications that care. That and MAP_SHARED anonymous pages should
not be treated like files but we still want tmpfs to be treated as
files. Details will be in the changelog of the next series.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/