Re: [PATCH] VM: add vm.free_node_memory sysctl

From: Ray Bryant
Date: Fri Aug 05 2005 - 12:49:29 EST


On Wednesday 03 August 2005 15:08, Andi Kleen wrote:

> >
> > Hmmm.... What happens if there are already mapped pages (e.g. mapped in
> > the sense that pages are mapped into an address space) on the node and
> > you want to allocate some more, but can't because the node is full of
> > clean page cache pages? Then one would have to set the memhog argument
> > to the right thing to
>

> If you have a bind policy in the memory grabbing program then the standard
> try_to_free_pages should DTRT. That is because we generated a custom zone
> list only containing nodes in that zone and the zone reclaim only looks
> into those.
>

It may depend on what your definition of DTRT is here. :-)

As I understand things, if we have a node that already has some mapped
memory allocated, and one starts up a "numactl --membind=<node> memhog
<nodesize - slop>" to clear some clean page cache pages from that node,
then unless the "slop" is sized in proportion to the amount of mapped
memory in use on the node, the existing mapped memory will get swapped
out in order to satisfy the new request, in addition to the clean
page-cache pages being discarded. I think what Martin and I would prefer
to see is an interface that lets one get rid of just the clean page cache
(or at least enough of it) so that additional mapped page allocations
occur locally on the node without causing swapping.
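
For concreteness, here is a minimal userspace sketch of the node-bound
memory hog idea. This is my own sketch, not an existing tool: it assumes
a 2.6 NUMA kernel and libnuma's mbind() wrapper (link with -lnuma), and
the node number and allocation size below are placeholders:

/*
 * Node-bound "memhog" sketch: allocate and touch pages under a BIND
 * policy so that allocation -- and hence reclaim -- is confined to
 * the target node.  Build: cc memhog-bind.c -lnuma
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>                /* mbind(), MPOL_BIND */

int main(void)
{
	unsigned long node = 0;            /* placeholder target node */
	unsigned long nodemask = 1UL << node;
	size_t len = 512UL << 20;          /* placeholder: nodesize - slop */

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Bind the range: the allocator now sees a zonelist containing
	 * only this node, so try_to_free_pages reclaims locally. */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		return 1;

	memset(p, 1, len);                 /* touch every page */
	munmap(p, len);                    /* hand the memory back */
	return 0;
}

Note that this is exactly where the sizing problem bites: every page
this program touches must come from the bound node, whether that means
dropping clean page cache or swapping out mapped pages.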

AFAIK, the number of mapped pages on a node is not exported to user
space (via, for example, /sys), so there is no good way to size the
"slop" to allow for an existing allocation. If there were, then using a
bound memory hog would likely be a reasonable replacement for Martin's
syscall to release all free page cache, at least for small to medium
sized systems.
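
To make the sizing concrete: if the kernel did export a per-node
mapped-page count, the arithmetic would be trivial. The sketch below is
purely hypothetical -- both inputs stand in for values that are not
actually available from /sys today:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;             /* example page size */
	unsigned long long node_bytes = 4ULL << 30; /* example: 4GB node */
	unsigned long mapped_pages = 100000;        /* would come from a
	                                               per-node counter,
	                                               if one existed */

	unsigned long long slop = (unsigned long long)mapped_pages * page_size;
	printf("memhog size: %llu bytes\n", node_bytes - slop);
	return 0;
}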

> With prefered or other policies it's different though, in that cases
> t_t_f_p will also look into other nodes because the policy is not binding.
>
> That said it might be probably possible to even make non bind policies more
> aggressive at freeing in the current node before looking into other nodes.
> I think the zone balancing has been mostly tuned on non NUMA systems, so
> some improvements might be possible here.
>
> Most people don't use BIND and changing the default policies like this
> might give NUMA systems a better "out of the box" experience. However this
> memory balance is very subtle code and easy to break, so this would need
> some care.
>

Of course!

> I don't think sysctls or new syscalls are the way to go here though.
>

The reason we ended up with a sysctl/syscall (to control the
aggressiveness with which __alloc_pages will try to free page cache
before spilling) is that deciding whether to spend the effort to free
page cache pages on the local node before spilling is a
workload-dependent optimization. For an HPC application it is typically
worth the effort to free local node page cache before spilling off node,
because the program runs long enough that the benefit of getting local
storage dominates the extra cost of the page allocation. For file server
workloads, by contrast, it is typically important to minimize the time
taken by the page allocation itself; if the page turns out to be on a
remote node, that doesn't matter much. So it seems to me that we need
some way for the application to tell the system which approach it
prefers, based on the type of workload it is -- hence the sysctl or
syscall approach.
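
As a back-of-the-envelope illustration of that tradeoff (ordinary
userspace C, not kernel code, and every constant below is invented for
the example): local reclaim pays off when its one-time cost is smaller
than the remote-access penalty accumulated over the lifetime of the
page.

#include <stdio.h>

int main(void)
{
	double reclaim_cost_us   = 50.0;  /* invented: extra time spent in
	                                     __alloc_pages freeing cache */
	double remote_penalty_ns = 120.0; /* invented: per-access penalty */
	double accesses_hpc      = 1e7;   /* long-running HPC job */
	double accesses_fs       = 1e2;   /* short-lived file-server page */

	double hpc_us = accesses_hpc * remote_penalty_ns / 1000.0;
	double fs_us  = accesses_fs  * remote_penalty_ns / 1000.0;

	printf("HPC: %.0f us remote penalty vs %.0f us reclaim -> %s\n",
	       hpc_us, reclaim_cost_us,
	       hpc_us > reclaim_cost_us ? "reclaim locally" : "spill");
	printf("fileserver: %.0f us remote penalty vs %.0f us reclaim -> %s\n",
	       fs_us, reclaim_cost_us,
	       fs_us > reclaim_cost_us ? "reclaim locally" : "spill");
	return 0;
}

With numbers like these the HPC job amortizes the reclaim cost many
thousands of times over, while the file server never earns it back --
which is why a single system-wide default can't serve both.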

> -Andi

--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)
