Re: [PATCH] VM: add vm.free_node_memory sysctl

From: Martin Hicks
Date: Mon Aug 15 2005 - 11:05:53 EST



On Fri, Aug 05, 2005 at 11:48:58PM +0200, Andi Kleen wrote:
> On Fri, Aug 05, 2005 at 12:45:58PM -0500, Ray Bryant wrote:
>
> > discarded. I think what Martin and I would prefer to see is an interface
> > that allows one to just get rid of the clean page cache (or at least enough
> > of it) so that additional mapped page allocations will occur locally to the
> > node without causing swapping.
>
> That seems like a very special narrow case. But have you tried if memhog
> really doesn't work this way?

Yes. This *is* a very special narrow case. This doesn't really apply
to a desktop machine, nor to a normal Unix server load.

It *does* apply to cleaning up a node (or set of nodes) before running HPC
apps which really need to get local memory to perform correctly.

I'm not suggesting that this is something useful for your average
computer, but it is a feature that would make life a lot easier on big
machines running HPC apps.

The memhog approach works, but it may negatively impact the performance
of other jobs on the same node (by swapping or needlessly forcing the
node into memory reclaim and dirty page writeback).

> >
> > AFAIK, the number of mapped pages on the node is not exported to user space
> > (by, for example, /sys). So there is no good way to size the "slop" to
> > allow for an existing allocation. If there was, then using a bound memory
> > hog would likely be a reasonable replacement for Martin's syscall to release
> > all free page cache, at least for small to medium sized systems.
>
> I guess it could be exported without too much trouble.

I did exactly that (exported the per-node mapped page count) to test
out the memhog method.
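
To make the sizing problem concrete, here is a rough sketch of how a
node-bound memory hog could be sized once a mapped page count is
available. The meminfo-style text layout, the field names and the slop
value are all assumptions made for this illustration; this is not the
interface used in the test described above.

```python
import re

def parse_node_meminfo(text):
    """Parse lines such as 'Node 0 MemTotal:  1024 kB' into {field: kB}.

    The line layout is an assumption for this sketch.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*Node\s+\d+\s+(\w+):\s+(\d+)\s*kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def hog_size_kb(fields, slop_kb=0):
    """Estimate how much a node-bound hog could claim without forcing swap.

    Assumption: whatever is neither mapped nor reserved as slop is either
    free or page cache that can be pushed out.
    """
    size = fields["MemTotal"] - fields["Mapped"] - slop_kb
    return max(size, 0)

# Hypothetical sample values, not measurements from a real machine.
SAMPLE = """\
Node 0 MemTotal:  8388608 kB
Node 0 MemFree:   1048576 kB
Node 0 Mapped:    2097152 kB
"""

if __name__ == "__main__":
    print(hog_size_kb(parse_node_meminfo(SAMPLE), slop_kb=262144))
```

The slop is the weak point: set it too small and the hog pushes mapped
pages to swap, too large and page cache is left behind on the node.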

> > The reason we ended up with a sysctl/syscall (to control the aggressiveness
> > with which __alloc_pages will try to free page cache before spilling) is that
> > deciding whether or not to spend the effort to free up page cache pages on
> > the local node before spilling is a workload dependent optimization. For
> > an HPC application it is typically worth the effort to try to free local
> > node page cache before spilling off node because the program will run
> > sufficiently long for the improvement due to getting local storage to
> > dominate the extra cost of doing the page allocation. For file server
> > workloads, for example, it is typically important to minimize the time to do
> > the page allocation; if it turns out to be on a remote node it really doesn't
> > matter that much. So it seems to me that we need some way for the
> > application to tell the system which approach it prefers based on the type of
> > workload it is -- hence the sysctl or syscall approach.
>
> Ideally it should just work transparently. Maybe NUMA allocation
> should be a bit more aggressive at cleaning local pages before fallback.
> Problem is that it potentially makes the fast path slow.

This is what we need: a better level of control over how NUMA
allocations work. In some cases we *really* would prefer local pages,
even at the cost of page cache.

It does have the potential to make the fast path slower. Some workloads
are willing to make this sacrifice.
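
As a toy model of the tradeoff (not kernel code, and not how
__alloc_pages is implemented), the sketch below shows the two behaviours
side by side. The Node class, the allocate() function and the
reclaim_local flag are names invented for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    free: int          # free pages on the node
    clean_cache: int   # clean page cache pages that could be dropped

def allocate(nodes, local, want, reclaim_local=False):
    """Return (node_index, pages_reclaimed) for a request of `want` pages.

    With reclaim_local set, clean page cache on the local node is dropped
    before the request is allowed to spill to another node.
    """
    n = nodes[local]
    if n.free >= want:
        # Fast path: the local node has enough free pages.
        n.free -= want
        return local, 0
    if reclaim_local and n.free + n.clean_cache >= want:
        # Slower path: pay for local reclaim to keep the allocation local.
        need = want - n.free
        n.clean_cache -= need
        n.free = 0
        return local, need
    # Spill: take the first other node with enough free pages.
    for i, other in enumerate(nodes):
        if i != local and other.free >= want:
            other.free -= want
            return i, 0
    raise MemoryError("no node can satisfy the request")

if __name__ == "__main__":
    print(allocate([Node(10, 100), Node(200, 0)], 0, 50))
    print(allocate([Node(10, 100), Node(200, 0)], 0, 50, reclaim_local=True))
```

The extra branch is where the fast path gets slower, and that is the
cost a long-running HPC job would accept in exchange for local pages.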

mh

--
Martin Hicks || Silicon Graphics Inc. || mort@xxxxxxx