Re: Some thoughts on memory policies

From: Lee Schermerhorn
Date: Tue Jun 19 2007 - 16:25:15 EST


On Mon, 2007-06-18 at 13:22 -0700, Christoph Lameter wrote:
> I think we are getting into more and more of a mess with the existing
> memory policies. The refcount issue with shmem is just one bad symptom of
> it. Memory policies were intended to be process based and
> taking them out of that context causes various issues.

I don't think memory policies are in as much of a mess as Christoph
seems to think. Perhaps this is my ignorance showing. Certainly, there are
issues to be addressed--especially in the interaction of memory policies
with containers, such as cpusets. The shmem refcount issue may be one
of these issues--not sure how "bad" it is.

I agree that the "process memory policy"--i.e., the one set by
set_mempolicy()--is "process based", but I don't see the system default
policy as process based. The system default policy is, currently, the
policy of last resort for any allocation. And, [as I've discussed with
Christoph], I view policies applied via mbind() as applying to [some
range of] the "memory object" mapped at a specific address range. I
admit that this view is somewhat muddied by the fact that [private]
anonymous segments don't have any actual kernel structure to
represent them outside of a process's various anonymous VMAs and page
tables [and sometimes the swap cache]; and by the fact that the kernel
currently ignores policy that one attempts to place on shared regular
file mappings. However, I think "object-based policy" is a natural
extension of the current API and easily implemented with the current
infrastructure.
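
For concreteness, the distinction looks something like this from user
space [minimal sketch; error handling omitted, node numbers are just
examples, link with -lnuma for the syscall wrappers]:

#define _GNU_SOURCE
#include <numaif.h>		/* set_mempolicy(), mbind(), MPOL_* */
#include <sys/mman.h>
#include <stddef.h>

int main(void)
{
	/* Process policy: governs this task's future allocations. */
	unsigned long prefer = 1UL << 0;	/* prefer node 0 */
	set_mempolicy(MPOL_PREFERRED, &prefer, sizeof(prefer) * 8);

	/* "Object" policy: mbind() attaches a policy to the memory
	 * object mapped at this address range--here an anonymous
	 * segment interleaved across nodes 0 and 1. */
	size_t len = 16UL << 20;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodes = (1UL << 0) | (1UL << 1);
	mbind(p, len, MPOL_INTERLEAVE, &nodes, sizeof(nodes) * 8, 0);

	return 0;
}

The process policy follows the task; the mbind() policy stays with the
mapped range/object--which is the view I'm arguing for above.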

>
> I have thought for a long time that we need something to replace memory
> policies especially since the requirements on memory policies go far
> beyond just being process based. So some requirements and ideas about
> memory policies.

Listing the requirements is a great idea. But I won't go so far as to
agree that we need to "replace memory policies" so much as rationalize
them for all the desired uses/contexts...

>
> 1. Memory policies must be attachable to a variety of objects
>
> - Device drivers may need to control memory allocations in
> devices either because their DMA engines can only reach a
> subsection of the system or because memory transfer
> performance is superior in certain nodes of the system.
>
> - Process. This is the classic usage scenario
>
> - File / Socket. One may have particular reasons to place
> objects on a set of nodes because of how the threads of
> an application are spread out in the system.

...how the tasks/threads are spread out and how the application accesses
the pages of the objects. Some accesses--e.g., to unmapped pages of
files--are implicit or transparent to the task. I guess any pages
associated with a socket would also be transparent to the application?

>
> - Cpuset / Container. Some simple support is there with
> memory spreading today. That could be made more universal.

I've said before that I view cpusets as administrative constraints on
applications, whereas policies are something that can be controlled by
the application or a non-privileged user. As cpusets evolve into more
general "containers", I think they'll become less visible to the
applications running within them. The application will see the
container as "the system"--at least, the set of system resources to
which the application has access.

The current memory policy APIs can work in such a "containerized"
environment if we can reconcile the policy APIs' notion of nodes with
the set of nodes that the container allows. Perhaps we need to revisit
the "cpumemset" proposal that provides a separate node id namespace in
each container/cpuset. At a minimum, I think a task should be able to
query the set of nodes that it can use and/or have the system "do the
right thing" if the application specifies "all possible nodes" for, say,
an interleave policy.
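
For example, a task can already discover its allowed nodes via the
cpuset filesystem--the sketch below assumes the conventional /dev/cpuset
mount point, which real code would look up in /proc/mounts:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char cpuset[256], path[512], mems[256];
	FILE *f;

	/* which cpuset am I in? */
	f = fopen("/proc/self/cpuset", "r");
	if (!f || !fgets(cpuset, sizeof(cpuset), f))
		return 1;
	fclose(f);
	cpuset[strcspn(cpuset, "\n")] = '\0';

	/* read that cpuset's memory nodes, e.g. "0-3" */
	snprintf(path, sizeof(path), "/dev/cpuset%s/mems", cpuset);
	f = fopen(path, "r");
	if (!f || !fgets(mems, sizeof(mems), f))
		return 1;
	fclose(f);

	printf("allowed mems: %s", mems);
	return 0;
}

An API that did this for the application--and mapped the result into
whatever node id namespace the container exposes--would be friendlier.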

>
> - System policies. The system policy is currently not
> modifiable. It may be useful to be able to set this.
> Small NUMA systems may want to run with interleave by default

Agreed. And, on our platforms, it would be useful to have a separately
specifiable system-wide [or container-wide] default page cache policy.

>
> - Address range. For the virtual memory address range
> this is included in todays functionality but one may also
> want to control the physical address range to make sure
> f.e. that memory is allocated in an area where a device
> can reach it.

For application usage? Does this mean something like an MPOL_MF_DMA
flag?

One way to handle this w/o an explicit 'DMA' flag in the user space APIs is
to mmap() the device that would use the memory and allow the device
driver to allocate the memory internally with the appropriate DMA/32
flags and map that memory into the task's address space. I think that
works today.
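
Something like the following--where /dev/mydev and its driver are
hypothetical, and the driver's mmap() method does the GFP_DMA/GFP_DMA32
allocation internally:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/mydev", O_RDWR);	/* hypothetical device */
	if (fd < 0)
		return 1;

	/* driver allocates DMA-reachable memory and maps it here */
	void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* ... hand buf to the device via the driver's ioctl()s ... */

	munmap(buf, 1 << 20);
	close(fd);
	return 0;
}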

What other usage scenarios are you thinking of?

>
> - Memory policies need to be attachable to types of pages.
> F.e. executable pages of a threaded application are best
> spread (or replicated) whereas the stack and the data may
> best be allocated in a node local way.
> Useful categories that I can think of
> Stack, Data, Filebacked pages, Anonymous Memory,
> Shared memory, Page tables, Slabs, Mlocked pages and
> huge pages.

Rather, I would say, to "types of objects". I think all of the "types
of pages" you mention [except, maybe, mlocked?] can be correlated to
some structure/object to which policy can be attached. Regarding
"Mlocked pages"--are you suggesting that you might want to specify that
mlocked pages have a different policy/locality than other pages in the
same object?

Stack and data/heap can easily be handled by always defaulting the
process policy to node local [or perhaps interleaved across the nodes in
the container, if node local results in hot spots or other problems],
and explicitly binding other objects of interest, if performance
considerations warrant, using the mbind() API or fixed/heuristic
defaults.

>
> Maybe a set of global policies would be useful for these
> categories. Andy hacked subsystem memory policies into
> shmem and it seems that we are now trying to do the same
> for hugepages. Maybe we could get to a consistent scheme
> here?

Christoph, I wish you wouldn't characterize Andi's shared policy
infrastructure as a hack. I think it provides an excellent base
implementation for [shared] object-based policies. It extends easily to
any object that can be addressed by offset [page offset, hugepage
offset, ...]. The main issue is the generic one of memory policy on
objects that can be shared by processes running in separate cpusets,
whether the sharing is intentional or not.

>
> 2. Memory policies need to support additional constraints
>
> - Restriction to a set of nodes. That is what we have today.

See "locality placement" below.

>
> - Restriction to a container or cpuset. Maybe restriction
> to a set of containers?

I don't know about a "set of containers", but perhaps you are referring
to sharing of objects between applications running in different
containers with potentially disjoint memory resources? That is
problematic. We need to enumerate the use cases for this and what the
desired behavior should be.

Christoph and I discussed one scenario: backup running in a separate
cpuset, disjoint from an application that mmap()s a file shared and
installs a shared policy on it [my "mapped file policy" patches would
enable this]. If the application's cpuset contains sufficient memory
for the application's working set, but NOT enough to hold the entire
file, the backup running in another cpuset reading the entire file may
push out pages of the application from its cpuset because the object
policy constrains the pages to be located in the application's cpuset.

>
> - Strict vs no strict allocations. A strict allocation needs
> to fail if the constraints cannot be met. A non strict
> allocation can fall back.

Agreed. And I think this needs to be explicit in the allocation
request. Callers requesting strict allocation [including "no wait"]
should be prepared to handle failure of the allocation.
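
On the kernel side, that might look like the sketch below--the gfp flags
are chosen by me for illustration, not taken from any existing patch--
with the caller providing the non-strict fallback explicitly:

/* sketch: strict, node-local attempt whose failure the caller handles */
static struct page *alloc_local_or_fallback(int nid)
{
	struct page *page;

	/* strict: stay on 'nid', don't retry hard, may return NULL */
	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE |
				__GFP_NORETRY | __GFP_NOWARN, 0);
	if (!page)
		/* non-strict: let the allocator fall back to other nodes */
		page = alloc_pages_node(nid, GFP_KERNEL, 0);

	return page;
}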

>
> - Order of allocation. Higher order pages may require
> different allocation constraints? This is like a
> generalization of huge page policies.

Agreed. On our platform, I'd like to keep default huge page allocations
and interleave requests off the "hardware interleaved pseudo-node" as
that is "special" memory. I'd like to reserve it for access only by
explicit request. The current model doesn't support this, but I think
it could, with a few "small" enhancements. [TODO]

>
> - Locality placement. These are node local, interleave etc.

How is this different from "restriction to a set of nodes" in the
context of memory policies [1st bullet in section 2]? I tend to think
of memory policies--whether default or explicit--as "locality placement"
and cpusets as "constraints" or restrictions on what policies can do.

>
> 3. Additional flags
>
> - Automigrate flag so that memory touched by a process
> is moved to a memory location that has best performance.

Automigration can be turned on/off in the environment--e.g., per
container/cpuset, but perhaps there is a use case for more explicit
control over automigration of specific pages of an object?

"Lazy migration" or "migrate on fault" is fairly easy to achieve atop
the existing migration infrastructure. However, it requires a fault to
trigger the migration. One can arrange for these faults to occur
explicitly--e.g., via a straightforward extension to mbind() with
MPOL_MF_MOVE and a new MPOL_MF_LAZY flag to remove the page translations
from all page tables resulting in a fault, and possible migration, on
next touch. Or, one can arrange to automatically "unmap" [remove ptes
referencing] selected types of pages when the load balancer moves a task
to a new node.
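
The mbind() extension might look like the following from user space.
MPOL_MF_LAZY is the proposed flag--its value and the exact call
semantics here are purely illustrative:

#include <numaif.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* hypothetical flag value */
#endif

/* Unmap the range's ptes so that the next touch faults and migrates
 * the page toward the faulting task, rather than migrating now. */
long lazy_migrate(void *addr, unsigned long len)
{
	return mbind(addr, len, MPOL_DEFAULT, NULL, 0,
		     MPOL_MF_MOVE | MPOL_MF_LAZY);
}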

I've seen fairly dramatic reductions in real, user and system time in,
e.g., kernel builds on a heavily loaded [STREAMS benchmark running] NUMA
platform with automatic/lazy migration patches: ~14% real, ~4.7% user
and ~22% system time reductions.

>
> - Page order flag that determines the preferred allocation
> order. Maybe useful in connection with the large blocksize
> patch to control anonymous memory orders.

Agreed. "requested page order" could be a component of policy, along
with locality.

>
> - Replicate flags so that memory is replicated.

This could be a different policy mode, MPOL_REPLICATE. Or, as with
Nick's prototype, it could be the default behavior for read-only access
to page cache pages when no explicit policy exists on the object [file].

For "automatic, lazy replication, one would also need a fault to trigger
the replication. This could be achieved by removing the pte from only
the calling task's page table via mbind(MOVE+LAZY) or automatically on
inter-node task migration. The resulting fault, when that corresponding
virtual address is touched, would cause Nick's page cache replication
infrastructure to create/use a local copy of the page. It's "on my
list" ...
>
> 4. Policy combinations
>
> We need some way to combine policies in a systematic way. The current
> hierarchy from System->cpuset->process->memory range no longer
> works if a process can use policies set up in shmem or huge pages.
> Some consistent scheme to combine memory policies would also need
> to be able to synthesize different policies. I.e. automigrate
> can be combined with node local or interleave and a cpuset constraint.

The big issue, here, for me, is the interaction of policy on shared
objects [shmem and shared regular file mappings] referenced from
different containers/cpusets. Given that we want to allow this--almost
can't prevent it in the case of regular file access--we need to specify
the use cases, what the desired behavior is for each such case, and
which scenarios to optimize for.

>
> 5. Management tools
>
> If we make the policies more versatile then we need the proper
> management tools in user space to set and display these policies
> in such a way that they can be managed by the end user. The esoteric
> nature of memory policy semantics makes them difficult to comprehend.

/proc/<pid>/numa_maps works well [with my patches] for objects mapped
into a task's address space. What it doesn't work so well for are:
1) shared policy on currently unattached shmem segments and 2) shared
policy on unmapped regular files, should my patches be accepted. [Note,
however, we need not retain shared policy on regular files after the
last shared mapping is removed--my recommended persistence model.]


> 6. GFP_xx flags may actually be considered as a form of policy

Agreed. For kernel internal allocation requests...

>
> i.e. GFP_THISNODE is essentially a one node cpuset.

sort of behaves like one, I agree. Or like an explicit MPOL_BIND with a
single node.

>
> GFP_DMA and GFP_DMA32 are physical address range constraints.

with platform specific locality implications...

>
> GFP_HARDWALL is a strict vs. nonstrict distinction.
>
>
> 7. Allocators must change
>
> Right now the policy is set by the process context which is bad because
> one cannot specify a memory policy for an allocation. It must be possible
> to pass a memory policy to the allocators and then get the memory
> requested.

Agreed. In my shared/mapped file policy patches, I have factored an
alloc_page_pol() function out of alloc_page_vma(). The modified
alloc_page_vma() calls get_vma_policy() [as does the current version] to
obtain the policy at the specified address in the calling task's virtual
address space or some default policy, and then calls alloc_page_pol() to
allocate a page based on that policy. I can then use the same
alloc_page_pol() function to allocate page cache pages after looking up
a shared policy on a mapped file or using the default policy for page
cache allocations [currently process->system default]. Perhaps other
page allocators could use alloc_page_pol() as well?
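
Roughly--the signatures and helper names below are approximations of
the description above, not the actual patches:

/* in mm/mempolicy.c; interleave_nid_off() is a placeholder for the
 * interleave node calculation based on the object offset */
static struct page *alloc_page_pol(gfp_t gfp, struct mempolicy *pol,
				   pgoff_t pgoff)
{
	if (unlikely(pol->policy == MPOL_INTERLEAVE))
		return alloc_pages_node(interleave_nid_off(pol, pgoff),
					gfp, 0);

	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
}

/* alloc_page_vma() becomes a thin wrapper: look up the policy in
 * effect at 'addr', then allocate according to that policy. */
struct page *alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma,
			    unsigned long addr)
{
	struct mempolicy *pol = get_vma_policy(current, vma, addr);
	pgoff_t pgoff = vma->vm_pgoff +
			((addr - vma->vm_start) >> PAGE_SHIFT);

	return alloc_page_pol(gfp, pol, pgoff);
}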

>
>
> I wish we could come up with some universal scheme that encompasses all
> of the functionality we want and that makes memory more manageable....

I think it's possible and that the current mempolicy support can be
evolved with not too much effort. Again, the biggest issue for me is
the reconciliation of the policies with the administrative constraints
imposed by subsetting the system via containers/cpusets--especially for
objects that can be referenced from more than one container. I think
that any reasonable, let alone "correct", solution would be
workload/application dependent.


Lee
