Re: [PATCH 1/5] cpuset memory spread basic implementation

From: Ingo Molnar
Date: Mon Feb 06 2006 - 16:06:47 EST


* Christoph Lameter <clameter@xxxxxxxxxxxx> wrote:

> On Mon, 6 Feb 2006, Ingo Molnar wrote:
>
> > it's a feature, not a weird effect! Under the VFS-driven scheme, if two
> > projects (one 'local' and one 'global') can access the same (presumably
> > big) file, then the sysadmin has to make up his mind and determine which
> > policy to use for that file. The file will either be local, or global -
> > consistently.
>
> But that local or global allocation policy depends on what task is
> accessing the data at what time. A simple grep should not result in
> interleaving. A big application accessing the same data from multiple
> processes should have interleaving for shared data. Both may not be
> active at the same time.

the grep faults in the pagecache, and depending on which job is active
first, the placement of the pages will either be spread out or local,
depending on the timing of those jobs. How do you expect this to behave
deterministically?

> > with the per-cpuset policy approach on the other hand it would be
> > non-deterministic which policy the file gets allocated under: whichever
> > cpuset first manages to touch that file. That is what i'd call a weird
> > and undesirable effect. This weirdness comes from the conceptual hickup
> > of attaching the object-allocation policy to the workload, not to the
> > file objects of the workload - hence conflicts can arise if two
> > workloads share file objects.
>
> Well these weird effects are then at least expected since there was a
> cpuset set up for applications to activate this effect and the
> processes running in that cpuset will behave in the weird way we want.

nondeterministic placement of pagecache pages sure looks nasty. In most
cases i suspect what matters are project-specific data files - which
will be allocated deterministically because they are mostly private to
the cpuset. But e.g. /usr files want to be local in most cases, even for
a 'spread out' cpuset. Why would you want to allocate them globally?

> The mountpoint option means that reading the contents of a file in
> some filesystems is slower than in others because some files spread
> their pages all over the system while others are node local. Again if
> the process is single threaded the node local is always the right
> approach. These single threaded processes will no longer be able to
> run with full pagecache speed. Memory will be used in other nodes that
> may have been reserved for other purposes by the user.

but a single object cannot be allocated both locally and globally!
(well, it could be, for read-mostly workloads, but lets ignore that
possibility) So instead of letting chance determine it, it is the most
natural thing to let the object (or its container) determine which
strategy to use - not the workload. This avoids the ambiguity at its
core.

so if two projects want to use the same file in two different ways at
the same time then there is no solution either under the VFS-based or
under the cpuset-based approach - but at least the VFS-based method is
fully predictable, and wont depend on which science department starts
its simulation job first ...

if two projects want to use the same file in two different ways but not
at the same time, then again the VFS-based method is better: each
project, when it starts to run, could/should set the policy of that
shared data. (which setting of policy would also flush all pagecache
pages of the affected file[s], flushing any prior incorrect placement of
pages)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/