Re: [PATCH 1/5] cpuset memory spread basic implementation

From: Ingo Molnar
Date: Mon Feb 06 2006 - 01:02:12 EST



* Andrew Morton <akpm@xxxxxxxx> wrote:

> Paul Jackson <pj@xxxxxxx> wrote:
> >
> > This policy can provide substantial improvements for jobs that
> > need to place thread-local data on the corresponding node, but
> > that also need to access large file system data sets which must
> > be spread across the several nodes in the job's cpuset in order
> > to fit. Without this patch, especially for jobs that might
> > have one thread reading in the data set, the memory allocation
> > across the nodes in the job's cpuset can become very uneven.
>
>
> It all seems rather ironic. We do vast amounts of development to make
> certain microbenchmarks look good, then run a real workload on the
> thing and find that all those microbenchmark-inspired tweaks actually
> deoptimised the real workload. So now we need to add per-task knobs
> to turn off the previously added microbenchmark tweaks.
>
> What happens if one process does lots of filesystem activity and
> another one (concurrent or subsequent) wants lots of thread-local
> storage? Won't the same thing happen?
>
> IOW: this patch seems to be a highly specific band-aid which is
> repairing a problem of our own ill-advised making, does it not?

I suspect it all depends on whether the workload is 'global' or 'local'.
Let's consider the hypothetical case of a 16-node box with 64 CPUs and 1
TB of RAM, which could have two fundamental types of workloads:

- lots of per-CPU tasks which are highly independent, each doing its
own stuff. For this case we really want to allocate everything per-CPU
and as close to the task as possible.
- 90% of the 1 TB of RAM is in a shared 'database' that is accessed by
all nodes in an unpredictable pattern, from any CPU. For this case we
want to 'spread out' the pagecache as much as possible. If we don't
spread it out then one task - e.g. an initialization process - could
create a really bad distribution for the pagecache: big contiguous
ranges allocated on the same node. If the workload has randomness but
also occasional "medium range" locality, an uneven portion of the
accesses could go to the same node, hosting some big contiguous chunk
of the database. So we want to spread it out in as fine-grained a way
as possible. (Perhaps not too fine-grained though, to let block IO
still be reasonably batched; a sketch contrasting the two strategies
follows below.)
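
as an illustration (not part of the patch), here is a minimal
user-space sketch of the two strategies, using libnuma's high-level
allocators; the buffer sizes and the use of libnuma are my own
assumptions:

/*
 * Minimal sketch contrasting the two placement strategies with
 * libnuma's high-level allocators.  Sizes are arbitrary.
 * Build with:  gcc -o placement placement.c -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void)
{
	size_t local_sz = 64UL << 20;		/* 64 MB, arbitrary */
	size_t shared_sz = 1UL << 30;		/* 1 GB, arbitrary */
	void *local_buf, *shared_buf;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this box\n");
		return 1;
	}

	/* case 1: per-task working set - keep it on the local node */
	local_buf = numa_alloc_local(local_sz);

	/*
	 * case 2: big shared 'database' - interleave it across all nodes
	 * so no single node ends up hosting a huge contiguous chunk
	 */
	shared_buf = numa_alloc_interleaved(shared_sz);

	if (!local_buf || !shared_buf) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}

	/* touch the memory so the pages actually get faulted in and placed */
	memset(local_buf, 0, local_sz);
	memset(shared_buf, 0, shared_sz);

	printf("local buffer near this CPU, shared buffer spread over %d nodes\n",
	       numa_max_node() + 1);

	numa_free(local_buf, local_sz);
	numa_free(shared_buf, shared_sz);
	return 0;
}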

We normally optimize for the first case, and it works pretty well on
both SMP and NUMA. We do pretty well with the second workload on SMP,
but on NUMA, the non-spreading can hurt. So it makes sense to
artificially 'interleave' all the RAM that goes into the pagecache, to
get a good statistical distribution of pages.
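
the spreading itself can be as simple as a per-task rotor that hands
out the allowed nodes round-robin for successive allocations. A
simplified user-space sketch of that idea (illustrative only - the
names and types are made up, this is not the kernel code):

/*
 * Per-task "rotor" spreading successive allocations round-robin over
 * the nodes a task is allowed to use.  Assumes at least one allowed
 * node.
 */
#include <stdio.h>

#define MAX_NODES 16

struct task_spread {
	unsigned long allowed;	/* bitmask of allowed nodes */
	int rotor;		/* last node we handed out */
};

/* pick the next allowed node after the rotor, wrapping around */
static int spread_next_node(struct task_spread *t)
{
	int n = t->rotor;

	do {
		n = (n + 1) % MAX_NODES;
	} while (!(t->allowed & (1UL << n)));
	t->rotor = n;
	return n;
}

int main(void)
{
	/* a task confined to nodes 2, 3 and 5 of a hypothetical cpuset */
	struct task_spread t = { (1UL << 2) | (1UL << 3) | (1UL << 5), 0 };
	int i;

	/* eight successive pagecache pages land on nodes 2 3 5 2 3 5 2 3 */
	for (i = 0; i < 8; i++)
		printf("page %d -> node %d\n", i, spread_next_node(&t));
	return 0;
}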

Neither workload is broken, nor was it a design mistake to optimize
the SMP case for the first one - it really is the common thing on most
boxes. But the second workload does happen too, and it conflicts with
the first workload's needs. The difference between the workloads cannot
be bridged by the kernel: it is two very different access patterns that
result from the problem the application is trying to solve - the kernel
cannot influence that.

I suspect there is no other way but to let the application tell the
kernel which strategy it wants used.
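
with this patch series, that choice becomes a per-cpuset knob: whoever
sets up the job's cpuset writes to its memory_spread_page file. A tiny
sketch of what that could look like - the /dev/cpuset mount point and
the cpuset name "bigdb" are assumptions of mine, not something the
patch mandates:

/* turn on pagecache spreading for one job's cpuset */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/dev/cpuset/bigdb/memory_spread_page", "w");

	if (!f) {
		perror("memory_spread_page");
		return 1;
	}
	/* 1 = spread this cpuset's pagecache over its allowed nodes */
	fputs("1\n", f);
	fclose(f);
	return 0;
}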

Ingo