Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

From: Paul Jackson
Date: Sat Oct 02 2004 - 21:29:36 EST


Andrew writes:
>
> Despite what Paul says, his customers *do not* "require" physical isolation
> [*]. That's like an accountant requiring that his spreadsheet be written
> in Pascal. He needs slapping.

No - it's like an accountant saying the books for your two sole
proprietor Subchapter S corporations have to be kept separate.

Consider the following use case scenario, which emphasizes this
isolation aspect (and ignores other requirements, such as the need for
system admins to manage cpusets by name [some handle valid across
process contexts], with a system-wide permission model, exclusive-use
guarantees, and a well-defined, system-supported notion of which tasks
are "in" which cpuset at any point in time).
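To make that last part concrete, here is a minimal C sketch of driving
the cpuset filesystem interface proposed in the patch (mounted as
"mount -t cpuset cpuset /dev/cpuset", with per-cpuset "cpus", "mems",
"cpu_exclusive", "mem_exclusive" and "tasks" files). The cpuset name
and the CPU and node numbers are made up for illustration:

	/* Carve out a named, exclusive cpuset and move ourselves
	 * into it.  Assumes /dev/cpuset is already mounted.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static void write_file(const char *path, const char *buf)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0 || write(fd, buf, strlen(buf)) < 0) {
			perror(path);
			exit(1);
		}
		close(fd);
	}

	int main(void)
	{
		char pid[32];

		if (mkdir("/dev/cpuset/weather", 0755) < 0) {
			perror("mkdir");
			exit(1);
		}
		write_file("/dev/cpuset/weather/cpus", "64-127");
		write_file("/dev/cpuset/weather/mems", "16-31");
		write_file("/dev/cpuset/weather/cpu_exclusive", "1");
		write_file("/dev/cpuset/weather/mem_exclusive", "1");

		/* A task joins by writing its pid to "tasks". */
		snprintf(pid, sizeof(pid), "%d", getpid());
		write_file("/dev/cpuset/weather/tasks", pid);

		/* exec the application from here ... */
		return 0;
	}

Named, visible in the filesystem, permission-checked by the usual vfs
rules, and exclusive -- which is the point.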

===

You're running a 64-way, compute bound application on 64 CPUs of your
256 CPU system. The 64 threads are in lock step, tightly coupled, for
three days straight. You've sized the application, and the computer
you bought to run it, to within a few percent of the CPU cycles
available on 64 CPUs and the memory pages available on the nodes local
to those CPUs. It's an MPI (*) application in Fortran, using most of
the available bandwidth between those nodes for synchronization on
each loop of the computation. If a single thread slows down 10% for
any reason, the entire application slows down by that much (sometimes
worse), and you have big money on the table ensuring that doesn't
happen. You absolutely, positively have to complete that application
run on time, in three days (say it's a weather forecast for four days
out). You've varied the resolution of the computation, the size of
your input data set, and whatever else you could, in order to obtain
the most accurate answer possible in three days, not an hour longer.
If the runtimes jump around by more than 5% or 10%, some Vice
President starts losing sleep. If it's a 20% variation, that
sleep-deprived Vice President works for the computer company that sold
you the system. The boss of the boss of my boss ;).

I now know that every one of these 64 threads is pinned for those three
days. It's just as pinned as the graphics application that has to be
near its hardware. Due both to the latency effects of the several
levels of hardware cache (on the CPU chip and off), and to the
additional latency effects imposed by the software when it decides, on
a page fault, which node to place a page of memory on, nothing can
move. Not in, not out, not within. To within a fraction of a percent,
nothing else may be allowed onto those nodes, nothing of those 64
threads may be allowed off those nodes, and none of the threads may be
allowed to move within the 64 CPUs. And not just any random subset of
64 CPUs selected from the 256 available, but a subset that's "close"
together, given the complex geometries of these big systems
(minimizing the number of router hops between the furthest-apart pair
of CPUs in the set of 64).

(*) Message Passing Interface (MPI) - http://www.mpi-forum.org
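
Pinning each thread within that set is then the easy part; a sketch
using sched_setaffinity(2), per the current glibc prototype (the CPU
number and the pin_to_cpu helper name are mine, for illustration):

	/* Nail the calling thread to one CPU of the set.  Each of
	 * the 64 threads would do this with its own CPU number.
	 */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void pin_to_cpu(int cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);

		/* pid 0 means the calling thread */
		if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
			perror("sched_setaffinity");
			exit(1);
		}
	}

What sched_setaffinity alone cannot give you is the exclusion
guarantee -- that nothing *else* lands on those CPUs or their memory --
which is what the cpuset is for.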

===

It's a requirement, I say. It's a requirement. Let the slapping begin ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.650.933.1373