Can cpusets help me/us/Linux to get closer to these requirements?
A clear yes. Regard cpusets as a new kind of composite resource built
from memory and CPUs. They can play the role of the resource groups we
need. Disjoint cpusets can run jobs which will almost never interfere
with each other, CPU-cycle- or memory-wise. This can be easily
integrated into PBS/LSF
or whatever batch resource manager comes to mind. Cpusets selected
with some knowledge of the NUMA characteristics of a machine always
guarantee reproducible, optimal compute performance. If a job runs
alone in a cpuset, it runs as if the machine had been reduced to that
piece and were owned exclusively by the job. Also, if the set
contains as many CPUs as there are MPI processes, the cpuset provides
a sort of gang scheduling (i.e. all members of a parallel job get
cycles at the same time, which reduces barrier synchronisation times,
improves performance and makes it more predictable). This is something
one absolutely needs on big machines when dealing with time-critical,
highest-performance applications. Permanently losing 10% because the
CPU placement is poor, or because one has to get some other process
out of the way, is simply unacceptable. When you sell machines for
several million, a 10% performance loss translates into quite a lot
of money.
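To make the placement concrete, here is a minimal sketch of the cpuset
filesystem interface, assuming the usual /dev/cpuset mount and its
control files (cpus, mems, cpu_exclusive, mem_exclusive, tasks). The
helper name is made up, and the mount point is a parameter so the
sketch can be tried against a scratch directory without root:

```python
import os

def make_exclusive_cpuset(mount, name, cpus, mems, pid):
    """Carve out an exclusive cpuset and move a task into it.

    'mount' would be /dev/cpuset on a real machine (after
    'mount -t cpuset none /dev/cpuset'); here it is a parameter so
    the sketch can be exercised against a scratch directory.
    """
    path = os.path.join(mount, name)
    os.mkdir(path)  # creating a directory creates a child cpuset
    # Writing "1" to the *_exclusive files keeps sibling cpusets
    # from overlapping our CPUs and memory nodes; writing a pid to
    # "tasks" moves that task (and future children) into the set.
    for ctl, val in (("cpus", cpus),
                     ("mems", mems),
                     ("cpu_exclusive", "1"),
                     ("mem_exclusive", "1"),
                     ("tasks", str(pid))):
        with open(os.path.join(path, ctl), "w") as f:
            f.write(val)
    return path
```

A batch prologue would call something like
make_exclusive_cpuset("/dev/cpuset", "job42", "0-3", "0", launcher_pid),
and every process forked by the launcher would inherit the placement.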
Can CKRM (as it is now) fulfil the requirements?
I don't think so. CKRM gives me, to some extent, the confidence that
I will really get to use the part of the machine I paid for, say 50%.
But it doesn't care about the structure of the machine. CKRM tries to
give a user as much of the machine as possible, at least the amount
he paid
for. For example: when I come in with my job, the machine might
already be running another job whose user also paid for 50% but was
the only user and got 100% of the machine (say some Java application
with enough threads...). This job may have filled up most of the
memory and be using all CPUs. CKRM will take care of getting me
cycles (maybe even exclusively on 50% of the CPUs) and will treat my
job preferentially when allocating memory, but it will not care about
the placement of the CPUs and the memory. Neither will it care
whether the previously running job is still using my memory blocks,
reducing my bandwidth to them. So I get 50% of the cycles and the
memory, but these will be BAD CYCLES and BAD MEMORY. My job will run
slower than it could, and a
second run will again be different. Don't misunderstand me: CKRM in
its current state is great for other things, and running it inside a
cpuset sounds like a good thing to do.
What about integration with PBS/LSF and the like?
It makes sense to let an external resource manager (batch or
non-batch) keep track of and manage cpusets as resources. It can
allocate them, give them to jobs (exclusively), and delete them.
That's
perfect and exactly what we want. CKRM is a resource manager itself
and has its own idea of resources. Certainly PBS/LSF/etc. could
create a CKRM class for each job and run it in that class. The
difficulty is keeping the resource managers from interfering and
working against each other. In such a setup I'd rather expect a batch
manager
to be started inside one CKRM class and let it ensure that e.g. the
interactive class isn't starved by the batch class.
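The allocate/run/delete lifecycle such a batch manager would drive can
be sketched as follows. Only the cpuset control files are real; the
helper name, job object, and run_job callback are made up for
illustration:

```python
import os, shutil

def run_in_cpuset(mount, jobid, cpus, mems, run_job):
    """Hypothetical batch-manager hook: give a job an exclusive
    cpuset for its lifetime, then tear the set down again."""
    path = os.path.join(mount, "job%s" % jobid)
    os.mkdir(path)
    for ctl, val in (("cpus", cpus), ("mems", mems),
                     ("cpu_exclusive", "1"), ("mem_exclusive", "1")):
        with open(os.path.join(path, ctl), "w") as f:
            f.write(val)
    try:
        # the job launcher is expected to write its own pid into
        # path/tasks before starting the actual processes
        return run_job(path)
    finally:
        # on a real cpuset fs an emptied set is removed with a bare
        # rmdir; rmtree here so the sketch also works on a scratch dir
        shutil.rmtree(path)
```

The point of the try/finally is that the cpuset is returned to the
pool even when the job fails, which is exactly the bookkeeping we want
the external resource manager, not the kernel, to do.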
Can CKRM be extended to do what cpusets do?
Certainly. Probably easily. But cpusets would have to be reinvented, I
guess. Same hooks, same checks, different user interface...