Sandy Harris <pashley@storm.ca> writes:
> [ I removed half a dozen cc's on this, and am just sending to the
> list. Do people actually want the cc's?]
>
> Larry McVoy wrote:
>
> > > Checkpointing buys three things. The ability to preempt jobs, the
> > > ability to migrate processes,
> For large multi-processor systems, it isn't clear that those matter
> much.
The systems that are built because there is no machine that can
run your compute intensive application fast enough they matter quite
a bit.
> What combination of resources and loads do you think preemption
> and migration are need for?
Good answers have already been given.
The problem domain I am looking at are compute clusters. The
solutions are useful elsewhere but in compute clusters they are
extremely valuable.
> > > and the ability to recover from failed nodes, (assuming the
> > > failed hardware didn't corrupt your jobs checkpoint).
>
> That matters, but it isn't entirely clear that it needs to be done
> in the kernel.
I agree, glibc would be fine, but it must be below the level of
the application. Generally it is a pretty onerous task to checkpoint
a random program. For a proof attempt to checkpoint your X desktop,
the infrastructure is there to do it.
Every application must be capable of checkpointing it for the cluster
batch scheduler to take advantage of it.
Example case.
[Preemption]
You start job 1, a compute intensive application that runs for 4 days,
on 100 cpus. Your job is low priority.
In comes job2, a high priority job that runs for 4 hours and needs 256
cpus.
job1 is preempted. With checkpoint support it can be saved and
restarted later. Without checkpointing support it is simply killed.
[Migration]
Migration is needed for failing hardware or to get low priority jobs
out of the way onto less capable nodes that are going unused.
Or to restart a job that failed on other hardware.
Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sun Jun 23 2002 - 22:00:24 EST