On Thursday 20 June 2002 12:23 pm, Jesse Pollard wrote:
<snip>
> You don't use compute servers much? The problems we are currently running
> require the cluster (IBM SP) to have 100% uptime for a single job. that
> job may run for several days. If a detected problem is reported (not yet
> catastrophic) it is desired/demanded to checkpoint the users process.
>
> Currently, we can't - but should be able to by this fall.
>
> Having the users job checkpoint midway in it's computations will allow us
> to remove a node from active service, substitute a different node, and
> resume the users process without losing many hours of computation (we have
> a maximum of 300 nodes for computation, another 30 for I/O and front end).
Have you tried Condor? Condor is a "high throughput computing" package,
specifically targetted at such applications, with the ability to checkpoint &
migrate jobs, etc. Condor is free as in beer, but currently not as in speech
(sorry), and is developed by the University of Wisconsin.
http://www.condorproject.org is the URL to learn more. Version 6.4.0 is in
the process of being released and should be available within the next couple
of days.
Condor runs on Linux (x86 & Alpha), Solaris, IRIX, HPUX, Digital Unix, and
NT, although the NT usually lags the Unix releases.
-Nick
Academic Staff at UW on the Condor Team
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sun Jun 23 2002 - 22:00:22 EST