Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Oren Laadan
Date: Sat Nov 20 2010 - 13:05:29 EST


Hi,

Based on a discussion with Gene, I'd like to clarify the key points of,
and differences between, the kernel and userspace approaches (specifically
linux-cr and dmtcp). I've split this long post into three parts:

part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of kernel/userspace approaches

[now relax, grab (another) cup of coffee and read on...]

PART I: ==PERSPECTIVE==

A rough classification of c/r categories:

* container-c/r: an important use-case, e.g. c/r and migration of an
application container such as a VPS (virtual private server), VDI
(desktop) or other self-contained application (e.g. Oracle server).
Here _all_ the relevant processes are included in the checkpoint.

* standalone-c/r: another use-case is standalone-c/r where a set of
processes is checkpointed, but not the entire environment, and then
those processes are restarted in a different "eco-system".

* distributed-c/r: meaning several sets of processes, each running
on a different host. (Each set may be a separate container there).

In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.

In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possibly _different_ eco-system. Some
applications don't care (e.g. 'bc'). Other applications do care, and to
different degrees; for these we need "glue" to pacify the application.

There are generally three types of "glue":

(1) Modify the application or selected libraries to be c/r-aware, and
notify it when restart completes. (e.g. CoCheck MPI library).
(2) Add a userspace helper that will run post-restart to do necessary
trickery (e.g. send a SIGWINCH to 'screen'; mount the proper filesystem
at the new host after migration; reconnect a socket to a peer).
(3) Use interposition on selected library calls and add wrapper code
that will glue in what's missing (e.g. dbus or nscd calls to
reconnect an application to those services).
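
To make glue type (2) concrete, here is a minimal sketch of a post-restart
helper that pokes the restarted processes with SIGWINCH, as in the 'screen'
example. This is only an illustration: the function name post_restart_glue
and the way the pids are obtained are assumptions, not part of linux-cr or
dmtcp.

```python
import os
import signal


def post_restart_glue(pids, sig=signal.SIGWINCH):
    """Send `sig` to each restarted process.

    Hypothetical helper: `pids` is assumed to be supplied by whatever
    tool performed the restart. Returns the pids that could not be
    signalled (already gone, or not ours to signal).
    """
    failed = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            failed.append(pid)
    return failed


if __name__ == "__main__":
    # Demo only: signal ourselves. The default action for SIGWINCH is
    # to ignore it, so this is harmless.
    print(post_restart_glue([os.getpid()]))
```

A restart tool would run something like this once the restart reports
success, passing the pid of the restarted 'screen'. Other glue steps
(remounting a filesystem, reconnecting a socket) would be added alongside
it in the same helper.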

IMPORTANT: the glueing method is _orthogonal_ to how the c/r is done!
We are strictly discussing the core c/r functionality.

(next part: linux-cr philosophy...)

Thanks,

Oren.