Re: [RFC] [PATCH 00/13] Introduce task_pid api

From: Cedric Le Goater
Date: Tue Nov 15 2005 - 17:55:33 EST

Next message: Paul Jackson: "Re: [PATCH 1/5] Light Fragmentation Avoidance V20:001_antidefrag_flags"
Previous message: Paul Jackson: "Re: [PATCH 0/5] Light Fragmentation Avoidance V20"
In reply to: Paul Jackson: "Re: [RFC] [PATCH 00/13] Introduce task_pid api"
Next in thread: Paul Jackson: "Re: [RFC] [PATCH 00/13] Introduce task_pid api"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Ray Bryant wrote:

> Personally, I think that these assumptions are incorrect for a
> checkpoint/restart facility. I think that:
>
> (1) It is really only possible to checkpoint/restart a cooperative process.

What do you mean by cooperative ? that the code should be modified to
cooperate with a checkpoint/restart tool ? do you have something else in mind ?

> For this to work with uncooperative processes you have to figure out (for
> example) how to save and restore the file system state.

Files are definitely very difficult to handle efficiently but there are
ways to deal with them. One way is not to deal with them at all and let the
application organize its data in such a way that we don't have to
checkpoint the file system, share storage is one solution, copying files
from a checkpointed node to another node is an other. It can be very
inefficient but it works.

But, I agree with you, we don't want to checkpoint a filesystem.

> (e. g. how do you get the file position set correctly for an open file in
> the restored program instance?)

well, if the file is available, lseek() should do the job. Pipes are more
difficult to handle than regular files in fact.

> And this doesn't even consider what to do with open network connections.

network connections are indeed very tricky. The network code is very
complex, very large, plenty of protocols to handle but it can be done for
TCP/IP by blocking the traffic, checkpointing the data and checkpointing
the PCBs. But tell me more, what are the main issues for you ?

Private interconnect are a challenge.

> Similarly, what does one do about the content of System V shared memory
> regions or the contents of System V semaphores?

Well, they have to be constrained in a known set of processes, or a
container, to make sure we are not checkpointing a moving target.

> I'm sure there are many more such problems we can come up with a careful
> study of the Linux/Unix API.

Oh yes, the UNIX API is very large but in checkpoint/restart we care more
about implementation. This can be tricky.

> (Note that "cooperation" in this context can also mean "willing to run inside
> of a container of some kind that supports checkpoint/restart".)

Indeed !

We need an isolation mecanism to make sure we control the boundaries of an
application. We don't want any leaks when we initiate a checkpoint.

> So you can probably only checkpoint the process at certain points in its
> lifetime, points which the application should be willing to identify in some
> way.

We do need to reach a quiescience point. a SIGSTOP is enough or a container
wide schedule(), a la software suspend. But no more.

> And I would argue that at such points in time, you can require that
> the current register state doesn't include the results of a system call such
> as getpid(), couldn't you?

Well, what if that register holds a virtualized pid, this is no more an
issue, nop ?

> (2) Checkpoint/Restart really only makes sense for a long running, resource
> intensive job. (e. g. for a job that is doing a lot of work and hence, for
> which, recovery is hard -- perhaps as hard as re-running the entire job).

HPC industry is indeed an obvious target.

However, we have successfully checkpinted desktop applications like
openoffice, thunderbird, mozilla, emacs, etc. We are still working on pty
in order to migrate terminals. We think in can also be useful in that area
and others.

> By their very nature, there are probably only a few such jobs running on the
> system. If there are lots of such jobs on the system, then re-running each
> one can't be that hard, can it?

hmm, didn't get your point here ? can you elaborate ?

> So, I guess my question is wrt the task_pid API is the following: Given that
> there are a lot of other problems to solve before transparent checkpointing
> of uncooperative processes is possible, why should this partial solution be
> accepted into the main line kernel and "set in stone" so to speak?

Well, let's say that we want to present this one step after the other and
try to have each step brings some interesting value to the linux kernel.

Process aggregation is the first big step, other projects have shown
interest in this area, PAGG for instance. Isolation is another. The
virtualization step could be thought as dedicated to checkpoint/restart but
we're pretty sure it should help some projects like vserser that need to
virtualize some ancestor pid. On that subject, having a way to manage
cluster wide pids could be useful to HPC batch managers.

> Don't get me wrong, I would love for there to be a commonly accepted
> checkpoint/restart API. But I don't think that this can be done
> transparently at the kernel level and without some cooperation from the
> target task.

Well, we've already migrated some pretty ugly applications, database
engines, without modifying them :)

C.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Paul Jackson: "Re: [PATCH 1/5] Light Fragmentation Avoidance V20:001_antidefrag_flags"
Previous message: Paul Jackson: "Re: [PATCH 0/5] Light Fragmentation Avoidance V20"
In reply to: Paul Jackson: "Re: [RFC] [PATCH 00/13] Introduce task_pid api"
Next in thread: Paul Jackson: "Re: [RFC] [PATCH 00/13] Introduce task_pid api"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]