Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Oren Laadan
Date: Sat Nov 06 2010 - 18:55:50 EST




On 11/05/2010 08:36 PM, Kapil Arya wrote:
>> I'm probably missing something but can't you stop the application
>> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
>> about -EINTR failures (there are some exceptions but nothing really to
>> worry about). Also, unless the manager thread needs to be always
>> online, you can inject manager thread by manipulating the target
>> process states while taking a snapshot.
>
> In fact CryoPid uses exactly the same approach and has been around for about 5
> years. Not as much development effort has gone into CryoPid as DMTCP and so its
> application coverage is not as broad. But the larger issue for using PTRACE is
> that you can not have two superiors tracing the same inferior process. So if you
> want to checkpoint a gdb session or valgrind or tmux or strace, then you can not
> directly control and quiesce the inferior process being traced.
>
> Beyond that, we also have a vision (not yet implemented) of process
> virtualization by which one can change the behavior of a program. For example,
> if a distributed computation runs over infiniband, can we migrate to a TCP/IP
> cluster? For this, one needs the flexibility of wrappers around system calls.
> This vision of process virtualization also motivates why our own research
> project has steered away from in-kernel C/R.

This is a very useful vision. However, it is unrelated to how you
do c/r, but rather to what you do after you restart and before you
let the application resume execution.

In your example, you'd need to wrap the library calls (e.g. those of
the MPI implementation) and replace them with versions that use TCP/IP
or infiniband. Wrapping system calls won't help you.

Or you could just replace the resource - e.g., make the restarted
application use a socket for stdout instead of the tty, so you can
redirect the output to wherever you like.
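
As a rough illustration of replacing a resource (a sketch only, in
Python for brevity; this is not linux-cr or MTCP code, and the helper
name replace_fd_with_socket is made up for this example), the
descriptor is simply rebound with dup2() before the application
resumes. A scratch descriptor stands in for stdout (fd 1) here so the
demo does not disturb the caller's real output:

```python
import os
import socket


def replace_fd_with_socket(target_fd: int, sock: socket.socket) -> int:
    """Make target_fd refer to sock; return a duplicate of the old descriptor."""
    saved = os.dup(target_fd)          # keep the original resource around
    os.dup2(sock.fileno(), target_fd)  # target_fd now refers to the socket
    return saved


def demo() -> bytes:
    # Assumption: a scratch descriptor plays the role of the app's stdout.
    scratch = os.open(os.devnull, os.O_WRONLY)
    ours, theirs = socket.socketpair()
    saved = replace_fd_with_socket(scratch, ours)
    try:
        # The "application" writes to its usual descriptor, unaware of the swap.
        os.write(scratch, b"hello from the restarted app\n")
        # Whoever holds the other end of the socket receives the output.
        return theirs.recv(1024)
    finally:
        os.dup2(saved, scratch)  # put the original resource back
        os.close(saved)
        os.close(scratch)
        ours.close()
        theirs.close()
```

For a real application the same call would be made with target_fd=1,
after restart and before the application is allowed to run again.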

Both methods are orthogonal to the c/r itself: linux-cr will allow
you to replace/modify resources if you so wish, and I suspect that
MTCP also can/will.

Interposing on library calls is possible with MTCP methods, or
using binary instrumentation, or PIN, or DynInst, or LD_PRELOAD.

The only two reasons to interpose on system calls are the ones I noted
in an earlier message (http://lkml.org/lkml/2010/11/5/262 - see
points "2)" and "3)" about userland-workarounds):

One - to virtualize in userspace resources (e.g. pids) that the
kernel already knows how to virtualize.

Two - to track state of resources during execution and lie about
their state when needed, because userspace can't cleanly save
and restore their state.
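
To make the first point concrete, here is a toy sketch of what
userspace pid virtualization amounts to (Python for brevity; PidTable,
virtual_getpid and virtual_kill are hypothetical names for this
example, not DMTCP's code). Every call that takes or returns a pid
has to pass through a translation table like this one, and a real
interposition layer must cover far more calls than the two shown:

```python
import os


class PidTable:
    """Hypothetical map between the pids an app remembers and the real ones."""

    def __init__(self):
        self._to_real = {}
        self._to_virt = {}

    def bind(self, virt: int, real: int) -> None:
        # Record that the app's pre-checkpoint pid now maps to a new real pid.
        self._to_real[virt] = real
        self._to_virt[real] = virt

    def to_real(self, virt: int) -> int:
        return self._to_real[virt]

    def to_virt(self, real: int) -> int:
        # Pids that were never virtualized pass through unchanged.
        return self._to_virt.get(real, real)


def virtual_getpid(table: PidTable) -> int:
    """Wrapper for getpid(): report the pid the app saw before the checkpoint."""
    return table.to_virt(os.getpid())


def virtual_kill(table: PidTable, virt_pid: int, sig: int) -> None:
    """Wrapper for kill(): translate the virtual pid before calling the kernel."""
    os.kill(table.to_real(virt_pid), sig)
```

After restart one would call table.bind(old_pid, os.getpid()) so that
the application keeps seeing old_pid. With pid namespaces the kernel
does this translation itself.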

Virtualization through interposition is extremely tricky in and
out of the kernel. The examples given throughout this thread (by
either side) expose the tip of the iceberg. Interposition as a
technique is full of security and other pitfalls, as discussed
by the extensive literature in the area (which I cited in another email).

So I'll repeat the question I asked there: is reimplementing
chunks of kernel functionality and all namespaces in userspace
the way to go ?

>
>>> But since you ask :-), there is one thing on our wish list. We
>>> handle address space randomization, vdso, vsyscall, and so on quite
>>> well. We do not turn off address space randomization (although on
>>> restart, we map user segments back to their original addresses).
>>> Probably the randomized value of brk (end-of-data or end of heap) is
>>> the thing that gave us the most troubles and that's where the code
>>> is the most hairy.
>>

[snip]

> The design of DMTCP was decided upon roughly during the period from Linux 2.6.3
> through Linux 2.6.18. At that time, /proc/*/net did not exist. You are right
> that this can provide much better design for DMTCP and eliminate some of our
> wrappers. Thanks very much for pointing this out. We are now eager to implement a
> new design based on /proc/*/net in the near future.
>
> Since /proc/*/net provides a simpler design for sockets, we started wondering
> what other simplifications may be possible. Here is one possibility: in the case
> of shared file descriptors, DMTCP goes through two barriers in order to decide
> which process will be responsible for checkpointing which shared file
> descriptor. It works and the overhead is reasonable, but if you have additional
> suggestions for this case, we would be very interested.

What is "reasonable" overhead ?
For which applications ?
What about a 'kernel make' ?
What about servers (db, web, etc) ?
What about VPSs/VDIs ?
Can we do better, including for HPC ?
...

>
>> I think this is why userland CR implementation makes much more sense.
>> Most of states visible to a userland process are rather rigidly
>> defined by standards and, ultimately, ABI and the kernel exports most
>> of that information to userland one way or the other. Given the
>> right set of needed features, most of which are probably already
>> implemented, a userland implementation should have access to most
>> information necessary to checkpoint without resorting to too messy
>> methods and then there inevitably needs to be some workarounds to make
>> CR'd processes behave properly w.r.t. other states on the system, so
>> userland workarounds are inevitable anyway unless it resorts to
>> preemptive separation using namespaces and containers, which I frankly
>> think isn't much of value already and more so going forward.
>
> It's a very good point and we agree completely. Here are some examples where we
> believe a userland component is inevitable even if one begins with in-kernel
> C/R:

Exactly ! Wrapping around apps to isolate them from the environment
is desirable, regardless of how you technically c/r the apps, when
you want to be able to c/r apps outside their native environment.

Generally, you can either include the environment in the checkpoint,
or provide wrappers to virtualize it after restart, or modify the app
so that it knows how to adapt to new environments after restart.

Either way, you need to technically c/r the app, no matter how much
userspace trickery you may choose to apply afterwards if needed. And
doing so in-kernel is more transparent (yes, transparent means that
it does not require LD_PRELOAD or collaboration of the application!
nor does it require userspace virtualizations of so many things
already provided by the kernel today), more generic, more flexible,
provides more guarantees, covers more types or states of resources,
and can perform significantly better.

And then, if you want to work with dmtcp's type of scenarios, you
could use the generic c/r and apply their wrappers on top of it !

[snip]

Thanks,

Oren.