Re: checkpoint/restart ABI

From: Oren Laadan
Date: Wed Aug 20 2008 - 17:59:39 EST




Dave Hansen wrote:
On Tue, 2008-08-12 at 09:32 -0700, Jeremy Fitzhardinge wrote:
Inter-machine networking stuff is hard because it's outside the checkpointed set, so the checkpoint is observable. Migration is easier, in principle, because you might be able to shift the connection endpoint without bringing it down. Dealing with networking within your checkpointed set is just fiddly, particularly remembering and restoring all the details of things like urgent messages, on-the-fly file descriptors, packet boundaries, etc.

All true. Hard stuff.

The IBM product works partly by limiting migrations to occurring on a
single physical ethernet network. Each container gets its own IP and
MAC address. The socket state is checkpointed quite fully and moved
along with the IP.

Unlinked files, for instance, are actually available in /proc. You can
freeze the app, write a helper that opens /proc/1234/fd, then copies its
contents to a linked file (ooooh, with splice!) Anyway, if we can do it
in userspace, we can surely do it in the kernel.

Sure, there's no inherent problem. But do you imagine including the file contents within your checkpoint image, or would they be saved separately?

Me, personally, I think I'd probably "re-link" the thing, mark it as
such, ship it across like a normal file, then unlink it after the
restore. I don't know what we'd choose when actually implementing it.

Re-linking works well when the file system supports it. Some do not
allow it, in which case you need to silently rename the file instead of
really unlinking it (even with NFS), or copy the entire contents.
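
To make the "copy the entire contents" fallback concrete, here is a
minimal userspace sketch of the /proc trick mentioned above. It is only
an illustration, not what an in-kernel checkpoint would do. The helper
names (rescue_unlinked_fd, demo) are made up for this example, the code
is Linux-specific, and it assumes the target process is frozen (or is
the caller itself) so the contents do not change during the copy.

```python
import os
import shutil
import tempfile


def rescue_unlinked_fd(pid, fd, dest_path):
    """Copy the data behind /proc/<pid>/fd/<fd> into a new, linked file.

    This works even after the original name has been unlinked, because
    the /proc entry still refers to the open inode. Linux-specific.
    """
    src_path = f"/proc/{pid}/fd/{fd}"
    with open(src_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest_path


def demo():
    """Create a file, unlink it while it is still open, then rescue it."""
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, "scratch")
    fd = os.open(original, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"still here")
        os.unlink(original)  # the name is gone, the inode is not
        rescued = rescue_unlinked_fd(
            os.getpid(), fd, os.path.join(workdir, "rescued")
        )
        with open(rescued, "rb") as f:
            return f.read()
    finally:
        os.close(fd)
        shutil.rmtree(workdir)
```

Calling demo() returns the bytes written before the unlink, read back
from the rescued copy. A restore step would then reopen the rescued
file and unlink it again, which is not shown here.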

Of course, you also need a snapshot of the file system in case it changes
after the checkpoint is taken, or take other measures. We can safely
defer this for now.


I'm not sure what you mean by "closed files". Either the app has a fd,
it doesn't, or it is in sys_open() somewhere. We have to get the app
into a quiescent state before we can checkpoint, so we basically just
say that we won't checkpoint things that are *in* the kernel.

It's common for an app to write a tmp file, close it, and then open it a bit later expecting to find the content it just wrote. If you checkpoint-kill it in the interim, reboot (clearing out /tmp) and then resume, then it will lose its tmp file. There's no explicit connection between the process and its potential working set of files.

I respectfully disagree. The number one prerequisite for
checkpoint/restart is isolation. Xen just happens to get this for free.
So, instead of saying that there's no explicit connection between the
process and its working set, ask yourself how we make a connection.

In this case, we can do it with a filesystem (mount) namespace. Each
container that we might want to checkpoint must have its writable
filesystems confined to a private set that is not shared with other
containers. Things like union mounts would help here, but aren't
necessarily required. They just make it more efficient.
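
The isolation rule above can be stated as a simple invariant: no two
containers may share a writable filesystem. The sketch below checks
that invariant over a plain dictionary. It says nothing about how mount
namespaces are actually set up; the function name, the container names
and the paths are all hypothetical.

```python
from itertools import combinations


def shared_writable_mounts(containers):
    """Return {(name_a, name_b): shared_paths} for each overlapping pair.

    `containers` maps a container name to the set of filesystem roots
    it can write to. An empty result means every writable set is
    private, which is the precondition described above.
    """
    conflicts = {}
    for name_a, name_b in combinations(sorted(containers), 2):
        shared = set(containers[name_a]) & set(containers[name_b])
        if shared:
            conflicts[(name_a, name_b)] = shared
    return conflicts
```

For example, if hypothetical containers "db" and "batch" both list
/srv/shared as writable, the pair is reported and neither container
would be safe to checkpoint on its own.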

We had to deal with it by setting a bunch of policy files to tell the checkpoint/restart system what filename patterns it had to look out for. But if you just checkpoint the whole filesystem state along with the process(es), then perhaps it isn't an issue.
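
As an illustration of the pattern-based policy described above, a
filter along these lines could pick out the files a checkpoint has to
carry along. The policy file format is not given in this thread, so the
patterns, paths and function name below are assumptions made up for the
example.

```python
import fnmatch


def files_to_save(paths, patterns):
    """Return the paths that match at least one policy pattern.

    Uses shell-style wildcards. Note that `*` in fnmatch also matches
    `/`, so "/tmp/*" covers files in subdirectories of /tmp as well.
    """
    return [
        path
        for path in paths
        if any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

With the hypothetical patterns "/tmp/*" and "/var/tmp/app-*", a file
such as /tmp/build.lock is selected while /home/u/notes.txt is not.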

Right. We just start with "everybody has their own disk" which is slow
and crappy and optimize it from there.

Yep.

[SNIP]

Oren.
