Re: [RFC v3][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart
From: Oren Laadan
Date: Thu Sep 04 2008 - 13:33:35 EST
Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx):
>> Create trivial sys_checkpoint and sys_restart system calls. They will
>> make it possible to checkpoint and restart an entire container, to and
>> from a checkpoint image file descriptor.
>>
>> The syscalls take a file descriptor (for the image file) and flags as
>> arguments. For sys_checkpoint the first argument identifies the target
>> container; for sys_restart it will identify the checkpoint image.
>>
>> Signed-off-by: Oren Laadan <orenl@xxxxxxxxxxxxxxx>
>> ---
[...]
>> +/**
>> + * sys_checkpoint - checkpoint a container
>> + * @pid: pid of the container init(1) process
>> + * @fd: file to which to dump the checkpoint image
>> + * @flags: checkpoint operation flags
>> + */
>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>> +{
>> + pr_debug("sys_checkpoint not implemented yet\n");
>> + return -ENOSYS;
>> +}
>> +/**
>> + * sys_restart - restart a container
>> + * @crid: checkpoint image identifier
>
> So can we compare your api to Andrey's?
>
> You've explained before that crid is used to tie together multiple
> calls to checkpoint, but why do you have to specify it for restart?
> Can't it just come from the fd? Or, the fd will be passed in
> seek()d to the right position for the data for this task, so the crid
> won't be available there?

I added the 'crid' to support a mode of operation in which the
checkpoint data remains in memory across multiple system calls.
Here are example scenarios:

1) We will want to reduce downtime by first buffering the checkpoint
image in memory, then resuming the container, and only then writing
the data back to the file descriptor.

So instead of:
  freeze -> checkpoint and write back -> unfreeze
We want:
  freeze -> checkpoint to buffer -> unfreeze -> write back

I envision each of these steps as a separate invocation of a syscall.
The 'crid' returned by sys_checkpoint() at the 2nd step will be
used to identify that data in the 4th step. (Note that between the
unfreeze and the write-back, another checkpoint may already have been
taken.)

2) A task may want to take a checkpoint (e.g. of itself, or a whole
container) and keep that checkpoint in memory; at a later time it may
want to revert to that checkpoint. Moreover, it may keep multiple such
checkpoints (to any of which it may want to return). 'crid' tells
sys_restart which one to use.
Note that this 'crid' will in fact be tied to resources that are kept
by the kernel - e.g. references to COW pages (when we add that).
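
To make the two scenarios concrete, here is a rough user-space model of
the bookkeeping, not kernel code. Everything in it is hypothetical
(MAX_CKPT, struct ckpt_slot, model_checkpoint(), model_restart(),
model_writeback()); it only illustrates how a numeric 'crid' can name
one of several images held in memory between separate calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CKPT 8  /* hypothetical limit on buffered checkpoints */

/* One buffered checkpoint image; its crid is (table index + 1). */
struct ckpt_slot {
    int in_use;
    char *data;
    size_t len;
};

static struct ckpt_slot ckpt_table[MAX_CKPT];

/* Map a crid to its slot, or NULL if the crid names nothing. */
static struct ckpt_slot *ckpt_lookup(int crid)
{
    if (crid < 1 || crid > MAX_CKPT || !ckpt_table[crid - 1].in_use)
        return NULL;
    return &ckpt_table[crid - 1];
}

/* "checkpoint to buffer": copy the state, return a new crid or -1. */
int model_checkpoint(const char *state, size_t len)
{
    assert(state != NULL);
    for (int i = 0; i < MAX_CKPT; i++) {
        if (!ckpt_table[i].in_use) {
            char *copy = malloc(len ? len : 1);
            if (!copy)
                return -1;
            memcpy(copy, state, len);
            ckpt_table[i].in_use = 1;
            ckpt_table[i].data = copy;
            ckpt_table[i].len = len;
            return i + 1;
        }
    }
    return -1;  /* table full */
}

/* "restart": copy the image named by crid back; the image is kept. */
long model_restart(int crid, char *state, size_t cap)
{
    struct ckpt_slot *s = ckpt_lookup(crid);

    if (!s || s->len > cap)
        return -1;
    memcpy(state, s->data, s->len);
    return (long)s->len;
}

/* "write back": copy the image out, then release the buffered copy. */
long model_writeback(int crid, char *out, size_t cap)
{
    struct ckpt_slot *s = ckpt_lookup(crid);

    if (!s || s->len > cap)
        return -1;
    memcpy(out, s->data, s->len);
    long n = (long)s->len;
    free(s->data);
    s->data = NULL;
    s->in_use = 0;
    return n;
}
```

In this model a caller can take two checkpoints, revert to the first
with model_restart(), and later flush the second with
model_writeback(); after the write-back that crid is no longer valid.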

Louis suggested using a specialized FD instead of a numeric 'crid'
(that is: create an anonymous inode and a struct file that represent
that checkpoint in the kernel, and return an FD to it). This approach
has pros and cons compared to 'crid' (see the archives of the
containers mailing list). For now I kept 'crid', but I'm definitely
open to changing it to an FD.

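
For comparison, a user-space analogy of the FD-based alternative might
look like the sketch below. It is not the anonymous-inode mechanism
itself: it uses an unlinked temporary file as a stand-in for a
kernel-held checkpoint, and the names ckpt_fd_create(), ckpt_fd_read()
and ckpt_fd_release() are made up for illustration. The point it tries
to show is that the handle's lifetime follows ordinary FD rules, so
releasing a checkpoint is just close().

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* for mkstemp() and pread() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical: buffer an image and return an FD that stands for it. */
int ckpt_fd_create(const void *image, size_t len)
{
    char path[] = "/tmp/ckpt-XXXXXX";
    int fd;

    assert(image != NULL);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* no name remains; only the FD keeps the data alive */
    if (write(fd, image, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read the image back from offset 0; the checkpoint is not consumed. */
long ckpt_fd_read(int fd, void *buf, size_t cap)
{
    ssize_t n = pread(fd, buf, cap, 0);

    return n < 0 ? -1 : (long)n;
}

/* Dropping the checkpoint is an ordinary close(). */
int ckpt_fd_release(int fd)
{
    return close(fd);
}
```

With this shape there is no separate id namespace to manage, and a
checkpoint whose owner exits without releasing it is cleaned up when
the FD is closed; the cost is that every buffered checkpoint occupies
a file descriptor slot.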
Oren.
>
> Andrey, how will the 'ctid' in your patchset be used? It sounds
> like it's actually going to set some integer id on the created
> container? We actually don't have container ids (or even
> containers) right now, so we probably don't want that in our api,
> right?
>
>> + * @fd: file from which to read the checkpoint image
>> + * @flags: restart operation flags
>> + */
>> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
>> +{
>> + pr_debug("sys_restart not implemented yet\n");
>> + return -ENOSYS;
>> +}
>> diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
>> index d739467..88bdec4 100644
>> --- a/include/asm-x86/unistd_32.h
>> +++ b/include/asm-x86/unistd_32.h
>> @@ -338,6 +338,8 @@
>> #define __NR_dup3 330
>> #define __NR_pipe2 331
>> #define __NR_inotify_init1 332
>> +#define __NR_checkpoint 333
>> +#define __NR_restart 334
>>
>> #ifdef __KERNEL__
>>
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index d6ff145..edc218b 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>> asmlinkage long sys_eventfd(unsigned int count);
>> asmlinkage long sys_eventfd2(unsigned int count, int flags);
>> asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
>> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
>>
>> int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
>>
>> diff --git a/init/Kconfig b/init/Kconfig
>> index c11da38..fd5f7bf 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -779,6 +779,8 @@ config MARKERS
>>
>> source "arch/Kconfig"
>>
>> +source "checkpoint/Kconfig"
>> +
>> config PROC_PAGE_MONITOR
>> default y
>> depends on PROC_FS && MMU
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 08d6e1b..ca95c25 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
>> cond_syscall(compat_sys_timerfd_gettime);
>> cond_syscall(sys_eventfd);
>> cond_syscall(sys_eventfd2);
>> +
>> +/* checkpoint/restart */
>> +cond_syscall(sys_checkpoint);
>> +cond_syscall(sys_restart);
>> --
>> 1.5.4.3