Re: [PATCH 00/17] RFC: userfault v2

From: zhanghailiang
Date: Sat Nov 01 2014 - 04:57:09 EST


On 2014/11/1 3:39, Peter Feiner wrote:
On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
Agreed, but for doing a live memory snapshot (the VM is running while the
snapshot is taken), we have to block the write action, because we have to save
the page before it is dirtied by the write. This is the difference compared to
pre-copy migration.

Ah ha, I understand the difference now. I suppose that you have considered
doing a traditional pre-copy migration (that is, passes over memory saving
dirty pages, followed by a pause and a final dump of remaining dirty pages) to
a file. Your approach has the advantage of having the VM pause time bounded by
the time it takes to handle the userfault and do the write, as opposed to
pre-copy migration which has a pause time bounded by the time it takes to do
the final dump of dirty pages, which, in the worst case, is the time it takes
to dump all of the guest memory!
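
For reference, the pre-copy loop described above might look like the
following sketch; every name in it (dirty_bitmap_sync, save_page,
pause_guest, resume_guest, DIRTY_THRESHOLD) is a hypothetical placeholder,
not anything from QEMU or this patch set:

#include <stddef.h>

/* Hypothetical helpers: dirty_bitmap_sync() returns the pages written
 * since its last call (initially: all of guest RAM); save_page()
 * appends one page to the output stream. */
#define DIRTY_THRESHOLD 64

extern size_t dirty_bitmap_sync(unsigned long **dirty);
extern void save_page(int fd, unsigned long pfn);
extern void pause_guest(void);
extern void resume_guest(void);

void precopy_to_file(int out_fd)
{
    unsigned long *dirty;
    size_t i, ndirty;

    /* Iterative passes: save dirty pages while the guest keeps
     * running; pages re-dirtied behind us are caught next pass. */
    while ((ndirty = dirty_bitmap_sync(&dirty)) > DIRTY_THRESHOLD)
        for (i = 0; i < ndirty; i++)
            save_page(out_fd, dirty[i]);

    /* Final dump: pause and flush the remainder. In the worst case
     * the remainder is all of guest memory, hence the unbounded
     * pause time noted above. */
    pause_guest();
    while ((ndirty = dirty_bitmap_sync(&dirty)) > 0)
        for (i = 0; i < ndirty; i++)
            save_page(out_fd, dirty[i]);
    resume_guest();
}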


Right! Strictly speaking, migrating a VM's state into a file (fd) is not a
snapshot, because its point in time is not fixed (it depends on when the
migration finishes). A VM snapshot's point in time should be fixed: it should
be the moment when I fire the snapshot command.
A snapshot is very much like taking a photo, capturing the VM's state at that
instant ;)

You could use the old fork & dump trick. Given that the guest's memory is
backed by a private VMA (as of a year ago when I last looked, that was always
the case for QEMU), you can have the kernel do the write protection for you.
Essentially, you fork Qemu and, in the child process, dump the guest memory
and then exit. If the parent (including the guest) writes to guest memory, it
will fault and the kernel will copy the page.
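
A minimal sketch of that fork & dump idea, where guest_ram and
guest_ram_size are hypothetical stand-ins for QEMU's private guest RAM
mapping:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for QEMU's private guest RAM mapping. */
extern char *guest_ram;
extern size_t guest_ram_size;

int snapshot_fork_dump(const char *path)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: its address space is a COW copy frozen at fork()
         * time, so this dump is a consistent point-in-time image. */
        FILE *f = fopen(path, "w");

        if (!f || fwrite(guest_ram, 1, guest_ram_size, f) != guest_ram_size)
            _exit(1);
        fclose(f);
        _exit(0);
    }

    /* Parent: the guest keeps running; a write to guest RAM takes a
     * COW fault and the kernel copies just the touched page. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}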


It is difficult to fork the qemu process, which is multi-threaded and holds
all kinds of locks. Actually, this scheme was discussed in the community a
long time ago; it was not accepted.

The fork & dump approach will give you the best performance w.r.t. guest pause
times (i.e., just pausing for the COW fault handler), but it does have the
distinct disadvantage of potentially using 2x the guest memory (i.e., if the
parent process races ahead and writes to all of the pages before you finish
the dump).

Agreed! This is the second reason why the community did not accept it.

To mitigate memory copying, you could madvise MADV_DONTNEED the child memory
as you copy it.
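
That mitigation would slot into the child's dump loop, e.g. (same
hypothetical guest_ram mapping as in the sketch above):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

extern char *guest_ram;          /* hypothetical, as in the sketch above */
extern size_t guest_ram_size;

/* Child-side dump loop: drop each page right after writing it so the
 * kernel never holds two copies of a page the parent has since
 * dirtied. */
static void dump_and_release(FILE *f)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t off;

    for (off = 0; off < guest_ram_size; off += page) {
        if (fwrite(guest_ram + off, 1, page, f) != page)
            _exit(1);
        madvise(guest_ram + off, page, MADV_DONTNEED);
    }
}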


IMHO, the scheme I mentioned in the previous email may be the simplest and
most efficient way, if only userfault could support wrprotect faults.
We could also do some optimization to reduce the impact on the VM while taking
the snapshot, such as caching the requested pages in a memory buffer, etc.
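
For concreteness, a minimal sketch of that scheme against the userfaultfd
write-protect API as it eventually landed in mainline (UFFDIO_WRITEPROTECT,
Linux 5.7); it did not exist at the time of this thread, and save_page() is
a hypothetical hook for the snapshot writer:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

extern void save_page(void *addr, size_t len);   /* hypothetical */

static int snapshot_wp_loop(char *area, size_t len, size_t page)
{
    long uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
    };
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,    /* arm write protection */
    };

    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) ||
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
        return -1;

    for (;;) {
        struct uffd_msg msg;

        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            return -1;
        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;

        /* Save the still-clean page, then lift the protection so the
         * blocked writer thread resumes. */
        unsigned long addr = msg.arg.pagefault.address & ~(page - 1);
        save_page((void *)addr, page);

        struct uffdio_writeprotect resolve = {
            .range = { .start = addr, .len = page },
            .mode  = 0,                          /* un-protect and wake */
        };
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &resolve))
            return -1;
    }
}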

Great! Do you plan to post your patches to the community? I mean, is your work
based on qemu, or an independent tool (CRIU migration?) for live migration?
Maybe I could fix the migration problem for ivshmem in qemu now, based on the
softdirty mechanism.

I absolutely plan on releasing these patches :-) CRIU was the first open-source
userland I had planned on integrating with. At Google, I'm working with our
home-grown Qemu replacement. However, I'd be happy to help with an effort to
get softdirty integrated in Qemu in the future.


Great;)

Documentation/vm/soft-dirty.txt and pagemap.txt, in case you aren't familiar.

I have read them cursorily; they are indeed useful for pre-copy, but it seems
they cannot meet my needs for a snapshot.
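
For reference, the basic soft-dirty round trip those documents describe is
small: write "4" to /proc/pid/clear_refs to start a tracking round, then
test bit 55 of the page's /proc/pid/pagemap entry. A sketch:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGEMAP_SOFT_DIRTY (1ULL << 55)  /* bit 55, per pagemap.txt */

/* Start a new tracking round: clears every soft-dirty bit and
 * write-protects the PTEs so the next write sets the bit again. */
static int clear_soft_dirty(pid_t pid)
{
    char path[64];
    int fd;
    ssize_t n;

    snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    n = write(fd, "4", 1);   /* "4" selects the soft-dirty bits */
    close(fd);
    return n == 1 ? 0 : -1;
}

/* Was the page containing addr written since the last round? */
static int is_soft_dirty(pid_t pid, void *addr)
{
    char path[64];
    uint64_t entry;
    long page = sysconf(_SC_PAGESIZE);
    int fd, rc;

    snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = pread(fd, &entry, sizeof(entry),
               ((uintptr_t)addr / page) * sizeof(entry)) == sizeof(entry)
             ? !!(entry & PAGEMAP_SOFT_DIRTY) : -1;
    close(fd);
    return rc;
}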

To make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write-protect the page.

How can I find the API? Has it been merged into the kernel's master branch
already?

Negative. I'll be sure to CC you when I start sending this stuff upstream.



OK, I look forward to it:)

