Re: [PATCH v13 00/14] KVM: Dirty ring interface

From: Paolo Bonzini
Date: Fri Nov 06 2020 - 06:28:29 EST


On 01/10/20 03:20, Peter Xu wrote:
KVM branch:
https://github.com/xzpeter/linux/tree/kvm-dirty-ring

QEMU branch for testing:
https://github.com/xzpeter/qemu/tree/kvm-dirty-ring

v13:
- rebase to kvm/queue rather than 5.9-rc7. I think, kvm/queue is broken. I
can only test the dirty ring after I revert 3eb900173c71 ("KVM: x86: VMX:
Prevent MSR passthrough when MSR access is denied", 2020-09-28), otherwise
the guest will hang on vcpu0 looping forever during boot Linux.
- added another trivial patch "KVM: Documentation: Update entry for
KVM_X86_SET_MSR_FILTER". It should be squashed into 1a155254ff93 ("KVM: x86:
Introduce MSR filtering", 2020-09-28) directly.

v12:
- add r-bs for Sean
- rebase

v11:
- rebased to kvm/queue (seems the newest)
- removed kvm_dirty_ring_waitqueue() tracepoint since not used
- set memslot->as_id in kvm_delete_memslot() [Sean]
- let __copy_to_user() always return -EFAULT [Sean]
- rename 'r' in alloc_apic_access_page into 'hva' [Sean]

v10:
- remove unused identity_map_pfn in init_rmode_identity_map [syzbot]
- add "static" to kvm_dirty_ring_full [syzbot]
- kvm_page_in_dirty_ring() use "#if" macros for KVM_DIRTY_LOG_PAGE_OFFSET to
quiesce syzbot [syzbot]
- s/false/null/ in gfn_to_memslot_dirty_bitmap() [syzbot]

v9:
- patch 3: __x86_set_memory_region: squash another trivial change to return
(0xdeadull << 48) always for slot removal [Sean]
- pick r-bs for Drew

For previous versions, please refer to:

V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@xxxxxxxxxx
V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@xxxxxxxxxx
V3: https://lore.kernel.org/kvm/20200109145729.32898-1-peterx@xxxxxxxxxx
V4: https://lore.kernel.org/kvm/20200205025105.367213-1-peterx@xxxxxxxxxx
V5: https://lore.kernel.org/kvm/20200304174947.69595-1-peterx@xxxxxxxxxx
V6: https://lore.kernel.org/kvm/20200309214424.330363-1-peterx@xxxxxxxxxx
V7: https://lore.kernel.org/kvm/20200318163720.93929-1-peterx@xxxxxxxxxx
V8: https://lore.kernel.org/kvm/20200331190000.659614-1-peterx@xxxxxxxxxx
V9: https://lore.kernel.org/kvm/20200523225659.1027044-1-peterx@xxxxxxxxxx
V10: https://lore.kernel.org/kvm/20200601115957.1581250-1-peterx@xxxxxxxxxx/

Overview
============

This is a continued work from Lei Cao <lei.cao@xxxxxxxxxxx> and Paolo
Bonzini on the KVM dirty ring interface.

The new dirty ring interface is another way to collect dirty pages for
the virtual machines. It is different from the existing dirty logging
interface in a few ways, majorly:

- Data format: The dirty data was in a ring format rather than a
bitmap format, so dirty bits to sync for dirty logging does not
depend on the size of guest memory any more, but speed of
dirtying. Also, the dirty ring is per-vcpu, while the dirty
bitmap is per-vm.

- Data copy: The sync of dirty pages does not need data copy any more,
but instead the ring is shared between the userspace and kernel by
page sharings (mmap() on vcpu fd)

- Interface: Instead of using the old KVM_GET_DIRTY_LOG,
KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses the new
KVM_RESET_DIRTY_RINGS ioctl when we want to reset the collected
dirty pages to protected mode again (works like
KVM_CLEAR_DIRTY_LOG, but ring based). To collecting dirty bits,
we only need to read the ring data, no ioctl is needed.

Ring Layout
===========

KVM dirty ring is per-vcpu. Each ring is an array of kvm_dirty_gfn
defined as:

struct kvm_dirty_gfn {
__u32 flags;
__u32 slot; /* as_id | slot_id */
__u64 offset;
};

Each GFN is a state machine itself. The state is embeded in the flags
field, as defined in the uapi header:

/*
* KVM dirty GFN flags, defined as:
*
* |---------------+---------------+--------------|
* | bit 1 (reset) | bit 0 (dirty) | Status |
* |---------------+---------------+--------------|
* | 0 | 0 | Invalid GFN |
* | 0 | 1 | Dirty GFN |
* | 1 | X | GFN to reset |
* |---------------+---------------+--------------|
*
* Lifecycle of a dirty GFN goes like:
*
* dirtied collected reset
* 00 -----------> 01 -------------> 1X -------+
* ^ |
* | |
* +------------------------------------------+
*
* The userspace program is only responsible for the 01->1X state
* conversion (to collect dirty bits). Also, it must not skip any
* dirty bits so that dirty bits are always collected in sequence.
*/

Testing
=======

This series provided both the implementation of the KVM dirty ring and
the test case. Also I've implemented the QEMU counterpart that can
run with the new KVM, link can be found at the top of the cover
letter. However that's still a very initial version which is prone to
change and future optimizations.

I did some measurement with the new method with 24G guest running some
dirty workload, I don't see any speedup so far, even in some heavy
dirty load it'll be slower (e.g., when 800MB/s random dirty rate, kvm
dirty ring takes average of ~73s to complete migration while dirty
logging only needs average of ~55s). However that's understandable
because 24G guest means only 1M dirty bitmap, that's still a suitable
case for dirty logging. Meanwhile heavier workload means worst case
for dirty ring.

More tests are welcomed if there's bigger host/guest, especially on
COLO-like workload.

Please review, thanks.

Peter Xu (14):
KVM: Documentation: Update entry for KVM_X86_SET_MSR_FILTER
KVM: Cache as_id in kvm_memory_slot
KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
KVM: X86: Implement ring-based dirty memory tracking
KVM: Make dirty ring exclusive to dirty bitmap log
KVM: Don't allocate dirty bitmap if dirty ring is enabled
KVM: selftests: Always clear dirty bitmap after iteration
KVM: selftests: Sync uapi/linux/kvm.h to tools/
KVM: selftests: Use a single binary for dirty/clear log test
KVM: selftests: Introduce after_vcpu_run hook for dirty log test
KVM: selftests: Add dirty ring buffer test
KVM: selftests: Let dirty_log_test async for dirty ring test
KVM: selftests: Add "-c" parameter to dirty log test

Documentation/virt/kvm/api.rst | 126 ++++-
arch/x86/include/asm/kvm_host.h | 6 +-
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/mmu/mmu.c | 10 +-
arch/x86/kvm/svm/avic.c | 9 +-
arch/x86/kvm/vmx/vmx.c | 96 ++--
arch/x86/kvm/x86.c | 46 +-
include/linux/kvm_dirty_ring.h | 103 ++++
include/linux/kvm_host.h | 19 +
include/trace/events/kvm.h | 63 +++
include/uapi/linux/kvm.h | 53 ++
tools/include/uapi/linux/kvm.h | 77 ++-
tools/testing/selftests/kvm/Makefile | 2 -
.../selftests/kvm/clear_dirty_log_test.c | 6 -
tools/testing/selftests/kvm/dirty_log_test.c | 505 ++++++++++++++++--
.../testing/selftests/kvm/include/kvm_util.h | 4 +
tools/testing/selftests/kvm/lib/kvm_util.c | 72 ++-
.../selftests/kvm/lib/kvm_util_internal.h | 4 +
virt/kvm/dirty_ring.c | 197 +++++++
virt/kvm/kvm_main.c | 168 +++++-
21 files changed, 1432 insertions(+), 138 deletions(-)
create mode 100644 include/linux/kvm_dirty_ring.h
delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
create mode 100644 virt/kvm/dirty_ring.c


Queued, thanks! Now on to QEMU...

Paolo