Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking

From: Alex Williamson
Date: Thu Jan 09 2020 - 11:56:43 EST


On Thu, 9 Jan 2020 11:29:28 -0500
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

> On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@xxxxxxxxxxx> and Paolo Bonzini <pbonzini@xxxxxxxxxx>. [1]
> >
> > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information. The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another. However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> >
> > A similar issue exists for live migration when the guest memory is
> > huge but the dirtying rate is low. In that case, for each dirty sync
> > we need to pull the whole dirty bitmap to userspace and analyse every
> > bit even though it is mostly zeros.
> >
> > The preferred data structure for the above scenarios is a dense list of
> > guest frame numbers (GFNs).
>
> No longer; this uses an array of structs.
>
> > This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> >
> > This patch enables the dirty ring for x86 only. However, it should be
> > easy to extend to other archs as well.
> >
> > [1] https://patchwork.kernel.org/patch/10471409/
> >
> > Signed-off-by: Lei Cao <lei.cao@xxxxxxxxxxx>
> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > ---
> > Documentation/virt/kvm/api.txt | 89 ++++++++++++++++++
> > arch/x86/include/asm/kvm_host.h | 3 +
> > arch/x86/include/uapi/asm/kvm.h | 1 +
> > arch/x86/kvm/Makefile | 3 +-
> > arch/x86/kvm/mmu/mmu.c | 6 ++
> > arch/x86/kvm/vmx/vmx.c | 7 ++
> > arch/x86/kvm/x86.c | 9 ++
> > include/linux/kvm_dirty_ring.h | 55 +++++++++++
> > include/linux/kvm_host.h | 26 +++++
> > include/trace/events/kvm.h | 78 +++++++++++++++
> > include/uapi/linux/kvm.h | 33 +++++++
> > virt/kvm/dirty_ring.c | 162 ++++++++++++++++++++++++++++++++
> > virt/kvm/kvm_main.c | 137 ++++++++++++++++++++++++++-
> > 13 files changed, 606 insertions(+), 3 deletions(-)
> > create mode 100644 include/linux/kvm_dirty_ring.h
> > create mode 100644 virt/kvm/dirty_ring.c
> >
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index ebb37b34dcfc..708c3e0f7eae 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> > It is thus encouraged to use the vm ioctl to query for capabilities (available
> > with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >
> > +
> > 4.5 KVM_GET_VCPU_MMAP_SIZE
> >
> > Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> > memory region. This ioctl returns the size of that region. See the
> > KVM_RUN documentation for details.
> >
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > + KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > + this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > + KVM_CAP_COALESCED_MMIO is not documented yet.
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > + KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> > + KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
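
As a rough illustration of the layout described above, userspace would
presumably mmap the ring through the vcpu fd at the documented offset,
roughly along these lines. This is only a sketch: ring_bytes and the
protection flags are assumptions here, with the ring size expected to
follow from the args[0] value used when enabling KVM_CAP_DIRTY_LOG_RING.

	#include <sys/mman.h>
	#include <unistd.h>
	#include <linux/kvm.h>

	/* Sketch only: map the per-vcpu dirty ring pages through the
	 * vcpu fd.  PROT_READ | PROT_WRITE is an assumption. */
	static void *map_dirty_ring(int vcpu_fd, size_t ring_bytes)
	{
		long page_size = sysconf(_SC_PAGESIZE);

		return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
			    MAP_SHARED, vcpu_fd,
			    KVM_DIRTY_LOG_PAGE_OFFSET * page_size);
	}
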
> >
> > 4.6 KVM_SET_MEMORY_REGION
> >
> > @@ -5376,6 +5389,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> > AArch64, this value will be reported in the ISS field of ESR_ELx.
> >
> > See KVM_CAP_VCPU_EVENTS for more details.
> > +
> > 8.20 KVM_CAP_HYPERV_SEND_IPI
> >
> > Architectures: x86
> > @@ -5383,6 +5397,7 @@ Architectures: x86
> > This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> > hypercalls:
> > HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> > 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >
> > Architecture: x86
> > @@ -5396,3 +5411,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> > flush hypercalls by Hyper-V) so userspace should disable KVM identification
> > in CPUID and only exposes Hyper-V identification. In this case, guest
> > thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu.
> > +
> > +Internally, one dirty ring is defined as follows:
> > +
> > +struct kvm_dirty_ring {
> > + u32 dirty_index;
> > + u32 reset_index;
> > + u32 size;
> > + u32 soft_limit;
> > + struct kvm_dirty_gfn *dirty_gfns;
> > + struct kvm_dirty_ring_indices *indices;
> > + int index;
> > +};
> > +
> > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > +Each dirty entry is defined as:
> > +
> > +struct kvm_dirty_gfn {
> > + __u32 pad;
>
> How about sticking a length here?
> This way huge pages can be dirtied in one go.
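
For illustration, replacing the pad with a length could look something
like the sketch below. This is only a sketch of the suggestion; the
fields other than the one quoted above are hypothetical and not what
the posted patch defines.

	struct kvm_dirty_gfn {
		__u32 len;	/* hypothetical: number of pages in the dirty range */
		__u32 slot;	/* hypothetical: memslot of the range */
		__u64 offset;	/* hypothetical: first dirty GFN within the slot */
	};
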

Not just huge pages, but any contiguous range of dirty pages could be
reported far more concisely. Thanks,

Alex