Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

From: Avi Kivity
Date: Tue Feb 28 2012 - 06:47:09 EST


On 02/23/2012 01:35 PM, Takuya Yoshikawa wrote:
> We have seen some problems of the current implementation of
> get_dirty_log() which uses synchronize_srcu_expedited() for updating
> dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
> order of latency when we use VGA displays.
>
> Furthermore the recent discussion on the following thread
> "srcu: Implement call_srcu()"
> http://lkml.org/lkml/2012/1/31/211
> also motivated us to implement get_dirty_log() without SRCU.
>
> This patch achieves this goal without sacrificing the performance of
> both VGA and live migration: in practice the new code is much faster
> than the old one unless we have too many dirty pages.
>
> Implementation:
>
> The key part of the implementation is the use of xchg() operation for
> clearing dirty bits atomically. Since this allows us to update only
> BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> until every dirty bit is cleared again for the next call.

What about using cmpxchg16b? That should reduce locked ops by a factor
of 2 (but note it needs 16 bytes alignment).

>
> Although some people may worry about the problem of using the atomic
> memory instruction many times to the concurrently accessible bitmap,
> it is usually accessed with mmu_lock held and we rarely see concurrent
> accesses: so what we need to care about is the pure xchg() overheads.
>
> Another point to note is that we do not use for_each_set_bit() to check
> which ones in each BITS_PER_LONG pages are actually dirty. Instead we
> simply use __ffs() and __fls() and pass the range in between the two
> positions found by them to kvm_mmu_write_protect_pt_range().

This seems artificial.

> Even though the passed range may include clean pages, it is much faster
> than repeatedly call find_next_bit() due to the locality of dirty pages.

Perhaps this is due to the implementation of find_next_bit()? would
using bsf improve things?

> Performance:
>
> The dirty-log-perf unit test showed nice improvement, some times faster
> than before, when the number of dirty pages was below 8K. For other
> cases we saw a bit of regression but still enough fast compared to the
> processing of these dirty pages in the userspace.
>
> For real workloads, both VGA and live migration, we have observed pure
> improvement: when the guest was reading a file, we originally saw a few
> ms of latency, but with the new method the latency was 50us to 300us.
>
>
> /**
> - * write_protect_slot - write protect a slot for dirty logging
> - * @kvm: the kvm instance
> - * @memslot: the slot we protect
> - * @dirty_bitmap: the bitmap indicating which pages are dirty
> - * @nr_dirty_pages: the number of dirty pages
> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
> + * @kvm: kvm instance
> + * @log: slot id and address to which we copy the log
> *
> - * We have two ways to find all sptes to protect:
> - * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
> - * checks ones that have a spte mapping a page in the slot.
> - * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
> + * We need to keep it in mind that VCPU threads can write to the bitmap
> + * concurrently. So, to avoid losing data, we keep the following order for
> + * each bit:
> *
> - * Generally speaking, if there are not so many dirty pages compared to the
> - * number of shadow pages, we should use the latter.
> + * 1. Take a snapshot of the bit and clear it if needed.
> + * 2. Write protect the corresponding page.
> + * 3. Flush TLB's if needed.
> + * 4. Copy the snapshot to the userspace.
> *
> - * Note that letting others write into a page marked dirty in the old bitmap
> - * by using the remaining tlb entry is not a problem. That page will become
> - * write protected again when we flush the tlb and then be reported dirty to
> - * the user space by copying the old bitmap.
> + * Between 2 and 3, the guest may write to the page using the remaining TLB
> + * entry. This is not a problem because the page will be reported dirty at
> + * step 4 using the snapshot taken before and step 3 ensures that successive
> + * writes will be logged for the next call.
> */
> -static void write_protect_slot(struct kvm *kvm,
> - struct kvm_memory_slot *memslot,
> - unsigned long *dirty_bitmap,
> - unsigned long nr_dirty_pages)
> -{
> - spin_lock(&kvm->mmu_lock);
> -
> - /* Not many dirty pages compared to # of shadow pages. */
> - if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
> - gfn_t offset;
> -
> - for_each_set_bit(offset, dirty_bitmap, memslot->npages)
> - kvm_mmu_write_protect_pt_range(kvm, memslot, offset, offset);
> -
> - kvm_flush_remote_tlbs(kvm);
> - } else
> - kvm_mmu_slot_remove_write_access(kvm, memslot->id);
> -
> - spin_unlock(&kvm->mmu_lock);
> -}
> -
> -/*
> - * Get (and clear) the dirty memory log for a memory slot.
> - */
> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> - struct kvm_dirty_log *log)
> +int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
> {
> int r;
> struct kvm_memory_slot *memslot;
> - unsigned long n, nr_dirty_pages;
> + unsigned long n, i;
> + unsigned long *dirty_bitmap;
> + unsigned long *dirty_bitmap_buffer;
> + bool is_dirty = false;
>
> mutex_lock(&kvm->slots_lock);
>
> @@ -3098,49 +3075,41 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> goto out;
>
> memslot = id_to_memslot(kvm->memslots, log->slot);
> + dirty_bitmap = memslot->dirty_bitmap;
> r = -ENOENT;
> - if (!memslot->dirty_bitmap)
> + if (!dirty_bitmap)
> goto out;
>
> n = kvm_dirty_bitmap_bytes(memslot);
> - nr_dirty_pages = memslot->nr_dirty_pages;
> + dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> + memset(dirty_bitmap_buffer, 0, n);
>
> - /* If nothing is dirty, don't bother messing with page tables. */
> - if (nr_dirty_pages) {
> - struct kvm_memslots *slots, *old_slots;
> - unsigned long *dirty_bitmap, *dirty_bitmap_head;
> + spin_lock(&kvm->mmu_lock);
>
> - dirty_bitmap = memslot->dirty_bitmap;
> - dirty_bitmap_head = memslot->dirty_bitmap_head;
> - if (dirty_bitmap == dirty_bitmap_head)
> - dirty_bitmap_head += n / sizeof(long);
> - memset(dirty_bitmap_head, 0, n);
> + for (i = 0; i < n / sizeof(long); i++) {
> + unsigned long bits;
> + gfn_t start, end;
>
> - r = -ENOMEM;
> - slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
> - if (!slots)
> - goto out;
> -
> - memslot = id_to_memslot(slots, log->slot);
> - memslot->nr_dirty_pages = 0;
> - memslot->dirty_bitmap = dirty_bitmap_head;
> - update_memslots(slots, NULL);
> + if (!dirty_bitmap[i])
> + continue;
>
> - old_slots = kvm->memslots;
> - rcu_assign_pointer(kvm->memslots, slots);
> - synchronize_srcu_expedited(&kvm->srcu);
> - kfree(old_slots);
> + is_dirty = true;
> + bits = xchg(&dirty_bitmap[i], 0);
> + dirty_bitmap_buffer[i] = bits;
>
> - write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
> + start = i * BITS_PER_LONG + __ffs(bits);
> + end = i * BITS_PER_LONG + __fls(bits);
>
> - r = -EFAULT;
> - if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
> - goto out;
> - } else {
> - r = -EFAULT;
> - if (clear_user(log->dirty_bitmap, n))
> - goto out;
> + kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);

If indeed the problem is find_next_bit(), then we could hanve
kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
a parameter. This would allow covering just this function with the
spinlock, not the xchg loop.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/