Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking

From: Jason Wang
Date: Tue Dec 24 2019 - 22:23:54 EST



On 2019/12/24 11:08 PM, Peter Xu wrote:
On Tue, Dec 24, 2019 at 02:16:04PM +0800, Jason Wang wrote:
+struct kvm_dirty_ring {
+ u32 dirty_index;

Does this always equal indices->avail_index?
Yes, but here we keep dirty_index as the internal copy, so we never
need to worry about illegal userspace writes to avail_index (the
kernel only ever writes it, never reads it back).


I get you. But I'm not sure it's worth the bother. We hit a similar issue in virtio: used_idx is not expected to be written by userspace, and we simply add checks.

But anyway, I'm fine if you want to keep it (maybe with a comment to explain).
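For reference, a minimal sketch of the two approaches being weighed here (illustrative snippets, not verbatim from the patch or from virtio):

	/*
	 * This patch: dirty_index is kernel-private, so userspace
	 * scribbling on indices->avail_index is harmless; the kernel
	 * only ever writes avail_index, never reads it back.
	 */
	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

	/*
	 * virtio-style alternative: read the shared index, but
	 * range-check it on every use since userspace may have
	 * corrupted it.
	 */
	u32 avail = READ_ONCE(indices->avail_index);
	if (avail - ring->reset_index > ring->size)
		return -EINVAL; /* bogus index written by userspace */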




+ u32 reset_index;
+ u32 size;
+ u32 soft_limit;
+ struct kvm_dirty_gfn *dirty_gfns;
+ struct kvm_dirty_ring_indices *indices;

Any reason to keep the dirty gfns and indices in different places? I guess it is
because you want to map dirty_gfns as a read-only page, but I couldn't find such
code...
That's a good point! We should actually map the dirty gfns as
read-only. I've added the check, something like this:

static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct kvm_vcpu *vcpu = file->private_data;
	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

	/* Refuse to map any page within the dirty ring as writable */
	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
	    vma->vm_flags & VM_WRITE)
		return -EINVAL;

	vma->vm_ops = &kvm_vcpu_vm_ops;
	return 0;
}

I also changed the test code to cover this case.
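For the curious, a rough sketch of what the added test coverage might look like; vcpu_fd, ring_bytes and the use of KVM_DIRTY_LOG_PAGE_OFFSET here are my assumptions, not code quoted from the series:

	/* A writable mapping of the dirty ring must be rejected... */
	addr = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
		    vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * getpagesize());
	TEST_ASSERT(addr == MAP_FAILED, "writable dirty ring mmap must fail");

	/* ...while a read-only mapping succeeds. */
	addr = mmap(NULL, ring_bytes, PROT_READ, MAP_SHARED,
		    vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * getpagesize());
	TEST_ASSERT(addr != MAP_FAILED, "read-only dirty ring mmap must work");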

[...]


Looks good.



+struct kvm_dirty_ring_indices {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */

Would it be better to make those two cacheline aligned?
Yes, Paolo should have mentioned that but I must have missed it! I
hope I didn't miss anything else.
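A sketch of one way to do it; since this is a uapi struct, explicit padding (assuming 64-byte cache lines) may be preferable to kernel-only attributes like ____cacheline_aligned. The layout below is my guess, not the eventual patch:

struct kvm_dirty_ring_indices {
	__u32 avail_index;	/* set by kernel */
	__u32 padding1[15];	/* pad out to a separate 64-byte line */
	__u32 fetch_index;	/* set by userspace */
	__u32 padding2[15];
};

Keeping the kernel-written and userspace-written indices on separate cache lines avoids false sharing: otherwise each side's stores keep invalidating the other side's cache line.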

[...]

+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	u32 fetch;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+	struct kvm_dirty_ring_indices *indices = ring->indices;
+	bool first_round = true;
+
+	fetch = READ_ONCE(indices->fetch_index);
+
+	/*
+	 * Note that fetch_index is written by userspace and must not
+	 * be trusted.  If it runs ahead of reset_index by more than
+	 * the ring size, userspace has written a bogus fetch_index.
+	 */
+	if (fetch - ring->reset_index > ring->size)
+		return -EINVAL;
+
+	if (fetch == ring->reset_index)
+		return 0;
+
+	/* This is only needed to make compilers happy */
+	cur_slot = cur_offset = mask = 0;
+	while (ring->reset_index != fetch) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (!first_round && next_slot == cur_slot) {

Initialize cur_slot to -1, then we can drop first_round here?
cur_slot is unsigned. We could force cur_slot to be s64, but maybe we
can also simply keep first_round, which should be clear from its name.

[...]


Sure.
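For completeness, the declined alternative would look roughly like this (an illustrative fragment, not code from the series; the rest of the loop body is unchanged):

	/* widen cur_slot so -1 can act as the "no previous slot" sentinel */
	s64 cur_slot = -1;
	u64 cur_offset = 0;
	unsigned long mask = 0;

	while (ring->reset_index != fetch) {
		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
		next_slot = READ_ONCE(entry->slot);
		/* a valid u32 slot never compares equal to -1, so no first_round */
		if (next_slot == cur_slot) {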



+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+{
+	struct kvm_dirty_gfn *entry;
+	struct kvm_dirty_ring_indices *indices = ring->indices;
+
+	/*
+	 * Note: without a vcpu context we refuse the push as soon as
+	 * the ring is soft full, because we can't risk making it
+	 * completely full: vcpu0 could use it right after us, and if
+	 * the vcpu0 context gets it completely full it could deadlock
+	 * waiting with mmu_lock held.
+	 */
+	if (kvm_get_running_vcpu() == NULL &&
+	    kvm_dirty_ring_soft_full(ring))
+		return -EBUSY;
+
+	/* It can never get completely full with a vcpu context */
+	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+	entry->slot = slot;
+	entry->offset = offset;
+	smp_wmb();

Better to add a comment to explain this barrier, e.g. what it pairs with.
Will do.
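For instance, the pairing could be documented right above the barrier, with the matching read side living in the userspace harvester (the userspace snippet is illustrative, not from the series):

	/*
	 * Publish the dirty gfn entry before advertising it via
	 * avail_index; pairs with the read/acquire barrier the
	 * userspace harvester issues between loading avail_index
	 * and reading the entries themselves.
	 */
	smp_wmb();

Userspace side, illustratively:

	u32 avail = __atomic_load_n(&indices->avail_index, __ATOMIC_ACQUIRE);

	while (fetch != avail) {
		struct kvm_dirty_gfn *e = &dirty_gfns[fetch & (size - 1)];

		/* the acquire above orders these reads after avail_index */
		collect_dirty_page(e->slot, e->offset);
		fetch++;
	}
	/* release: entries are consumed before fetch_index is published */
	__atomic_store_n(&indices->fetch_index, fetch, __ATOMIC_RELEASE);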


+	ring->dirty_index++;
+	WRITE_ONCE(indices->avail_index, ring->dirty_index);

Is WRITE_ONCE() a must here?
I think not, but it seems clearer that we're publishing something
explicitly to userspace. Since you asked, I'm actually curious whether
immediate memory writes like this could start to affect perf, from any
of your previous perf work?


I never measured the impact of a specific WRITE_ONCE(). But we don't do this in virtio/vhost. Maybe the maintainers can give more comments on this.
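For context (general kernel semantics, not specific to this patch): the point of WRITE_ONCE() on a shared page is mostly to keep the compiler from tearing, fusing, or re-issuing the store; relative to a plain store the runtime cost is normally nil:

	/* plain store: the compiler may tear it or merge it with others */
	indices->avail_index = ring->dirty_index;

	/* WRITE_ONCE(): one volatile, untorn store, emitted exactly here */
	WRITE_ONCE(indices->avail_index, ring->dirty_index);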

Thanks



Thanks,