Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

From: Peter Xu
Date: Sun Dec 15 2019 - 12:33:10 EST


On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >>> What depends on what here? Looks suspicious ...
> >>
> >> Hmm, I think maybe it can be removed because the entry pointer
> >> reference below should be an ordering constraint already?
>
> entry->xxx depends on ring->reset_index.

Yes that's true, but...

entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
/* barrier? */
next_slot = READ_ONCE(entry->slot);
next_offset = READ_ONCE(entry->offset);

... I think entry->xxx depends on entry first, and entry in turn
depends on reset_index, so it seems fine because everything is
ordered by that dependency chain?
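
To spell it out, if we did not want to rely on the address dependency
alone, a minimal alternative (my own sketch, not the patch code) would
be to make the first load an acquire:

/*
 * Sketch only: an acquire load of reset_index would make the
 * ordering against the entry->xxx reads explicit, instead of
 * relying on the address dependency.
 */
u32 index = smp_load_acquire(&ring->reset_index);

entry = &ring->dirty_gfns[index & (ring->size - 1)];
next_slot = READ_ONCE(entry->slot);
next_offset = READ_ONCE(entry->offset);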

>
> >>> what's the story around locking here? Why is it safe
> >>> not to take the lock sometimes?
> >>
> >> kvm_dirty_ring_push() will be with lock==true only when the per-vm
> >> ring is used. For per-vcpu ring, because that will only happen with
> >> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
> >> is called with lock==false).
>
> FWIW this will be done much more nicely in v2.
>
> >>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>>> +	if (!page) {
> >>>> +		r = -ENOMEM;
> >>>> +		goto out_err_alloc_page;
> >>>> +	}
> >>>> +	kvm->vm_run = page_address(page);
> >>>
> >>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> >>> still. What is wrong with just a pointer and calling put_user?
> >>
> >> I want to make it the start point for sharing fields between
> >> user/kernel per-vm. Just like kvm_run for per-vcpu.
>
> This page is actually not needed at all. Userspace can just map at
> KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there. You can drop
> kvm_vm_run completely.

I changed it because otherwise the indices take up the padding of one
entry, and the padding of all the other entries becomes wasted memory:
we can never turn the padding into new fields, since in the 1st entry
it would overlap with the indices. IMHO that could even waste more
than 4K.

(For now we only "waste" 4K for the per-vm case; kvm_run is already
mapped so there is no waste on the per-vcpu side. Not to mention that
I still think we can potentially use kvm_vm_run for more things in the
future.)
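
To illustrate the overlap I am worried about, here is a rough sketch
of the two structures involved (names and fields are my approximation
of the RFC uapi, not necessarily exact):

/* What userspace would find at KVM_DIRTY_LOG_PAGE_OFFSET if the
 * indices shared the page with the ring (sketch only): */
struct kvm_dirty_ring_indexes {
	__u32 avail_index;		/* written by the kernel */
	__u32 fetch_index;		/* written by userspace */
};

struct kvm_dirty_gfn {
	__u32 pad;	/* in dirty_gfns[0] this area would overlap with
			 * the indices, so no entry can ever grow a real
			 * field here without special-casing entry 0 */
	__u32 slot;
	__u64 offset;
};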

>
> >>>> +	} else {
> >>>> +		/*
> >>>> +		 * Put onto per vm ring because no vcpu context. Kick
> >>>> +		 * vcpu0 if ring is full.
> >>>
> >>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> >>> critical tasks there, they will be penalized disproportionally?
> >>
> >> Reasonable question. So far we can't avoid it because vcpu exit is
> >> the event mechanism to say "hey please collect dirty bits". Maybe
> >> someway is better than this, but I'll need to rethink all these
> >> over...
> >
> > Maybe signal an eventfd, and let userspace worry about deciding what to
> > do.
>
> This has to be done synchronously. But the vm ring should be used very
> rarely (it's for things like kvmclock updates that write to guest memory
> outside a vCPU), possibly a handful of times in the whole run of the VM.

I've summarized a list of callers that might dirty guest memory in
the other thread; it seems to me that even the kvmclock update is done
in a per-vcpu context.
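
For clarity, the dispatch we are discussing is roughly the following
(a sketch with approximated helper names and signatures, not the exact
patch code):

/*
 * Sketch only: with a vcpu context we push to the per-vcpu ring
 * without taking the lock; otherwise we push to the per-vm ring
 * under the lock and kick vcpu0 when it fills up.
 */
if (vcpu) {
	kvm_dirty_ring_push(&vcpu->dirty_ring, false, slot, offset);
} else {
	kvm_dirty_ring_push(&kvm->vm_dirty_ring, true, slot, offset);
	if (kvm_dirty_ring_full(&kvm->vm_dirty_ring))
		kvm_vcpu_kick(kvm_get_vcpu(kvm, 0));
}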

>
> >>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> >>> So how does userspace know what's legal?
> >>> Do you expect it to just try?
> >>
> >> Yep that's what I thought. :)
>
> We should return it for KVM_CHECK_EXTENSION.

OK. I'll drop the versioning.
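
For reference, userspace would then probe the limit with something
like this (the capability name is my assumption, only the ioctl is
real):

/* KVM_CHECK_EXTENSION would return the max number of ring entries;
 * a return of 0 means the dirty ring is not supported. */
int max_entries = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING);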

--
Peter Xu