Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state

From: Sean Christopherson
Date: Tue Oct 15 2024 - 11:59:55 EST


On Tue, Oct 15, 2024, Vitaly Kuznetsov wrote:
> Nikolas Wipper <nik.wipper@xxxxxx> writes:
>
> > On 10.10.24 10:57, Vitaly Kuznetsov wrote:
>
> ...
>
> >>> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
> >>> +
> >>> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
> >>> +{
> >>> + return vcpu->arch.hyperv_enabled &&
> >>> + READ_ONCE(vcpu->arch.hyperv->suspended);
> >>
> >> I don't think READ_ONCE() means anything here, does it?
> >>
> >
> > It does prevent compiler optimisations and is actually required[1]. Also
> > it makes clear that this variable is shared, and may be accessed from
> > remote CPUs.
> >
> > [1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access
>
> It certainly does no harm but I think if we follow 'Loads from and
> stores to shared (but non-atomic) variables should be protected with the
> READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them
> all over KVM/kernel ;-) And personally, this makes reading the code
> harder.
>
> To my (very limited) knowledge, we really need READ_ONCE()s when we need
> to have some sort of a serialization, e.g. the moment when this read
> happens actually makes a difference. If we can e.g. use a local variable
> in the beginning of a function and replace all READ_ONCE()s with
> reading this local variable -- then we don't need READ_ONCE()s and are
> OK with possible compiler optimizations. Similar (reversed) thoughts go
> to WRITE_ONCE().
>
> I think it's OK to keep them but it would be nice (not mandatory IMO,
> but nice) to have a comment describing which particular synchronization
> we are achieving (== the compiler optimization scenario we are protecting
> against).
>
> In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly
> looked at all kvm_hv_vcpu_suspended() call sites (there are three) in
> your series but couldn't think of a place where the READ_ONCE() makes a
> real difference. kvm_hv_hypercall_complete() looks pretty safe
> anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified
> significantly if we merge 'suspended' with 'waiting_on': instead of
>
> kvm_for_each_vcpu(i, v, vcpu->kvm) {
> vcpu_hv = to_hv_vcpu(v);
>
> if (kvm_hv_vcpu_suspended(v) &&
> READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
> ...
>
> you will have just
>
> kvm_for_each_vcpu(i, v, vcpu->kvm) {
> vcpu_hv = to_hv_vcpu(v);
>
> if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) {
> ...
> (and yes, I also think that READ_ONCE() is superfluous here, as real
> (non-speculative) write below can't happen _before_ the check )
>
> The last one, kvm_vcpu_running(), should also be indifferent to
> READ_ONCE() in kvm_hv_vcpu_suspended(). I may had missed something, of
> course, but I hope you got my line of thought.

I don't think you're missing anything. In general, all of this code is more than
a bit heavy-handed and lacks any kind of precision, which makes it *really* hard
to see what actually guarantees a vCPU won't get stuck blocking.

Writers synchronize SRCU and readers are required to acquire SRCU, but there's
no actual data tagged as being protected by SRCU, i.e. tlb_flush_inhibit should
be __rcu.

All of the {READ,WRITE}_ONCE() stuff provides some implicit compiler barriers,
but the actual protection to ensure a vCPU either observes inhibit=false or a wake
event is provided by the smp_wmb() in __kvm_make_request().

And from a performance perspective, synchronizing on kvm->srcu is going to be
susceptible to random slowdowns, because writers will have to wait until all vCPUs
drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs
are faulting in memory from swap, uninhibiting a TLB flushes could be stalled
unnecessarily for an extended duration.

Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically
misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely
more appropriate.

Before we spend too much time cleaning things up, I want to first settle on the
overall design, because it's not clear to me that punting HvTranslateVirtualAddress
to userspace is a net positive. We agreed that VTLs should be modeled primarily
in userspace, but that doesn't automatically make punting everything to userspace
the best option, especially given the discussion at KVM Forum with respect to
mplementing VTLs, VMPLs, TD partitions, etc.

The cover letters for this series and KVM_TRANSLATE2 simply say they're needed
for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt
HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and
KVM. And it very much is a split, because there are obviously a lot of details
around TlbFlushInhibit that bleed into KVM.

Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The
TLFS just says

After the memory intercept routine performs instruction completion, it should
clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register.

but I can't find anything that says _how_ it clears TlbFlushInhibit.

[*] https://lore.kernel.org/all/20240609154945.55332-8-nsaenz@xxxxxxxxxx