Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

From: Mathias Krause
Date: Fri Feb 17 2023 - 11:07:58 EST


On 16.02.23 18:32, Sean Christopherson wrote:
> On Tue, Feb 14, 2023, Mathias Krause wrote:
>> On 13.02.23 18:05, Sean Christopherson wrote:
>> However, I still remain confident that this makes sense from a uarch point of
>> view. Touching fewer cache lines should be a win -- even if I'm unable to
>> measure it. By preserving more cachelines during a VMEXIT, guests should be
>> able to resume their work faster (assuming they still need these cachelines).
>
> Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result in
> fewer cache lines being touched, at least not in a persistent, maintainable way.
> While every VM-Exit accesses a lot of state, it's most definitely still a small
> percentage of kvm_vcpu. And for the VM-Exits that really benefit from optimized
> handling, a.k.a. the fastpath exits, that percentage is even lower.

Yeah, that's probably true.

> On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu,
> kvm_vcpu_arch, and vcpu_{vmx,svm}. The majority of fields accessed on every VM-Exit
> are in the vendor part, vcpu_{vmx,svm}, but there are still high-touch fields in
> the common structures, e.g. registers in kvm_vcpu_arch and things like requests
> in kvm_vcpu.
>
> Naively compacting any of the three (well, four) structures might result in fewer
> cache lines being touched, but mostly by coincidence. E.g. because compacting
> part of kvm_vcpu_arch happened to better align vcpu_vmx, not because of the
> compaction itself.

Fortunately, kvm_vcpu is embedded as the first member of vcpu_{vmx,svm}, so
all three share a common "header." Optimizations done to kvm_vcpu will
therefore benefit the vendor-specific structures too. However, you're right
that this implicitly changes the layout of the remainder of vcpu_{vmx,svm}
and might even have a negative impact on their cacheline usage. As my
changes chop off exactly 128 bytes -- two cache lines -- from kvm_vcpu, the
alignment of the vendor-specific parts is preserved and that's not a
problem here. But I can see that this is "coincidence" and fragile in the
long run.
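
To illustrate what I mean by the shared "header" (heavily abbreviated, not
the real field lists):

struct kvm_vcpu {
	struct kvm *kvm;
	u64 requests;
	/* ... common state, including kvm_vcpu_arch ... */
};

struct vcpu_vmx {
	struct kvm_vcpu vcpu;	/* common part, must stay the first member */
	/* ... VMX-specific state, e.g. exit_reason, pi_desc, ... */
};

struct vcpu_svm {
	struct kvm_vcpu vcpu;	/* same here */
	/* ... SVM-specific state ... */
};

Shrinking kvm_vcpu shifts everything below 'vcpu' in the vendor structs, so
only shaving off a multiple of 64 bytes keeps their cacheline alignment
intact.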

> If we really wanted to optimize cache line usage, our best bet would be to identify
> the fields that are accessed in the fastpath run loop and then pack those fields
> into as few cache lines as possible. And to do that in a maintainable way, we'd
> probably need something like ASI[*] to allow us to detect changes that result in
> the fastpath handling accessing fields that aren't in the cache-optimized region.
>
> I'm not necessarily opposed to such aggressive optimization, but the ROI is likely
> very, very low. For optimized workloads, there simply aren't very many VM-Exits,
> e.g. the majority of exits on a modern CPU are due to timer ticks. And even those
> will hopefully be eliminated in the not-too-distant future, e.g. by having hardware
> virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that
> allows for a tickless host.

Well, for guests running grsecurity kernels, there's also the constant
CR0.WP toggling triggering VMEXITs -- and that happens a lot! At least
until something along the lines of [1] gets merged *hint ;)*

[1]
https://lore.kernel.org/all/20230201194604.11135-1-minipli@xxxxxxxxxxxxxx/
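
For context, the guest-side pattern is roughly the following (illustrative
only, the names are made up and the real code lives out of tree). Every
write that actually flips CR0.WP traps to KVM as long as the bit isn't
guest-owned -- which is exactly what [1] changes:

static inline void kernel_wp_off(void)
{
	unsigned long cr0;

	/* open a window for writing to read-only kernel data */
	asm volatile("mov %%cr0, %0" : "=r" (cr0));
	asm volatile("mov %0, %%cr0" : : "r" (cr0 & ~X86_CR0_WP) : "memory");
}

static inline void kernel_wp_on(void)
{
	unsigned long cr0;

	/* ...and close it again: another intercepted CR0 write */
	asm volatile("mov %%cr0, %0" : "=r" (cr0));
	asm volatile("mov %0, %%cr0" : : "r" (cr0 | X86_CR0_WP) : "memory");
}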

>
> https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx

Heh, that RFC is from February last year and it looks like it stalled at
that point. But I guess you mainly meant patch 44 anyway, which splits up
kvm_vcpu_arch:
https://lore.kernel.org/all/20220223052223.1202152-45-junaids@xxxxxxxxxx/
It does that for other purposes, though, which might conflict with the
performance aspect I'm mostly after here. Anyway, I get your point: if we
care about cacheline footprint, we should make a more radical change and
group the hot members together instead of simply shrinking the structs
involved.
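
Something along these lines, purely to illustrate the idea -- the names are
hypothetical and the fields are just examples of fastpath-touched state,
not a proposed layout:

/*
 * Gather everything the fastpath run loop touches into one
 * cacheline-aligned blob and keep slow-path state out of it.
 */
struct vcpu_hot {
	unsigned long regs[NR_VCPU_REGS];
	unsigned long cr2;
	u64 requests;
	/* ... */
} ____cacheline_aligned;

struct kvm_vcpu_arch {
	struct vcpu_hot hot;	/* first, so it starts cacheline-aligned */
	/* ... cold, slow-path and setup-only state below ... */
};

Plus, as you say, something ASI-like would be needed to notice when the
fastpath starts touching fields outside the hot part.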

>>> And as you observed, imperfect struct layouts are highly unlikely to have a
>>> measurable impact on performance. The types of operations that are involved in
>>> a world switch are just too costly for the layout to matter much. I do like to
>>> shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
> more performant, doesn't require ongoing maintenance, and/or also improves the code
>>> quality.
>>
>> Any pointers to measure the "more performant" aspect?
>
> TL;DR: not really.
>
> What I've done in the past is run a very tight loop in the guest, and then measure
> latency from the host by hacking KVM. Measuring from the guest works, e.g. we have
> a variety of selftests that do exactly that, but when trying to shave cycles in
> the VM-Exit path, it doesn't take many host IRQs arriving at the "wrong" time to
> skew the measurement. My quick-and-dirty solution has always been to just hack
> KVM to measure latency with IRQs disabled, but a more maintainable approach would
> be to add smarts somewhere to sanitize the results, e.g. to throw away outliers
> where the guest likely got interrupted.
>
> I believe we've talked about adding a selftest to measure fastpath latency, e.g.
> by writing MSR_IA32_TSC_DEADLINE in a tight loop.
>
> However, that's not going to be useful in this case since you are interested in
> measuring the impact of reducing the host's L1D footprint. If the guest isn't
> cache-constrained, reducing the host's cache footprint isn't going to impact
> performance since there's no contention.

Yeah, it's hard to find a test case that measures the gains. I looked into
running Linux userland workloads initially, but saw no real impact, as the
stddev was already too high. But, as you pointed out, a micro-benchmark is
of no use either, so it's all hand-waving only. :D

> Running a micro benchmark in the guest that aims to measure cache performance might
> work, but presumably those are all userspace tests, i.e. you'd end up measuring
> the impact of the guest kernel too. And they wouldn't consistently trigger VM-Exits,
> so it would be difficult to prove the validity of the results.

Yep. It's all just gut feeling, unfortunately.

> I suppose you could write a dedicated selftest or port a micro benchmark to run
> as a selftest (or KVM-unit-test)?
>
>> I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's
>> already counting cycles, but the numbers are too unstable, even if I pin the
>> test to a given CPU, disable turbo mode, SMT, use the performance cpu
>> governor, etc.
>
> Heh, you might have picked quite possibly the worst way to measure VM-Exit
> performance :-)
>
> The guest code in that test that's measuring latency runs at L2. For VMREADs
> and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will occur
> (the access will be handled purely in ucode). And for accesses that do cause a
> VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which is a _very_
> heavy path (~10k cycles). Even if the exit is handled by KVM (in L0), it's still
> a relatively slow, heavy path.

I see. I'll have a look at the selftests and see if I can repurpose one of
them. But, as you noted, a micro-benchmark might not be what I'm after.
It's more about identifying the access patterns of the hot VMEXIT paths
and optimizing those.
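
If I end up writing one, the guest side would probably look something like
this -- just a sketch in the style of the selftests' guest code, not an
existing test, and the host side (including the outlier filtering you
mentioned) is omitted:

/*
 * Hypothetical fastpath-latency guest loop, built on the selftests'
 * wrmsr()/rdtsc() helpers and ucall interface.
 */
#define NR_ITERATIONS	1000000

static void guest_code(void)
{
	uint64_t start, cycles;
	unsigned long i;

	start = rdtsc();
	for (i = 0; i < NR_ITERATIONS; i++)
		wrmsr(MSR_IA32_TSC_DEADLINE, 0);	/* should hit the WRMSR fastpath */
	cycles = rdtsc() - start;

	GUEST_SYNC(cycles / NR_ITERATIONS);	/* report the average to the host */
	GUEST_DONE();
}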

>>> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
>>> a sub-optimal layout and the change is arguably warranted even without the change
>>> in size. Ditto for kvm_pmu, logically I think it makes sense to have the version
>>> at the very top.
>>
>> Yeah, was exactly thinking the same when modifying kvm_pmu.
>>
>>> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
>>> fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. unless there's a truly
>>> egregious field(s) just isn't worth the cost in the long term.
>>
>> Heh, just found this gem in vcpu_vmx:
>>
>> struct vcpu_vmx {
>> [...]
>> union vmx_exit_reason exit_reason;
>>
>> /* XXX 44 bytes hole, try to pack */
>>
>> /* --- cacheline 123 boundary (7872 bytes) --- */
>> struct pi_desc pi_desc __attribute__((__aligned__(64)));
>> [...]
>>
>> So there are, in fact, some bigger holes left.
>
> Ya. Again, I'm definitely ok cleaning up the truly heinous warts and/or doing
> a targeted, deliberate refactor of structures. What I don't want to do is
> shuffle fields around purely to save a few bytes here and there.

Got it. I'll back out the reshuffling patches and only keep the ones for
kvm_pmu and kvm_mmu_memory_cache, as those are more like straightforward
cleanups.
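
For kvm_mmu_memory_cache, the gist is simply to stop interleaving 4-byte
and 8-byte members so the pointers don't drag in padding. A generic
illustration of the padding math on x86-64 (deliberately not the real
struct or patch):

struct cache_sparse {
	int nobjs;		/* 4 bytes + 4 bytes padding */
	void *kmem_cache;	/* 8 bytes */
	int capacity;		/* 4 bytes + 4 bytes padding */
	void **objects;		/* 8 bytes */
};				/* 32 bytes */

struct cache_packed {
	int nobjs;		/* 4 bytes */
	int capacity;		/* 4 bytes, fills the former hole */
	void *kmem_cache;	/* 8 bytes */
	void **objects;		/* 8 bytes */
};				/* 24 bytes */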

Thanks,
Mathias