Re: [PATCH v2] KVM: SVM: use vmsave/vmload for saving/restoring additional host state

From: Michael Roth
Date: Wed Dec 16 2020 - 10:13:06 EST


On Tue, Dec 15, 2020 at 12:17:47PM -0600, Michael Roth wrote:
> On Mon, Dec 14, 2020 at 02:29:46PM -0800, Andy Lutomirski wrote:
> >
> >
> > > On Dec 14, 2020, at 2:02 PM, Michael Roth <michael.roth@xxxxxxx> wrote:
> > >
> > > On Mon, Dec 14, 2020 at 11:38:23AM -0800, Sean Christopherson wrote:
> > >> +Andy, who provided a lot of feedback on v1.
> > >> On Mon, Dec 14, 2020, Michael Roth wrote:
> > >> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> > >>> Suggested-by: Tom Lendacky <thomas.lendacky@xxxxxxx>
> > >>> Signed-off-by: Michael Roth <michael.roth@xxxxxxx>
> > >>> ---
> > >>> v2:
> > >>> * rebase on latest kvm/next
> > >>> * move VMLOAD to just after vmexit so we can use it to handle all FS/GS
> > >>> host state restoration rather than relying on loadsegment() and an
> > >>> explicit write to MSR_GS_BASE (Andy)
> > >>> * drop 'host' field from struct vcpu_svm since it is no longer needed
> > >>> for storing FS/GS/LDT state (Andy)
> > >>> ---
> > >>> arch/x86/kvm/svm/svm.c | 44 ++++++++++++++++--------------------------
> > >>> arch/x86/kvm/svm/svm.h | 14 +++-----------
> > >>> 2 files changed, 20 insertions(+), 38 deletions(-)
> > >>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > >>> index 0e52fac4f5ae..fb15b7bd461f 100644
> > >>> --- a/arch/x86/kvm/svm/svm.c
> > >>> +++ b/arch/x86/kvm/svm/svm.c
> > >>> @@ -1367,15 +1367,19 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > >>> vmcb_mark_all_dirty(svm->vmcb);
> > >>> }
> > >>> -#ifdef CONFIG_X86_64
> > >>> - rdmsrl(MSR_GS_BASE, to_svm(vcpu)->host.gs_base);
> > >>> -#endif
> > >>> - savesegment(fs, svm->host.fs);
> > >>> - savesegment(gs, svm->host.gs);
> > >>> - svm->host.ldt = kvm_read_ldt();
> > >>> -
> > >>> - for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
> > >>> + for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) {
> > >>> rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
> > >>> + }
> > >
> > > Hi Sean,
> > >
> > > Hopefully I've got my email situation sorted out now...
> > >
> > >> Unnecessary change that violates preferred coding style. Checkpatch explicitly
> > >> complains about this.
> > >> WARNING: braces {} are not necessary for single statement blocks
> > >> #132: FILE: arch/x86/kvm/svm/svm.c:1370:
> > >> + for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) {
> > >> rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
> > >> + }
> > >
> > > Sorry, that was an artifact from an earlier version of the patch that I
> > > failed to notice. I'll make sure to run everything through checkpatch
> > > going forward.
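> > >
> > > (For reference, the checkpatch run is just:
> > >
> > > ./scripts/checkpatch.pl 0001-kvm-svm-vmsave-vmload.patch
> > >
> > > with the filename above being a stand-in for the actual patch file;
> > > that's what flags the stray braces quoted above.)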
> > >
> > >>> +
> > >>> + asm volatile(__ex("vmsave")
> > >>> + : : "a" (page_to_pfn(sd->save_area) << PAGE_SHIFT)
> > >> I'm pretty sure this can be page_to_phys().
> > >>> + : "memory");
> > >> I think we can defer this until we're actually planning on running the guest,
> > >> i.e. put this in svm_prepare_guest_switch().
> > >
> > > One downside to that is that we'd need to do the VMSAVE on every
> > > iteration of vcpu_run(), as opposed to just once when we enter from
> > > userspace via KVM_RUN. It's a similar situation to Andy's earlier
> > > suggestion of moving VMLOAD to just after vmexit, but there we were
> > > able to drop an MSR write to MSR_GS_BASE, which cancelled out the
> > > overhead; in this case I think it could only cost us extra.
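> > >
> > > (Roughly, the call flow is:
> > >
> > > KVM_RUN ioctl
> > > vcpu_load() -> svm_vcpu_load() <- once per ioctl
> > > vcpu_run():
> > > for (;;) {
> > > svm_prepare_guest_switch() <- every iteration
> > > vmrun/vmexit
> > > }
> > >
> > > so a VMSAVE in svm_prepare_guest_switch() would run once per
> > > vmexit/re-entry rather than once per entry from userspace.)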
> >
> > If you want to micro-optimize, there is a trick you could play: if
> > X86_FEATURE_FSGSBASE is available, you could use WRGSBASE to restore
> > GSBASE and defer VMLOAD to vcpu_put(). This would need benchmarking on
> > Zen 3 to see if it's worthwhile.
>
> I'll give this a try. The vmsave only seems to cost 100-200 cycles in
> and of itself, so I'm not sure there's much to be gained, but it would
> be good to know either way.
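>
> Roughly what I'm testing (a sketch only; assumes FSGSBASE support per
> X86_FEATURE_FSGSBASE, and "host_gs_base" stands in for wherever the
> host's GS base was stashed before vmrun):
>
>     if (boot_cpu_has(X86_FEATURE_FSGSBASE)) {
>         /* Restore host GS base directly, defer VMLOAD to vcpu_put(). */
>         asm volatile("wrgsbase %0" : : "r" (host_gs_base) : "memory");
>     } else {
>         /* No FSGSBASE: VMLOAD the host save area right after vmexit. */
>         asm volatile(__ex("vmload")
>                      : : "a" (page_to_phys(sd->save_area))
>                      : "memory");
>     }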

It looks like wrgsbase does save us ~20-30 cycles vs. vmload, but maybe
not enough to justify the added complexity. Additionally, since we still
need to call vmload when we exit to userspace, the wrgsbase approach ends
up being a bit slower, for this particular workload at least. So for now
I'll plan on sticking with vmload'ing just after vmexit, and moving that
into the asm code if there are no objections.
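
The numbers below are raw TSC reads (cycles) taken at the KVM_RUN ioctl
boundaries and on either side of vmrun, roughly along the lines of this
sketch (probe placement approximate):

    u64 t_entry, t_pre, t_post, t_exit;

    t_entry = rdtsc();      /* KVM_RUN ioctl entry */
    t_pre   = rdtsc();      /* immediately before vmrun */
    /* ... vmrun/vmexit ... */
    t_post  = rdtsc();      /* immediately after vmexit */
    t_exit  = rdtsc();      /* KVM_RUN ioctl exit */

    pr_info("total cycles: %llu\n", t_exit - t_entry);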

current v2 patch, sample 1
ioctl entry: 1204722748832
pre-vmrun: 1204722749408 ( +576)
post-vmrun: 1204722750784 (+1376)
ioctl exit: 1204722751360 ( +576)
total cycles: 2528

current v2 patch, sample 2
ioctl entry: 1204722754784
pre-vmrun: 1204722755360 ( +576)
post-vmrun: 1204722756720 (+1360)
ioctl exit: 1204722757312 ( +592)
total cycles: 2528

wrgsbase, sample 1
ioctl entry: 1346624880336
pre-vmrun: 1346624880912 ( +576)
post-vmrun: 1346624882256 (+1344)
ioctl exit: 1346624882912 ( +656)
total cycles: 2576

wrgsbase, sample 2
ioctl entry: 1346624886272
pre-vmrun: 1346624886832 ( +560)
post-vmrun: 1346624888176 (+1344)
ioctl exit: 1346624888816 ( +640)
total cycles: 2544
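
For reference, the shape of what I mean by vmload'ing after vmexit is
roughly the following (a sketch, not the final code; it uses
page_to_phys() per Sean's suggestion, names like svm_vcpu_enter_exit()
and sd->save_area follow the surrounding kvm/next code, and the eventual
version would live in the vmenter asm rather than in C):

    svm_vcpu_enter_exit(vcpu, svm);

    /*
     * Restore host FS/GS/TR/LDT (and related MSR) state from the
     * per-CPU save area immediately after vmexit, replacing the old
     * loadsegment() and MSR_GS_BASE restore path.
     */
    asm volatile(__ex("vmload")
                 : : "a" (page_to_phys(sd->save_area))
                 : "memory");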
