Re: [PATCH 2/2] KVM: nSVM: temporarly save vmcb12's efer, cr0 and cr4 to avoid TOC/TOU races

From: Sean Christopherson
Date: Wed Aug 11 2021 - 19:25:32 EST


On Wed, Aug 11, 2021, Maxim Levitsky wrote:
> On Mon, 2021-08-09 at 16:53 +0200, Emanuele Giuseppe Esposito wrote:
> > @@ -1336,7 +1335,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> > if (!(save->cr0 & X86_CR0_PG) ||
> > !(save->cr0 & X86_CR0_PE) ||
> > (save->rflags & X86_EFLAGS_VM) ||
> > - !nested_vmcb_valid_sregs(vcpu, save))
> > + !nested_vmcb_valid_sregs(vcpu, save, save->efer, save->cr0,
> > + save->cr4))
> > goto out_free;
> >
> > /*
> The disadvantage of my approach is that fields are copied twice, once from
> vmcb12 to its local copy, and then from the local copy to vmcb02, however
> this approach is generic in such a way that TOC/TOI races become impossible.
>
> The disadvantage of your approach is that only some fields are copied and
> there is still a chance of TOC/TOI race in the future.

The partial copy makes me nervous too. I also don't like pulling out select
registers and passing them by value; IMO the resulting code is harder to follow
and will be more difficult to maintain, e.g. it won't scale if the list of regs
to check grows.

But I don't think we need to copy _everything_. There's also an opportunity to
clean up svm_set_nested_state(), though the ABI ramifications may be problematic.

Instead of passing vmcb_control_area and vmcb_save_area to nested_vmcb_check_controls()
and nested_vmcb_valid_sregs(), pass svm_nested_state and force the helpers to extract
the save/control fields from the nested state. If a new check is added to KVM, it
will be obvious (and hopefully fail) if the state being checked is not copied from vmcb12.
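
A rough userspace sketch of that shape, to make the idea concrete. None of this is the real KVM code: the struct layouts, the bit macros and the helper names (cache_save_from_vmcb12(), nested_save_valid()) are simplified stand-ins invented for illustration. The point is only that the validator takes the nested state and nothing else, so it cannot be pointed at guest-writable memory by accident.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the architectural bits; not the kernel's macros. */
#define CR0_PE   (1ULL << 0)
#define CR0_PG   (1ULL << 31)
#define CR4_PAE  (1ULL << 5)
#define EFER_LME (1ULL << 8)

/* Hypothetical subset of the save area that the checks consume. */
struct save_cache {
	uint64_t efer;
	uint64_t cr0;
	uint64_t cr4;
};

/* Hypothetical stand-in for svm->nested: a kernel-owned snapshot. */
struct nested_state {
	struct save_cache save;
};

/*
 * Snapshot the fields exactly once.  "volatile" models memory that the
 * guest may rewrite at any time; each field is read a single time.
 */
static void cache_save_from_vmcb12(struct nested_state *n,
				   const volatile struct save_cache *vmcb12)
{
	assert(n && vmcb12);
	n->save.efer = vmcb12->efer;
	n->save.cr0  = vmcb12->cr0;
	n->save.cr4  = vmcb12->cr4;
}

/*
 * The checker only accepts the nested state, so every value it looks at
 * necessarily comes from the snapshot.  The check itself is a reduced
 * example: long mode with paging enabled requires PAE and PE.
 */
static bool nested_save_valid(const struct nested_state *n)
{
	const struct save_cache *s = &n->save;

	if ((s->efer & EFER_LME) && (s->cr0 & CR0_PG)) {
		if (!(s->cr4 & CR4_PAE) || !(s->cr0 & CR0_PE))
			return false;
	}
	return true;
}
```

With this layout, a guest that flips CR4.PAE in vmcb12 after the snapshot has no effect on the outcome of nested_save_valid(); the change is only seen on the next explicit copy.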

Regarding svm_set_nested_state(), if we can clobber svm->nested.ctl and svm->nested.save
(doesn't exist currently) on a failed ioctl(), then the temporary allocations for those
can be replaced with using svm->nested as the buffer.
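
To illustrate the trade-off, here is a hedged userspace model, not the actual svm_set_nested_state(). The struct nested, its "active" flag and both function names are hypothetical, memcpy() stands in for the copy from userspace, and the checks are reduced to a non-zero ASID plus the CR0.PG/CR0.PE test from the hunk quoted above. It shows the ABI-visible side effect: a rejected ioctl() leaves the cached ctl/save overwritten.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CR0_PE (1ULL << 0)
#define CR0_PG (1ULL << 31)

/* Hypothetical, heavily reduced control and save areas. */
struct ctl_area  { uint32_t asid; };
struct save_area { uint64_t cr0; };

/* Hypothetical stand-in for svm->nested, including a .save member. */
struct nested {
	struct ctl_area  ctl;
	struct save_area save;
	bool active;	/* illustrative flag, not a real KVM field */
};

/* Reduced consistency checks, all performed on the cached copy. */
static bool nested_state_valid(const struct nested *n)
{
	return n->ctl.asid != 0 &&
	       (n->save.cr0 & CR0_PG) &&
	       (n->save.cr0 & CR0_PE);
}

/*
 * Copy straight into the persistent cache instead of a temporary
 * allocation, then validate.  On failure the previous cache contents
 * are already gone, which is the "clobber" being discussed.
 */
static int set_nested_state(struct nested *n,
			    const void *user_ctl, const void *user_save)
{
	assert(n && user_ctl && user_save);
	memcpy(&n->ctl, user_ctl, sizeof(n->ctl));
	memcpy(&n->save, user_save, sizeof(n->save));

	if (!nested_state_valid(n)) {
		n->active = false;
		return -EINVAL;
	}
	n->active = true;
	return 0;
}
```

Marking the state inactive on failure is one possible policy for this sketch; whether clobbering is acceptable at all is exactly the ABI question raised above.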

And to mitigate the cost of copying to a kernel-controlled cache, we should use
the VMCB Clean bits as they're intended.

  Each set bit in the VMCB Clean field allows the processor to load one guest
  register or group of registers from the hardware cache;

E.g. copy from vmcb12 iff the clean bit is clear. Then we could further optimize
nested VMRUN to skip checks based on clean bits.
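
A minimal sketch of the clean-bit idea, again as a userspace model with made-up names (struct vmcb12_view, struct nested_cache, refresh_cr_group()). CLEAN_CR is meant to model the clean bit that covers the CR0/CR3/CR4/EFER group; treat the bit position as an assumption of the sketch. The cr_valid flag is also an assumption: it models "the cache was filled from this same vmcb12", without which the clean bits must not be trusted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative clean bit for the CR0/CR3/CR4/EFER group. */
#define CLEAN_CR (1u << 5)

struct cr_group {
	uint64_t efer;
	uint64_t cr0;
	uint64_t cr3;
	uint64_t cr4;
};

/* Hypothetical view of the guest-owned vmcb12. */
struct vmcb12_view {
	uint32_t clean;
	struct cr_group cr;
};

/* Hypothetical kernel-owned cache of the same group. */
struct nested_cache {
	struct cr_group cr;
	bool cr_valid;	/* cache was previously filled from this vmcb12 */
};

/*
 * Copy the group from vmcb12 iff its clean bit is clear (or the cache
 * has never been filled).  Returns true when the group was re-read,
 * i.e. when the consistency checks on it need to run again.
 */
static bool refresh_cr_group(struct nested_cache *c,
			     const struct vmcb12_view *v)
{
	uint32_t clean;

	assert(c && v);
	clean = v->clean;	/* read the guest-owned field once */

	if (c->cr_valid && (clean & CLEAN_CR))
		return false;	/* guest says unchanged: keep cache, skip checks */

	c->cr = v->cr;
	c->cr_valid = true;
	return true;
}
```

The return value is what would let nested VMRUN skip the checks: if the group was not re-read, the previously validated cached values are still the ones in use.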