Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present

From: Andy Lutomirski
Date: Sun May 21 2017 - 16:19:50 EST


On 05/21/2017 12:53 AM, Roman Penyaev wrote:
On Sun, May 21, 2017 at 5:31 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
On 05/19/2017 09:14 AM, Roman Penyaev wrote:

Hi folks,

After experiencing guest double faults (sometimes triple faults) on
3.16 guest kernels with the following common pattern:


[cut]


Further tracking of VMCB states before and after VMRUN showed, that
CPL becomes 0 when VMEXIT happens with the following SS segment:

ss = {
selector = 0x2b,
attrib = 0x400,
limit = 0xffffffff,
base = 0x0
},

cpl = 0x3

Then on next VMRUN VMCB looks as the following:

ss = {
selector = 0x2b,
attrib = 0x0, <<< dropped to 0
limit = 0xffffffff,
base = 0x0
},

cpl = 0x0, <<< dropped to 0

Obviously it was changed between VMRUN calls. The following backtrace
shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:

CPU: 55 PID: 59531 Comm: kvm
[<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
[<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
[<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
[<ffffffff8180097c>] ? __schedule+0x28c/0xa10
[<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0

SS segment which came from QEMU had the following struct members:

SS->base = 0
SS->limit = ffffffff
SS->selector = 2b
SS->type = 0
SS->present = 0
SS->dpl = 0
SS->db = 0
SS->s = 0
SS->l = 0
SS->g = 0
SS->avl = 0
SS->unusable = 1

Indeed, on last VMEXIT SS segment does not have (P) present bit set in
segment attributes:

(gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
$1 = 0


Huh? How is that even possible? It should not be possible to actually run
the vCPU with a non-NULL SS that isn't present.

That is utterly good question :) I do not know. According to my shallow
understanding (P) bit is only a hint for CPU that corresponding segment was
read from gdt and now is cached in private CPU registers (attributes).
Am I right?

At least what I see that it is quite often the case when we exit from VMRUN
with segment not present then VMRUN is resumed and on next vmexit segment has
correct attributes.

How would you cause it to happen?

We run fio and iperf tests in guests for a couple of days. Nothing more,
nothing special. Guests are bare debians with 3.16 kernels.


Unless... is this the sysret_ss_attrs issue?

What is the issue? This one

https://lkml.org/lkml/2015/4/24/770

Yes.

But I was thinking about it wrong, since this is probably 64-bit userspace, not 32-bit userspace. Here's my theory:

1. User task A does a syscall. It's not in kernel mode with SS != 0.

2. The scheduler runs and switches to task B. SS != 0.

2. Kernel enters user mode for task B.

3. User task B gets interrupted. Kernel ends up running with SS = 0.

4. Kernel switches back to task A. SS == 0.

5. Kernel does SYSRET. SS == __USER_DS, but SS's attributes are messed up.

6. QEMU does whatever it does that inspires it to zap SS's attributes.

7. Boom.

If task B were 32-bit, then the vDSO would fix up SS, so there would only be a 1-instruction window for problems.

To check this theory, you could try backporting this to the guest and seeing if the problem goes away:

commit 61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
Author: Andy Lutomirski <luto@xxxxxxxxxx>
Date: Sun Apr 26 16:47:59 2015 -0700

x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue



So when on VMEXIT we have such SS state (P bit is not set) and QEMU
just decides to synchronize registers the following happens on QEMU
side:

kvm_cpu_synchronize_state():
kvm_arch_get_registers():
...
get_seg():
if (rhs->unusable) {
lhs->flags = 0; <<< SS is unusable [(P) is not set)
<<< all attributes are dropped to 0.
}
cpu->kvm_vcpu_dirty = true; <<< Mark VCPU state as dirty


Looks like the bug is in QEMU, then, right?

KVM SVM restores CPL from unusable selector, obviously this is not nice.

I would imagine that QEMU shouldn't be feeding KVM such a selector. Also, there's an invariant that SS.DPL == CPL, at least most of the time, although this SYSRET issue may be the exception.

Paolo, what's the intended behavior here? Is the bug in KVM or in QEMU?


arch/x86/kvm/svm.c:svm_set_segment():

if (var->unusable)
s->attrib = 0;
...
if (seg == VCPU_SREG_SS)
svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;


Meanwhile QEMU resets attributes, despite the fact that DPL (which is passed
from KVM) is correct.

So it is not clear what is the proper way to fix that. What is clear is
that CPL is set to 0 because of this game with registers on both sides.
Now the question is what side to fix or probably both.


Couldn't you just fix this code
in QEMU by, say, deleting it?

Certainly, but would be nice to listen to KVM maintainers. At least the issue
is clear and what is left is a proper one-line fix :)

--Andy