Re: [PATCH] KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow

From: Andrew Cooper
Date: Wed Jan 19 2022 - 22:18:20 EST


On 20/01/2022 00:06, Sean Christopherson wrote:
> Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step
> breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or
> MOVSS blocking is active. Setting the flag is necessary to make VM-Entry
> consistency checks happy, as VMX has an invariant that if RFLAGS.TF is
> set and STI/MOVSS blocking is true, then the previous instruction must
> have been STI or MOV/POP, and therefore a single-step #DB must be pending
> since the RFLAGS.TF cannot have been set by the previous instruction,
> i.e. the one instruction delay after setting RFLAGS.TF must have already
> expired.
>
> Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately
> when recording guest state as part of a VM-Exit, but #DB VM-Exits
> intentionally do not treat the #DB as "guest state" as interception of
> the #DB effectively makes the #DB host-owned, thus KVM needs to manually
> set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest.

The problem is that none of this is documented.

Amongst other things, the vmentry consistency check misses the case when
#DB really is pending in ENTRY_INTR_INFO.
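
For reference, the invariant described in the quoted commit message can
be sketched roughly as below. This is only a model: the struct, the
function names and the macro names are made up for illustration (they
are not KVM or Xen identifiers), and the real VM-Entry check has further
conditions that are not modelled here. The second helper mirrors the fix
as the commit message describes it, i.e. set BS by hand when forwarding
an intercepted #DB.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions; the macro names are illustrative only. */
#define RFLAGS_TF         (1u << 8)   /* RFLAGS.TF (single-step enable) */
#define INTR_STATE_STI    (1u << 0)   /* interruptibility: blocking by STI */
#define INTR_STATE_MOVSS  (1u << 1)   /* interruptibility: blocking by MOV SS */
#define PENDING_DBG_BS    (1u << 14)  /* pending debug exceptions: BS */

/* Hypothetical container for the few VMCS guest fields involved. */
struct guest_state {
    uint64_t rflags;
    uint32_t interruptibility;
    uint64_t pending_dbg;
};

static bool in_shadow(const struct guest_state *g)
{
    return g->interruptibility & (INTR_STATE_STI | INTR_STATE_MOVSS);
}

/*
 * Simplified model of the check: TF set while in an STI/MOVSS shadow
 * requires PENDING_DBG.BS to be set.  Other cases are not modelled and
 * are reported as consistent.
 */
static bool pending_dbg_bs_consistent(const struct guest_state *g)
{
    if ( !in_shadow(g) || !(g->rflags & RFLAGS_TF) )
        return true;

    return g->pending_dbg & PENDING_DBG_BS;
}

/*
 * Model of the fix: when re-injecting an intercepted #DB, the CPU has
 * not recorded BS on VM-Exit, so software sets it before VM-Entry.
 */
static void fixup_pending_dbg_for_db_reinject(struct guest_state *g)
{
    if ( in_shadow(g) && (g->rflags & RFLAGS_TF) )
        g->pending_dbg |= PENDING_DBG_BS;
}
```

With TF set, MOVSS blocking active and BS clear, the model reports an
inconsistent state until the fixup has been applied.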


It is very clear that to use VT-x/SVM correctly, required reading
includes the core microcode and RTL, which of course all of us have
access to...

> Note, although this bug can be triggered by guest userspace, doing so
> requires IOPL=3, and guest userspace running with IOPL=3 has full access
> to all I/O ports (from the guest's perspective) and can crash/reboot the
> guest any number of ways. IOPL=3 is required because STI blocking kicks
> in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either
> takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF.
>
> MOVSS blocking can be initiated by userspace, but can be coincident with
> a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
> executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking
> is problematic only for CPL0 (and only if the guest is crazy enough to
> access a DR in a MOVSS shadow). All other sources of #DBs are either
> suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),

It is more complicated than this and undocumented.  Single step is
discarded in a shadow, while data breakpoints are deferred.

I've just run an experiment with code breakpoints, as they're faults
like General Detect:

bool do_unhandled_exception(struct cpu_regs *regs)
{
    static int limit;

    if ( limit++ > 10 )
        return false;

    if ( regs->entry_vector == X86_EXC_DB )
    {
        unsigned int pending_dbg = read_dr6() ^ X86_DR6_DEFAULT;
        unsigned int dr7 = read_dr7(), spurious = 0;

        for ( int i = 0; i < 4; ++i )
            if ( pending_dbg & (1 << i) && ((dr7 >> (2 * i)) & 3) == 0 )
                spurious |= (1 << i);

        printk("#DB at %04x:%p, pending %08x, spurious %x\n",
               regs->cs, _p(regs->ip), pending_dbg ^ spurious, spurious);
        write_dr6(X86_DR6_DEFAULT);

        return true;
    }

    return false;
}

void test_main(void)
{
    extern char l0[] asm ("0f"), l1[] asm ("1f");
    extern char l2[] asm ("2f"), l3[] asm ("3f");
    unsigned int tmp;

    write_cr4(read_cr4() | X86_CR4_DE);

    write_dr0(_u(l0));
    write_dr1(_u(l1));
    write_dr2(_u(l2));
    write_dr3(_u(l3));

    write_dr7(/* DR7_SYM(0, G, X) | */
              /* DR7_SYM(1, G, X) | */
              DR7_SYM(2, G, X) |
              /* DR7_SYM(3, G, X) | */
              X86_DR7_GE);

    asm volatile("mov %%ss, %[tmp];"
                 "pushf;"
                 "pushf;"
                 "orl $"STR(X86_EFLAGS_TF)", (%%"_ASM_SP");"
                 "popf;"
                 "nop;"
                 "0: nop;"
                 "1: mov %[tmp], %%ss;"
                 "2: nop;"
                 "3: popf;"
                 : [tmp] "=r" (tmp));

    /* If the VM is still alive, it didn't suffer a vmentry failure. */
    xtf_success("Success: Not vulnerable to XSA-308\n");
}

$ objdump -d tests/xsa-308/test-hvm64-xsa-308 | grep -A25 '<test_main>:'
001048a0 <test_main>:
  1048a0:    0f 20 e0                 mov    %cr4,%rax
  1048a3:    48 83 c8 08              or     $0x8,%rax
  1048a7:    0f 22 e0                 mov    %rax,%cr4
  1048aa:    b8 df 48 10 00           mov    $0x1048df,%eax
  1048af:    0f 23 c0                 mov    %rax,%db0
  1048b2:    b8 e0 48 10 00           mov    $0x1048e0,%eax
  1048b7:    0f 23 c8                 mov    %rax,%db1
  1048ba:    b8 e2 48 10 00           mov    $0x1048e2,%eax
  1048bf:    0f 23 d0                 mov    %rax,%db2
  1048c2:    b8 e3 48 10 00           mov    $0x1048e3,%eax
  1048c7:    0f 23 d8                 mov    %rax,%db3
  1048ca:    b8 20 02 00 00           mov    $0x220,%eax
  1048cf:    0f 23 f8                 mov    %rax,%db7
  1048d2:    8c d0                    mov    %ss,%eax
  1048d4:    9c                       pushf 
  1048d5:    9c                       pushf 
  1048d6:    81 0c 24 00 01 00 00     orl    $0x100,(%rsp)
  1048dd:    9d                       popf  
  1048de:    90                       nop
  1048df:    90                       nop
  1048e0:    8e d0                    mov    %eax,%ss
  1048e2:    90                       nop
  1048e3:    9d                       popf  
  1048e4:    bf 00 3e 11 00           mov    $0x113e00,%edi
  1048e9:    31 c0                    xor    %eax,%eax

gives

--- Xen Test Framework ---
Environment: HVM 64bit (Long mode 4 levels)
XSA-308 PoC
#DB at 0008:00000000001048df, pending 00004000, spurious 1
#DB at 0008:00000000001048e0, pending 00004000, spurious 2
#DB at 0008:00000000001048e3, pending 00004000, spurious 8
#DB at 0008:00000000001048e4, pending 00004000, spurious 0
Success: Not vulnerable to XSA-308

which suggests that the active code breakpoint in the MovSS shadow is
discarded too, because there is no #DB at the 0x1048e2 boundary.
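
To make the "pending"/"spurious" columns easier to read, here is the
same split pulled out of the handler into a standalone helper. It is a
sketch only: split_dr6() and struct db_info are hypothetical names, and
the DR6 default of 0xffff0ff0 is an assumption about what
X86_DR6_DEFAULT expands to.

```c
#include <stdint.h>

#define DR6_DEFAULT 0xffff0ff0u  /* assumed value of X86_DR6_DEFAULT */
#define DR6_BS      (1u << 14)   /* single-step */

/* Hypothetical result type: what is reported as pending vs. spurious. */
struct db_info {
    uint32_t pending;
    uint32_t spurious;
};

/*
 * Same logic as the handler: flip DR6 against its default so that set
 * bits mean "happened", then treat any B{0..3} bit whose L/G enable
 * bits in DR7 are both clear as spurious.
 */
static struct db_info split_dr6(uint32_t dr6, uint32_t dr7)
{
    uint32_t pending = dr6 ^ DR6_DEFAULT;
    uint32_t spurious = 0;

    for ( int i = 0; i < 4; ++i )
        if ( (pending & (1u << i)) && ((dr7 >> (2 * i)) & 3) == 0 )
            spurious |= 1u << i;

    return (struct db_info){ pending ^ spurious, spurious };
}
```

With dr7 = 0x220 (G2 and GE, as in the disassembly), a raw DR6 with BS
and B0 set splits into pending 0x4000 and spurious 1, matching the first
line of the log.  A B2 hit would stay in the pending column, because
breakpoint 2 is the only one enabled.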

This test is obscured by another bug/misfeature/something where the
B{0..3} get lost on vmexit if BT is also set.

> are mutually exclusive with MOVSS blocking (T-bit task switch),

How so?  MovSS prevents external interrupts from triggering task
switches, but instruction sources still trigger in a shadow.

> or are
> already handled by KVM (ICEBP, a.k.a. INT1).

Other sources of #DB include RTM debugging, with errata causing a
fault-style #DB pointing at the XBEGIN instruction, rather than
vectoring to the abort handler, and splitlock which is new since I last
thought about this problem.

> This bug was originally found by running tests[1] created for XSA-308[2].
> Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
> presumably why the Xen bug was deemed to be an exploitable DOS from guest
> userspace.

As I recall, the original report to the security team was something
along the lines of "Steam has just updated game, and now when I start
it, the VM explodes".

> KVM already handles ICEBP by skipping the ICEBP instruction
> and thus clears MOVSS blocking as a side effect of its "emulation".
>
> [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html

This URL is at the whim of doxygen and not necessarily stable.

https://xenbits.xen.org/gitweb/?p=xtf.git;a=blob;f=tests/xsa-308/main.c
ought to have better longevity, as well as including the test description.

~Andrew