Re: [RFC] Circumventing FineIBT Via Entrypoints

From: Andrew Cooper
Date: Wed Feb 12 2025 - 21:42:22 EST


On 13/02/2025 2:09 am, Jann Horn wrote:
> On Thu, Feb 13, 2025 at 2:31 AM Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>>>> Assuming this is an issue you all feel is worth addressing, I will
>>>> continue working on providing a patch. I'm concerned though that the
>>>> overhead from adding a wrmsr on both syscall entry and exit to
>>>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
>>>> any feedback in regards to the approach or suggestions of alternate
>>>> approaches to patching are welcome :)
>>> Since the kernel, as far as I understand, uses FineIBT without
>>> backwards control flow protection (in other words, I think we assume
>>> that the kernel stack is trusted?),
>> This is fun indeed. Linux cannot use supervisor shadow stacks because
>> the mess around NMI re-entrancy (and IST more generally) requires ROP
>> gadgets in order to function safely. Implementing this with shadow
>> stacks active, while not impossible, is deemed to be prohibitively
>> complicated.
>>
>> Linux's supervisor shadow stack support is waiting for FRED support,
>> which fixes both the NMI re-entrancy problem, and other exceptions
>> nesting within NMIs, as well as prohibiting the use of the SWAPGS
>> instruction as FRED tries to make sure that the correct GS is always in
>> context.
>>
>> But, FRED support is slated for PantherLake/DiamondRapids which haven't
>> shipped yet, so are no use to the problem right now.
>>
>>> could we build a cheaper
>>> check on that basis somehow? For example, maybe we could do something like:
>>>
>>> ```
>>> endbr64
>>> test rsp, rsp
>>> js slowpath
>>> swapgs
>>> ```
>> I presume it's been pointed out already, but there are 3 related
>> entrypoints here. SYSCALL64 which is discussed, SYSCALL32 and SYSENTER
>> which are related.
>>
>> But, any other IDT entry is in a similar bucket. If we're corrupting a
>> function pointer or return address to redirect here, then the check of
>> CS(%rsp) to control the conditional SWAPGS is an OoB read in the caller's
>> stack frame.
>>
>> For IDT entries, checking %rsp is reasonable, because userspace can't
>> forge a kernel-like %rsp. However, SYSCALL64 specifically leaves %rsp
>> entirely attacker controlled (and even potentially non-canonical), so
>> I'm wondering what you had in mind for the slowpath to truly
>> distinguish kernel context from user context?
> Hm, yeah, that seems hard - maybe the best we could do is to make sure
> that the inactive gsbase has the correct value for our CPU's kernel
> gsbase? Kinda like a paranoid_entry, except more painful because we'd
> first have to figure out a place to spill registers to before we can
> start using stuff like rdmsr... Then a function pointer overwrite
> might still turn into returning to userspace with a sysret with GPRs
> full of kernel pointers, but at least we wouldn't run off of a bogus
> gsbase anymore?

Thinking about this some more, I think it's impossible to distinguish.

One of the many sharp edges of SYSCALL (and SYSENTER for that matter) is
that they're instructions expected to only be used by userspace, but
that can be executed in supervisor mode too[1].  They're asymmetric with
their SYSRET (and SYSEXIT) counterparts which are CPL0 instructions that
strictly transition into CPL3.

The SYSCALL behaviour TLDR is:

    %rcx = %rip
    %r11 = %rflags
    %cs = fixed attr
    %ss = fixed attr
    %rip = MSR_LSTAR

which means that %rcx (old rip) is the only piece of state which
userspace can't feasibly forge (and therefore could distinguish a
SYSCALL from user vs kernel mode), yet if we're talking about a JOP
chain to get here, then %rcx is under attacker control too.
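
To make that concrete, here is a toy C model of the summary above.  It is
only a sketch: the struct, the function names, the selector values and
the addresses are all made up for illustration, and the RFLAGS masking
done via MSR_SFMASK is ignored.  It compares what the entry point sees
after a genuine SYSCALL from userspace with what it sees after an
in-kernel indirect branch to MSR_LSTAR where the attacker has loaded the
registers beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the state visible at the SYSCALL entry point. */
struct entry_state {
    uint64_t rip, rcx, r11, rsp;
    uint16_t cs, ss;
};

/* Illustrative values only; not taken from any real kernel build. */
#define MODEL_LSTAR     0xffffffff81000000ULL
#define MODEL_KERNEL_CS 0x10
#define MODEL_KERNEL_SS 0x18

/* The SYSCALL instruction, following the summary above. */
static void model_syscall(struct entry_state *s, uint64_t rflags)
{
    assert(s != NULL);
    s->rcx = s->rip;            /* hardware saves the next-instruction rip */
    s->r11 = rflags;
    s->cs  = MODEL_KERNEL_CS;   /* fixed attributes */
    s->ss  = MODEL_KERNEL_SS;
    s->rip = MODEL_LSTAR;
    /* %rsp is left exactly as userspace had it. */
}

/* A corrupted function pointer in the kernel: only %rip changes. */
static void model_hijacked_branch(struct entry_state *s)
{
    assert(s != NULL);
    s->rip = MODEL_LSTAR;
}

static int same_state(const struct entry_state *a, const struct entry_state *b)
{
    return a->rip == b->rip && a->rcx == b->rcx && a->r11 == b->r11 &&
           a->rsp == b->rsp && a->cs == b->cs && a->ss == b->ss;
}

/* Returns 1 if the entry point cannot tell the two cases apart. */
static int entry_states_match(void)
{
    /* Genuine SYSCALL from userspace. */
    struct entry_state user = {
        .rip = 0x401000, .rsp = 0x7ffd00001000ULL, .cs = 0x33, .ss = 0x2b,
    };
    model_syscall(&user, 0x202);

    /* Already in the kernel; a JOP chain has set the GPRs to match. */
    struct entry_state forged = {
        .rip = 0xffffffff81234567ULL,
        .rcx = 0x401000, .r11 = 0x202, .rsp = 0x7ffd00001000ULL,
        .cs = MODEL_KERNEL_CS, .ss = MODEL_KERNEL_SS,
    };
    model_hijacked_branch(&forged);

    return same_state(&user, &forged);
}
```

In this model entry_states_match() returns 1: nothing in the register
state at the entry point separates the two cases, which leaves only state
such as the GS base MSRs that the entry code cannot cheaply inspect.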

There are a variety of solutions to this problem that involve not using
%gs for per-cpu data.  I also expect that to be wholly unpopular and
dismissed as an approach.

~Andrew

[1] No-one back then was brave enough to design CPL3-only instructions.