Re: [patch 00/38] x86/retbleed: Call depth tracking mitigation

From: Peter Zijlstra
Date: Wed Jul 20 2022 - 17:14:13 EST


On Tue, Jul 19, 2022 at 10:19:18AM -0700, Sami Tolvanen wrote:

> Clang's current CFI implementation is somewhat similar to this. It
> creates separate thunks for address-taken functions and changes
> function addresses in C code to point to the thunks instead.
>
> While this works, it creates painful situations when interacting with
> assembly (e.g. a function address taken in assembly cannot be used
> for indirect calls in C as it doesn't point to the thunk) and needs
> unpleasant hacks when we want to take the actual function address in C
> (i.e. scattering the code with function_nocfi() calls).
>
> I have to agree with Peter on this, I would rather avoid messing with
> function pointers in KCFI to avoid these issues.

It is either this (and I think I can avoid the worst of it, see below),
or grow the indirect callsites to obscure the immediate, as Linus
suggested. There are ~16k indirect callsites in a defconfig-ish kernel,
so growing them isn't too horrible, but it isn't nice either.

The prettiest option to obscure the immediate at the callsite I could
conjure up is something like:

kcfi_caller_linus:
movl $0x12345600, %r10d
movb $0x78, %r10b
cmpl %r10d, -OFFSET(%r11)
je 1f
ud2
1: call __x86_indirect_thunk_r11

Which comes to around 22 bytes (+5 over the original).
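
To make the byte count and the split immediate concrete, here is a small
C model. It is only a sketch: the helper names are made up for
illustration, and the per-instruction sizes assume the usual x86-64
encodings with an 8-bit displacement for -OFFSET.

```c
#include <stdint.h>

/*
 * Hypothetical model of the two-step load: movl writes all 32 bits of
 * %r10d, then movb overwrites only the low byte (%r10b), so the full
 * hash never appears as one contiguous immediate in the text.
 */
static uint32_t load_split_hash(uint32_t upper24, uint8_t low8)
{
	uint32_t r10d = upper24;		/* movl $0x12345600, %r10d */

	r10d = (r10d & 0xffffff00u) | low8;	/* movb $0x78, %r10b */
	return r10d;
}

/* Size in bytes of the callsite that hides the immediate. */
static unsigned int callsite_size_split(void)
{
	return 6	/* movl $imm32, %r10d */
	     + 3	/* movb $imm8, %r10b */
	     + 4	/* cmpl %r10d, disp8(%r11) */
	     + 2	/* je */
	     + 2	/* ud2 */
	     + 5;	/* call rel32 */
}

/* Size in bytes of the plain callsite with the hash as a cmpl immediate. */
static unsigned int callsite_size_plain(void)
{
	return 8	/* cmpl $imm32, disp8(%r11) */
	     + 2	/* je */
	     + 2	/* ud2 */
	     + 5;	/* call rel32 */
}
```

The difference between the two sums is the +5 quoted above.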

Joao suggested putting part of that in the retpoline thunk like:

kcfi_caller_joao:
movl $0x12345600, %r10d
movb $0x78, %r10b
call __x86_thunk_indirect_cfi

__x86_thunk_indirect_cfi:
cmpl %r10d, -OFFSET(%r11)
je 1f
ud2
1:
call 1f
int3
1:
mov %r11, (%rsp)
ret
int3

The only down-side there is that eIBRS hardware doesn't need retpolines
(given we currently default to ignoring Spectre-BHB) and as such this
doesn't really work nicely (we don't want to re-introduce funneling).


The other option I came up with, alluded to above, is below, and having
written it out, I'm pretty sure I favour just growing the indirect
callsite as per Linus' option above.

Suppose:

indirect_callsite:
cmpl $0x12345678, -6(%r11) # 8
je 1f # 2
ud2 # 2
1: call __x86_indirect_thunk_r11 # 5 (-> .retpoline_sites)


__cfi_\func:
movl $0x12345678, %eax # 5
int3 # 1
int3 # 1
\func: # aligned 16
endbr # 4
nop12 # 12
call __fentry__ # 5
...
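
The -6(%r11) in the callsite follows from the preamble layout: the movl
opcode is 1 byte, its imm32 is 4 bytes, and two int3 bytes follow, so the
hash starts 6 bytes before \func. A rough C sketch of that layout, with
hypothetical helper names and a plain byte buffer standing in for kernel
text:

```c
#include <stdint.h>
#include <string.h>

#define CFI_PREAMBLE_SIZE	7	/* movl $imm32, %eax (5) + int3 + int3 */
#define CFI_HASH_OFFSET		6	/* the imm32 starts 6 bytes before \func */

/*
 * Write a __cfi_\func style preamble into buf and return the address
 * where \func itself would start. Illustration only, not kernel code.
 */
static uint8_t *emit_cfi_preamble(uint8_t *buf, uint32_t hash)
{
	buf[0] = 0xb8;				/* opcode of movl $imm32, %eax */
	memcpy(&buf[1], &hash, sizeof(hash));	/* host byte order; little-endian on x86 */
	buf[5] = 0xcc;				/* int3 */
	buf[6] = 0xcc;				/* int3 */
	return buf + CFI_PREAMBLE_SIZE;
}

/* Model of "cmpl $hash, -6(%r11)": compare against the bytes before func. */
static int cfi_hash_matches(const uint8_t *func, uint32_t expected)
{
	uint32_t found;

	memcpy(&found, func - CFI_HASH_OFFSET, sizeof(found));
	return found == expected;
}
```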


And for functions that do not get their address taken:


\func: # aligned 16
nop16 # 16
call __fentry__ # 5
...



Instead, extend the objtool .call_sites to also include tail-calls and
for:

- regular (!SKL, !IBT) systems;
* patch all direct calls/jmps to +16 (.call_sites)
* static_call/ftrace/etc.. can trivially add the +16
* retpolines can do +16 for the indirect calls
* return thunks are patched to ret;int3 (.return_sites)

(indirect calls for eIBRS which don't use retpoline
simply eat the nops)


- SKL systems;
* patch the first 16 bytes into:

nop6
sarq $5, PER_CPU_VAR(__x86_call_depth)

* patch all direct calls to +6 (.call_sites)
* patch all direct jumps to +16 (.call_sites)
* static_call/ftrace adjust to +6/+16 depending on instruction type
* retpolines are split between call/jmp and do +6/+16 resp.
* return thunks are patched to x86_return_skl (.return_sites)
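
For readers who have not followed the call depth patches: the sarq above
shifts sign bits into the per-CPU counter on every call. A minimal C
model of that arithmetic follows. The starting value 0x8000000000000000
is an assumption taken from the call depth tracking series and is not
stated in this mail; the function names are made up.

```c
#include <stdint.h>

#define DEPTH_INIT	0x8000000000000000ULL	/* assumed initial counter value */
#define DEPTH_SHIFT	5			/* matches "sarq $5" */

/*
 * 64-bit arithmetic right shift built from unsigned operations, so it
 * does not rely on implementation-defined signed shifts. n must be 1..63.
 */
static uint64_t sar64(uint64_t v, unsigned int n)
{
	uint64_t fill = (v >> 63) ? (~0ULL << (64 - n)) : 0;

	return (v >> n) | fill;
}

/* Counter value after the given number of calls with no returns. */
static uint64_t depth_after_calls(unsigned int calls)
{
	uint64_t depth = DEPTH_INIT;

	while (calls--)
		depth = sar64(depth, DEPTH_SHIFT);
	return depth;
}
```

Under that assumption each call sets five more top bits, and the counter
saturates at all ones after 13 nested calls.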


- IBT systems;
* patch the first 16 bytes to:

endbr # 4
xorl $0x12345678, %r10d # 7
je 1f # 2
ud2 # 2
nop # 1
1:

* patch the callsites to: (.retpoline_sites)

movl $0x12345678, %r10d # 7
call *%r11 # 3
nop7 # 7

* patch all the direct calls/jmps to +16 (.call_sites)
* static_call/ftrace/etc.. add +16
* return thunks are patched to ret;int3 (.return_sites)
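
The IBT variant moves the check into the callee: the caller loads its
expected hash into %r10d and the callee XORs its own hash into it, so a
match leaves zero (and sets ZF for the je). A small C sketch of that
check and of the byte budgets annotated above; the function names are
illustrative only:

```c
#include <stdint.h>

/*
 * Caller side:  movl $caller_hash, %r10d
 * Callee side:  xorl $callee_hash, %r10d ; je 1f ; ud2
 * Returns what is left in %r10d; zero means the hashes match.
 */
static uint32_t callee_xor_check(uint32_t caller_hash, uint32_t callee_hash)
{
	return caller_hash ^ callee_hash;
}

/* Bytes in the patched function preamble: must fill the 16-byte pad. */
static unsigned int ibt_preamble_size(void)
{
	return 4	/* endbr */
	     + 7	/* xorl $imm32, %r10d */
	     + 2	/* je */
	     + 2	/* ud2 */
	     + 1;	/* nop */
}

/* Bytes in the patched callsite, using the sizes annotated above. */
static unsigned int ibt_callsite_size(void)
{
	return 7	/* movl $imm32, %r10d */
	     + 3	/* call *%r11 */
	     + 7;	/* nop7 */
}
```

The callsite total equals the 17 bytes of the unpatched cmpl/je/ud2/call
sequence, so it can be rewritten in place. On a match the XOR also clears
%r10d, so the hash does not linger in the register.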


Yes, frobbing the address for static_call/ftrace/etc.. is a bit
horrible, but at least &sym remains exactly that address and not
something magical.
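
The address frobbing itself is small. A sketch of what "patch direct
calls to +16" amounts to for a 5-byte call rel32, with hypothetical
helper names and plain integers standing in for text addresses:

```c
#include <stdint.h>

#define FUNC_PADDING	16	/* bytes of endbr/nop padding at \func */
#define CALL_INSN_SIZE	5	/* opcode + rel32 */

/* rel32 is relative to the address of the instruction after the call. */
static int32_t call_rel32(uint64_t site, uint64_t target)
{
	return (int32_t)((int64_t)target - (int64_t)(site + CALL_INSN_SIZE));
}

/* Displacement for a direct call that skips the 16-byte padding. */
static int32_t patched_call_rel32(uint64_t site, uint64_t sym)
{
	return call_rel32(site, sym + FUNC_PADDING);
}
```

&sym itself is never changed; only the displacement stored at the call
site moves by 16.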

Note: It is possible to shift the __fentry__ call, but that would mean
that we lose alignment or get to carry .call_sites at runtime (and it
is *huge*).