Re: [RFC][PATCH 6/6] objtool: Add IBT validation / fixups
From: Peter Zijlstra
Date: Fri Feb 11 2022 - 08:38:24 EST
On Tue, Feb 08, 2022 at 09:18:44PM -0800, Joao Moreira wrote:
> > Ah, excellent, thanks for the pointers. There's also this in the works:
> > https://reviews.llvm.org/D119296 (a new CFI mode, designed to play nice
> > to objtool, IBT, etc.)
>
> Oh, great! Thanks for pointing it out. I guess I saw something with a
> similar name before ;) https://www.blackhat.com/docs/asia-17/materials/asia-17-Moreira-Drop-The-Rop-Fine-Grained-Control-Flow-Integrity-For-The-Linux-Kernel.pdf
>
> Jokes aside (and perhaps questions more targeted to Sami), from a diagonal
> look it seems that this follows the good old tag approach proposed by
> PaX/grsecurity, right? If this is the case, should I assume it could also
> benefit from features like -mibt-seal? Also are you considering that perhaps
> we can use alternatives to flip different CFI instrumentation as suggested
> by PeterZ in another thread?
So, let's try and recap things from IRC yesterday. There's a whole bunch
of things intertwining making indirect branches 'interesting'. Most of
which I've not seen mentioned in Sami's KCFI proposal which makes it
rather pointless.
I think we'll end up with something related to KCFI, but with distinct
differences:
- 32bit immediates for smaller code
- __kcfi_check_fail() is out for smaller code
- it must interact with IBT/BTI and retpolines
- we must be very careful with speculation.
Right, so !IBT-CFI needs the check at the call site, like:
caller:
	cmpl $0xdeadbeef, -0x4(%rax) # 7 bytes
	je 1f # 2 bytes
	ud2 # 2 bytes
1:	call __x86_indirect_thunk_rax # 5 bytes

	.align 16
	.byte 0xef, 0xbe, 0xad, 0xde # 4 bytes
func:
	...
	ret
While FineIBT has them at the landing site:
caller:
	movl $0xdeadbeef, %r11d # 6 bytes
	call __x86_indirect_thunk_rax # 5 bytes

	.align 16
func:
	endbr # 4 bytes
	cmpl $0xdeadbeef, %r11d # 7 bytes
	je 1f # 2 bytes
	ud2 # 2 bytes
1:	...
	ret
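The two placements above can be modelled in a few lines of Python. This is only a sketch: the Trap exception stands in for the #UD from ud2, the 32-byte image and the offset 16 are made up for illustration, and nothing here says anything about speculation.

```python
import struct

class Trap(Exception):
    """Stands in for the #UD raised by ud2."""

def caller_side_check(mem, target, expected):
    # Models: cmpl $expected, -0x4(%rax)
    # The 4-byte hash sits immediately before the function entry.
    (found,) = struct.unpack_from("<I", mem, target - 4)
    if found != expected:
        raise Trap("hash mismatch at call site")
    return target  # the indirect call would go here

def callee_side_check(callee_hash, r11d):
    # Models FineIBT: the caller put its expected hash in %r11d,
    # the callee compares it right after the endbr.
    if r11d != callee_hash:
        raise Trap("hash mismatch at landing site")

# Hypothetical text image: one function at offset 16 (.align 16),
# preceded by .byte 0xef, 0xbe, 0xad, 0xde.
mem = bytearray(32)
struct.pack_into("<I", mem, 12, 0xDEADBEEF)
```

Either way a mismatch ends in the trap; the difference is only which side carries the compare and which side carries the constant.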
It seems to me that always doing the check at the call-site is simpler,
since it avoids code-bloat and patching work. That is, if we allow both
we'll effectively blow up the code by 11 + 13 bytes (11 at the call site,
13 at function entry) as opposed to 11+4 or 6+13.
Which then yields:
caller:
	cmpl $0xdeadbeef, -0x4(%rax) # 7 bytes
	je 1f # 2 bytes
	ud2 # 2 bytes
1:	call __x86_indirect_thunk_rax # 5 bytes

	.align 16
	.byte 0xef, 0xbe, 0xad, 0xde # 4 bytes
func:
	endbr # 4 bytes
	...
	ret
For a combined 11+8 bytes overhead :/
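The 11+8 figure can be tallied from the per-instruction annotations in the listing. A small Python sketch, assuming the thunk call is not counted (it is there with or without CFI) and ignoring any .align padding:

```python
# Sizes in bytes, copied from the annotations in the listing above.
CALL_SITE = {"cmpl": 7, "je": 2, "ud2": 2}
CALLEE = {"hash (.byte)": 4, "endbr": 4}

def overhead(parts):
    """Sum the byte sizes of the listed pieces."""
    return sum(parts.values())

print(overhead(CALL_SITE), overhead(CALLEE))  # prints: 11 8
```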
Now, this setup provides:
- minimal size (please yell if there's a smaller option I missed;
s/ud2/int3/ ?)
- since the retpoline handles speculation from stuff before it, the
load-miss induced speculation is covered.
- the 'je' branch is binary, leading to either the retpoline or the
ud2, both of which are speculation stops.
- the ud2 is placed such that if the exception is non-fatal, code
execution can recover
- when IBT is present we can rewrite the thunk call to:
	lfence
	call *%rax
and rely on the WAIT-FOR-ENDBR speculation stop (also 5 bytes).
- can disable CFI by replacing the cmpl with:
	jmp 1f
(or an 11 byte nop, which is just about possible). And since we
already have all retpoline thunk callsites in a section, we can
trivially find all CFI bits that are always in front of them.
- function pointer sanity
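To make the 'disable CFI' point concrete, here is a rough Python sketch of the patch on the raw bytes. The encodings are the standard ones for the three instructions in the listing; the function name is made up, and a real implementation would go through the kernel's text patching rather than anything like this.

```python
# The 11-byte call-site check from the listing:
CHECK = bytes([
    0x81, 0x78, 0xFC, 0xEF, 0xBE, 0xAD, 0xDE,  # cmpl $0xdeadbeef, -0x4(%rax)
    0x74, 0x02,                                # je 1f
    0x0F, 0x0B,                                # ud2
])

def disable_check(code: bytes) -> bytes:
    """Overwrite the start of the cmpl with 'jmp 1f' (hypothetical helper)."""
    if len(code) != len(CHECK):
        raise ValueError("expected the 11-byte check sequence")
    rel8 = len(code) - 2  # jmp rel8 is 2 bytes, so skip the remaining 9
    return bytes([0xEB, rel8]) + code[2:]

patched = disable_check(CHECK)
```

The bytes after the jmp are never executed, so they can stay as they are; the size of the sequence does not change, which is what makes in-place patching possible.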
Additionally, if we ensure all direct calls are +4 and only indirect
calls hit the ENDBR -- as is optimal anyway, since it saves on decoding
ENDBR -- we can replace those ENDBR instructions of functions that should
never be indirectly called with:
	ud1 0x0(%rax),%eax
which is a 4 byte #UD. This gives us the property that even on !IBT
hardware such a call will go *splat*.
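A Python sketch of that replacement on a byte image, assuming 'endbr' means endbr64 (f3 0f 1e fa) since the listings use 64-bit registers. The function name, the list of entry offsets and the set of indirect targets are hypothetical inputs; how the toolchain or objtool would produce them is not covered here.

```python
ENDBR64 = bytes([0xF3, 0x0F, 0x1E, 0xFA])  # endbr64
UD1 = bytes([0x0F, 0xB9, 0x40, 0x00])      # ud1 0x0(%rax),%eax

def poison_endbr(text: bytearray, func_offsets, indirect_targets):
    """Replace ENDBR with ud1 at entries that must never be called indirectly."""
    for off in func_offsets:
        if off in indirect_targets:
            continue  # legitimately address-taken, keep the ENDBR
        if text[off:off + 4] == ENDBR64:
            text[off:off + 4] = UD1  # same length, so nothing moves
    return text
```

Direct calls landing at +4 skip the poisoned bytes entirely, so only an indirect branch to the entry ever executes the ud1.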
Further, Andrew put in the request for __attribute__((cfi_seed(blah)))
to allow distinguishing indirect functions with otherwise identical
signatures; e.g. cookie = hash32(blah##signature).
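A sketch of the seeded cookie in Python. hash32() is not specified anywhere above, so zlib.crc32 is used purely as a placeholder, and representing the signature as a string is likewise an assumption made for illustration.

```python
import zlib

def cookie(signature: str, seed: str = "") -> int:
    """cookie = hash32(seed ## signature), with crc32 standing in for hash32."""
    return zlib.crc32((seed + signature).encode()) & 0xFFFFFFFF

plain = cookie("int(void *)")
seeded = cookie("int(void *)", seed="blah")
```

Two functions with the same signature but different seeds then carry different 32-bit immediates, so a pointer to one no longer passes the check at a call site expecting the other.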
Did I miss anything? Got anything wrong?