RE: [patch 00/38] x86/retbleed: Call depth tracking mitigation

From: David Laight
Date: Sun Jul 17 2022 - 05:45:29 EST


From: Thomas Gleixner
> Sent: 17 July 2022 00:17
> Folks!
>
> Back in the good old spectre v2 days (2018) we decided to not use
> IBRS. In hindsight this might have been the wrong decision because it did
> not force people to come up with alternative approaches.
>
> It was already discussed back then to try software based call depth
> accounting and RSB stuffing on underflow for Intel SKL[-X] systems to avoid
> the insane overhead of IBRS.
>
> This has been tried in 2018 and was rejected due to the massive overhead
> and other shortcomings of the approach to put the accounting into each
> function prologue:
>
> 1) Text size increase which is inflicted on everyone. While CPUs are
> good at ignoring NOPs, they still pollute the I-cache.
>
> 2) The prologue accounting results in tail call over-accounting, which
> can be exploited.
>
> Disabling tail calls is not an option either and adding a 10 byte padding
> in front of every direct call is even worse in terms of text size and
> I-cache impact. We also could patch calls past the accounting in the
> function prologue but that becomes a nightmare vs. ENDBR.
>
> As IBRS is a performance horror show, Peter Zijlstra and I revisited the
> call depth tracking approach and implemented it in a way which is hopefully
> more palatable and avoids the downsides of the original attempt.
>
> We both unsurprisingly hate the result with a passion...
>
> The way we approached this is:
>
> 1) objtool creates a list of function entry points and a list of direct
> call sites into new sections which can be discarded after init.
>
> 2) On affected machines, use the new sections, allocate module memory
> and create a call thunk per function (16 bytes without
> debug/statistics). Then patch all direct calls to invoke the thunk,
> which does the call accounting and then jumps to the original call
> site.
>
> 3) Utilize the retbleed return thunk mechanism by making the jump
> target run-time configurable. Add the accounting counterpart and
> stuff RSB on underflow in that alternate implementation.

What happens to indirect calls?
The above implies that they miss the function entry thunk but
still hit the return thunk.
Won't this lead to mis-counting of the RSB depth?

I also thought that retpolines would trash the return stack?
Using a single retpoline thunk pretty much ensures that returns
through it are never correctly predicted from the BTB, but it also
means there is only a single BTB entry that needs 'setting up' to
get a mis-prediction.

I'm also fairly sure I once inferred, from a document on instruction
timings and architectures, that some x86 CPUs actually use the BTB
for normal conditional jumps as well, possibly to avoid passing the
full %ip value all the way down the CPU pipeline.

David
