Re: [PATCH RFC 4/4] x86/srso: Use CALL-based return thunks to reduce overhead

From: Nikolay Borisov
Date: Wed Aug 23 2023 - 02:08:26 EST

Next message: Michael Ellerman: "Re: linux-next: manual merge of the tty tree with the powerpc tree"
Previous message: Borislav Petkov: "Re: [PATCH 06/22] x86/srso: Print actual mitigation if requested mitigation isn't possible"
In reply to: Josh Poimboeuf: "Re: [PATCH RFC 4/4] x86/srso: Use CALL-based return thunks to reduce overhead"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 23.08.23 г. 1:18 ч., Josh Poimboeuf wrote:

On Tue, Aug 22, 2023 at 09:45:07AM +0300, Nikolay Borisov wrote:

On 22.08.23 г. 5:22 ч., Josh Poimboeuf wrote:

On Tue, Aug 22, 2023 at 12:01:29AM +0100, Andrew Cooper wrote:

On 21/08/2023 4:16 pm, Josh Poimboeuf wrote:

On Mon, Aug 21, 2023 at 12:27:23PM +0100, Andrew Cooper wrote:

The SRSO safety depends on having a CALL to an {ADD,LEA}/RET sequence which
has been made safe in the BTB. Specifically, there needs to be no perturbation
to the RAS between a correctly predicted CALL and the subsequent RET.

Use the new infrastructure to CALL to a return thunk. Remove
srso_fam1?_safe_ret() symbols and point srso_fam1?_return_thunk().

This removes one taken branch from every function return, which will reduce
the overhead of the mitigation. It also removes one of three moving pieces
from the SRSO mess.

So, the address of whatever instruction comes after the 'CALL
srso_*_return_thunk' is added to the RSB/RAS, and that might be
speculated to when the thunk returns. Is that a concern?

That is very intentional, and key to the safety.

Replacing a RET with a CALL/{ADD,LEA}/RET sequence is a form of
retpoline thunk. The only difference with regular retpolines is that
the intended target is already on the stack, and not in a GPR.

If the CALL mispredicts, it doesn't matter. When decode catches up
(allegedly either instantaneously on Fam19h, or a few cycles late on
Fam17h), the top of the RAS is corrected and will point at the INT3
following the CALL instruction.

That's the thing though, at least with my kernel/compiler combo there's
no INT3 after the JMP __x86_return_thunk, and there's no room to patch
one in after the CALL, as the JMP and CALL are both 5 bytes.
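
To make the size argument concrete, here is a minimal sketch (not taken
from the thread or from the kernel's alternatives code) of what rewriting
such a call site in place looks like at the byte level. The helper names
encode_rel32() and patch_jmp_to_call() are hypothetical; only the opcode
values (E9 for JMP rel32, E8 for CALL rel32, CC for INT3) are
architectural facts.

```c
#include <stddef.h>
#include <stdint.h>

enum {
	OPC_CALL_REL32 = 0xE8,	/* CALL rel32 */
	OPC_JMP_REL32  = 0xE9,	/* JMP rel32 */
	OPC_INT3       = 0xCC,	/* INT3, one byte */
	REL32_INSN_LEN = 5,	/* 1 opcode byte + 4 displacement bytes */
};

/*
 * Hypothetical helper: encode "opcode rel32" into buf as if the
 * instruction lived at address 'site' and targeted 'target'.
 */
static void encode_rel32(uint8_t *buf, uint8_t opcode,
			 uint64_t site, uint64_t target)
{
	/* The displacement is relative to the end of the instruction. */
	uint32_t rel = (uint32_t)(target - (site + REL32_INSN_LEN));

	buf[0] = opcode;
	buf[1] = (uint8_t)(rel & 0xff);		/* little-endian rel32 */
	buf[2] = (uint8_t)((rel >> 8) & 0xff);
	buf[3] = (uint8_t)((rel >> 16) & 0xff);
	buf[4] = (uint8_t)((rel >> 24) & 0xff);
}

/*
 * Hypothetical helper: turn a JMP rel32 into a CALL rel32 in place and
 * add a trailing INT3 only if the site has a spare byte.
 *
 * Returns 1 if the INT3 fit, 0 if there was no room for it, and -1 if
 * the site does not start with JMP rel32.
 */
static int patch_jmp_to_call(uint8_t *site_bytes, size_t site_len)
{
	if (site_len < REL32_INSN_LEN || site_bytes[0] != OPC_JMP_REL32)
		return -1;

	/* Same length, same rel32: only the opcode byte changes. */
	site_bytes[0] = OPC_CALL_REL32;

	if (site_len > REL32_INSN_LEN) {
		site_bytes[REL32_INSN_LEN] = OPC_INT3;
		return 1;
	}
	return 0;
}
```

With a 5-byte site, which is what the compiler emits today,
patch_jmp_to_call() returns 0: the CALL fits, the INT3 does not. It only
returns 1 if the compiler had reserved a sixth byte at the call site.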

FWIW gcc's -mfunction-return=thunk-extern only ever generates a jmp,
thunk/thunk-inline OTOH generates a "full fledged" thunk with all the
necessary speculation catching tricks.

For reference:

https://godbolt.org/z/M1avYc63b

The problem is the call-site, not the thunk. Ideally we'd have an
option which adds an INT3 after the 'JMP __x86_return_thunk'.

The way I see it, the int3/ud2 (or whatever) sequence belongs to the
thunk and not to the call site (what you said). However, Andrew's
solution depends on the call site effectively being part of the thunk.

It seems something like that has already been done for the indirect thunk, but not for the return thunk:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102952
