RE: [PATCH v3 05/10] x86/ibt: Optimize FineIBT sequence

From: Constable, Scott D
Date: Thu Feb 20 2025 - 13:29:31 EST


Hi Andrew,

I can elaborate, if only "a bit." Your intuition about branches is fairly accurate: on average, the difference between taken and not-taken branches should be marginal. To quote Intel's software optimization manual: "Conditional branches that are never taken do not consume BTB resources." There are also some more subtle reasons why not-taken branches can be preferable; these vary by microarchitecture.
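To make the BTB point concrete, here is a toy model. It is an illustration only and does not describe how any real Intel core is built. It assumes a hypothetical direct-mapped BTB: a lookup miss means fetch simply falls through, so an unknown branch behaves as not-taken, and only taken branches allocate an entry. The class name ToyBTB, the 8-entry size, and the modulo indexing are all assumptions made for the sketch. No separate direction predictor is modeled, so a BTB hit is treated as a taken prediction.

```python
# Toy model only (assumption): a tiny direct-mapped BTB. Real predictors are
# set-associative, use partial tags, and differ per microarchitecture.
BTB_ENTRIES = 8


class ToyBTB:
    def __init__(self, entries=BTB_ENTRIES):
        self.entries = entries
        self.table = {}        # index -> (branch_pc, target)
        self.allocations = 0   # how many times a new entry was written

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        """Return the predicted target, or None on a miss.

        On a miss the front end keeps fetching sequentially, so an
        unknown branch behaves as not-taken until it executes.
        """
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, taken, target):
        """Record a resolved branch; only taken branches allocate."""
        if not taken:
            return  # a never-taken branch consumes no BTB entry
        idx = self._index(pc)
        if self.table.get(idx, (None,))[0] != pc:
            self.allocations += 1  # new entry, possibly evicting an alias
        self.table[idx] = (pc, target)


btb = ToyBTB()
# 100 distinct conditional branches that are never taken.
for pc in range(0x1000, 0x1000 + 100):
    btb.update(pc, taken=False, target=pc + 0x40)
print(btb.allocations)  # prints 0

# Two taken branches whose addresses alias to the same index.
btb.update(0x2000, taken=True, target=0x3000)
btb.update(0x2000 + BTB_ENTRIES, taken=True, target=0x4000)
print(btb.predict(0x2000))  # prints None (evicted by the alias)
```

In this model, never-taken branches leave the BTB untouched no matter how many there are, while taken branches compete for entries and can evict each other through aliasing, which is the effect Andrew mentions in his original message.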

Regards,

Scott Constable

-----Original Message-----
From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
Sent: Wednesday, February 19, 2025 9:15 AM
To: Peter Zijlstra <peterz@xxxxxxxxxxxxx>; x86@xxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx; Milburn, Alyssa <alyssa.milburn@xxxxxxxxx>; Constable, Scott D <scott.d.constable@xxxxxxxxx>; joao@xxxxxxxxxxxxxxxxxx; jpoimboe@xxxxxxxxxx; jose.marchesi@xxxxxxxxxx; hjl.tools@xxxxxxxxx; ndesaulniers@xxxxxxxxxx; samitolvanen@xxxxxxxxxx; nathan@xxxxxxxxxx; ojeda@xxxxxxxxxx; kees@xxxxxxxxxx; alexei.starovoitov@xxxxxxxxx; mhiramat@xxxxxxxxxx; jmill@xxxxxxx
Subject: Re: [PATCH v3 05/10] x86/ibt: Optimize FineIBT sequence

On 19/02/2025 4:21 pm, Peter Zijlstra wrote:
> Scott notes that non-taken branches are faster. Abuse overlapping code
> that traps instead of explicit UD2 instructions.
>
> And LEA does not modify flags and will have fewer dependencies.
>
> Suggested-by: Scott Constable <scott.d.constable@xxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>

Can we get a bit more info on this "non-taken branches are faster"?

For modern cores which have branch prediction pre-decode, a branch unknown to the predictor will behave as non-taken until the Jcc executes[1].

Something the size of Linux is surely going to exceed the branch predictor's capacity, so it's perhaps fair to say that there's a reasonable chance of missing in the predictor.

But, for a branch known to the predictor, taken branches ought to be bubble-less these days. At least, this is what the marketing material claims.

And, this doesn't account for branches which alias in the predictor and end up with a wrong prediction.

~Andrew

[1] Yes, I know RWC (Redwood Cove) has the reintroduced 0x3e prefix with the decode resteer.