[PATCH] x86/kcfi: Optimize call sequence
From: Peter Zijlstra
Date: Fri Jun 12 2026 - 03:17:11 EST
As noted in commit 85a2d4a890dc ("x86,ibt: Use UDB instead of 0xEA"), a Jcc
should be assumed not-taken. However, the normal kCFI (ABI) caller sequence is:
	movl	$(-hash), %r10d
	addl	-15(%r11), %r10d
	je	1f
	ud2
1:	cs call	__x86_indirect_thunk_r11
(when used in conjunction with -mretpoline-external-thunk).
Notably, the Jcc here is taken on every valid (hash-matching) call, resulting
in lower throughput than would be ideal. Replace it with the following sequence
at boot:
	movl	$(-hash), %r10d
	addl	-15(%r11), %r10d
	jne	. + 3
	test	$0xd6, %al
	cs call	__x86_indirect_thunk_r11
On a hash mismatch, the jne jumps to the UDB opcode (0xd6) that serves as the
immediate byte of the test instruction; on a match, it falls through the test
into the call. The test instruction clobbers eflags, but that is immaterial
because the preceding addl has already changed them.
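To make the "jne . + 3" arithmetic concrete, here is a minimal userspace sketch
that decodes the four patched bytes and checks where the taken branch lands.
The byte values are the ones used for udne[] in this patch; the helper names
(jcc_rel8_target(), udne_taken_byte(), udne_fallthrough_offset()) are
hypothetical and exist only for illustration, they are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Tail of the optimized caller sequence, as encoded by this patch:
 *   75 01   jne . + 3        (rel8 = +1, counted from the next instruction)
 *   a8 d6   test $0xd6, %al  (the imm8 0xd6 doubles as the UDB opcode)
 */
static const uint8_t udne[] = { 0x75, 0x01, 0xa8, 0xd6 };

/* Hypothetical helper: offset a short Jcc at 'off' branches to when taken. */
static size_t jcc_rel8_target(const uint8_t *code, size_t off)
{
	int8_t rel = (int8_t)code[off + 1];

	/* A short Jcc is 2 bytes; rel8 is relative to the following instruction. */
	return off + 2 + rel;
}

/* Taken path (hash mismatch): the byte the CPU starts decoding at. */
static uint8_t udne_taken_byte(void)
{
	return udne[jcc_rel8_target(udne, 0)];
}

/* Not-taken path (hash match): skip the 2-byte jne and the 2-byte test. */
static size_t udne_fallthrough_offset(void)
{
	return 2 + 2;
}
```

With these bytes, jcc_rel8_target(udne, 0) is 3, so the taken path starts
decoding at the 0xd6 byte, while the not-taken path runs off the end of the
four bytes into the "cs call".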
Intel recommends the FineIBT sequence on platforms that support IBT; older
platforms are still widely used and would benefit from this.
An earlier PoC was benchmarked by Scott:
Indirect branch miss rate (br_misp_retired.indirect:k / br_inst_retired.indirect:k)
BHI_DIS_S=1
Benchmark             Baseline       IBT      kCFI  kCFI-opt
------------------------------------------------------------
iperf3 UDP            0.103764  0.103180  0.104311  0.102945
hackbench             0.000885  0.000876  0.001996  0.000826
lmbench syscall       0.005089  0.004486  0.016990  0.005852
lmbench fork+exit     0.018454  0.019176  0.031085  0.015153
lmbench fork+exec     0.017147  0.021613  0.029129  0.016337
redis                 0.032220  0.032655  0.045540  0.027946
nginx+wrk             0.109033  0.112765  0.132557  0.102417
fio randread          0.009704  0.009620  0.008548  0.000962
fio seqwrite          0.006927  0.006707  0.019372  0.004590
kbuild                0.056748  0.057324  0.064640  0.048136
BHI_DIS_S=0
Benchmark             Baseline       IBT      kCFI  kCFI-opt
------------------------------------------------------------
iperf3 UDP            0.000077  0.000106  0.000186  0.000073
hackbench             0.000123  0.000132  0.000367  0.000097
lmbench syscall       0.023259  0.018319  0.040903  0.012772
lmbench fork+exit     0.011494  0.011887  0.029079  0.016415
lmbench fork+exec     0.037782  0.038994  0.055378  0.026381
redis                 0.002481  0.003152  0.017073  0.000184
nginx+wrk             0.015478  0.016266  0.033637  0.000268
fio randread          0.009836  0.007949  0.007096  0.000143
fio seqwrite          0.014587  0.014165  0.041792  0.002157
kbuild                0.055774  0.055249  0.062590  0.046546
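
For readers who want to compare the kCFI and kCFI-opt columns directly, the
following sketch computes the miss rate as defined above and the fraction of
kCFI's miss rate that the optimized sequence removes. The struct and function
names are hypothetical; the three rows are copied from the BHI_DIS_S=1 results
and are only a sample, not a summary of the whole data set.

```c
#include <assert.h>

/* One benchmark row: indirect branch miss rate per configuration. */
struct miss_row {
	const char *name;
	double baseline, ibt, kcfi, kcfi_opt;
};

/* Sample rows copied from the BHI_DIS_S=1 results. */
static const struct miss_row bhi_dis_s_1[] = {
	{ "hackbench",       0.000885, 0.000876, 0.001996, 0.000826 },
	{ "lmbench syscall", 0.005089, 0.004486, 0.016990, 0.005852 },
	{ "kbuild",          0.056748, 0.057324, 0.064640, 0.048136 },
};

/* br_misp_retired.indirect:k / br_inst_retired.indirect:k */
static double miss_rate(unsigned long long mispredicted,
			unsigned long long retired)
{
	return retired ? (double)mispredicted / (double)retired : 0.0;
}

/* Fraction of the kCFI miss rate removed by kCFI-opt (1.0 means all of it). */
static double kcfi_opt_reduction(const struct miss_row *row)
{
	return 1.0 - row->kcfi_opt / row->kcfi;
}
```

For the hackbench row this gives 1 - 0.000826 / 0.001996, or roughly 0.59.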
Cc: Sami Tolvanen <samitolvanen@xxxxxxxxxx>
Cc: Kees Cook <kees@xxxxxxxxxx>
Cc: Nathan Chancellor <nathan@xxxxxxxxxx>
Cc: hpa@xxxxxxxxxx
Suggested-by: Scott D Constable <scott.d.constable@xxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
arch/x86/kernel/alternative.c | 11 ++++++++++-
arch/x86/kernel/cfi.c | 6 ++++++
2 files changed, 16 insertions(+), 1 deletion(-)
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1356,6 +1356,10 @@ early_param("cfi", cfi_parse_cmdline);
* "Make conditional jumps most often not taken: The efficiency and throughput
* for not-taken branches is better than for taken branches on most
* processors. Therefore, it is good to place the most frequent branch first"
+ *
+ * NOTE: Update the kCFI caller sequence to make use of this observation.
+ * Replace the "je 1f; ud2" sequence with "jne +1; test $0xd6, %al". This
+ * clobbers flags, but those are clobbered by the hash test anyway.
*/
/*
@@ -1518,9 +1522,10 @@ static int cfi_disable_callers(s32 *star
static int cfi_enable_callers(s32 *start, s32 *end)
{
/*
- * Re-enable kCFI, undo what cfi_disable_callers() did.
+ * Re-enable (and update) kCFI, undo what cfi_disable_callers() did.
*/
const u8 mov[] = { 0x41, 0xba };
+ const u8 udne[] = { 0x75, 0x01, 0xa8, 0xd6 };
s32 *s;
for (s = start; s < end; s++) {
@@ -1532,6 +1537,10 @@ static int cfi_enable_callers(s32 *start
if (!hash) /* nocfi callers */
continue;
+ /*
+ * See the kCFI/FineIBT comment above -- update note.
+ */
+ text_poke_early(addr + 10, udne, 4);
text_poke_early(addr, mov, 2);
}
--- a/arch/x86/kernel/cfi.c
+++ b/arch/x86/kernel/cfi.c
@@ -72,6 +72,12 @@ enum bug_trap_type handle_cfi_failure(st
switch (cfi_mode) {
case CFI_KCFI:
+ /*
+ * The updated kCFI sequence has "test $0xd6, %al" instead of
+ * "ud2", adjust the offset.
+ */
+ addr -= 1;
+
if (!is_cfi_trap(addr))
return BUG_TRAP_TYPE_NONE;
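
Not part of the patch: a userspace sketch of the offsets involved, for review
purposes only. It models the "addr + 10" poke from cfi_enable_callers() on a
plain byte buffer and shows why the trap address moves by one byte. poke() is
a memcpy() stand-in for text_poke_early(); kcfi_update_tail() and
kcfi_trap_offset() are hypothetical names. The check against the old tail
bytes is an addition of this sketch, the patch itself pokes unconditionally.
The reading of "addr -= 1" (mapping the UDB address at +13 back to +12, where
ud2 used to start) is an assumption about what is_cfi_trap() expects.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
	KCFI_MOV_LEN  = 6,	/* movl $(-hash), %r10d:  41 ba imm32 */
	KCFI_ADD_LEN  = 4,	/* addl -15(%r11), %r10d: 45 03 53 f1 */
	KCFI_TAIL_OFF = KCFI_MOV_LEN + KCFI_ADD_LEN,	/* the "addr + 10" */
	KCFI_TAIL_LEN = 4,
};

/* Userspace stand-in for text_poke_early(): plain memcpy on a buffer. */
static void poke(uint8_t *addr, const uint8_t *opcode, size_t len)
{
	memcpy(addr, opcode, len);
}

/* Rewrite "je 1f; ud2" into "jne . + 3; test $0xd6, %al" in place. */
static int kcfi_update_tail(uint8_t *site)
{
	static const uint8_t old_tail[] = { 0x74, 0x02, 0x0f, 0x0b };
	static const uint8_t udne[] = { 0x75, 0x01, 0xa8, 0xd6 };

	/* Sketch-only sanity check; not present in the patch. */
	if (memcmp(site + KCFI_TAIL_OFF, old_tail, KCFI_TAIL_LEN))
		return -1;

	poke(site + KCFI_TAIL_OFF, udne, KCFI_TAIL_LEN);
	return 0;
}

/* Offset of the trapping opcode relative to the start of the sequence. */
static size_t kcfi_trap_offset(const uint8_t *site)
{
	/* ud2 starts at +12; UDB is the imm8 of the test, at +13. */
	if (site[KCFI_TAIL_OFF + 2] == 0xa8)
		return KCFI_TAIL_OFF + 3;
	return KCFI_TAIL_OFF + 2;
}
```

On a buffer holding the compiler-emitted sequence, kcfi_trap_offset() is 12
before the update and 13 after it, which is the one-byte difference the cfi.c
hunk compensates for.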