Re: [RFC PATCH 06/11] x86: make sure _etext includes function sections

From: Jann Horn
Date: Thu Feb 06 2020 - 08:16:00 EST


On Thu, Feb 6, 2020 at 1:26 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> I know x86_64 stack alignment is 16 bytes.

That's true for the standard sysv ABI that is used in userspace; but
the kernel uses a custom ABI with 8-byte stack alignment. See
arch/x86/Makefile:

# For gcc stack alignment is specified with -mpreferred-stack-boundary,
# clang has the option -mstack-alignment for that purpose.
ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
cc_stack_align4 := -mpreferred-stack-boundary=2
cc_stack_align8 := -mpreferred-stack-boundary=3
else ifneq ($(call cc-option, -mstack-alignment=16),)
cc_stack_align4 := -mstack-alignment=4
cc_stack_align8 := -mstack-alignment=8
endif
[...]
# By default gcc and clang use a stack alignment of 16 bytes for x86.
# However the standard kernel entry on x86-64 leaves the stack on an
# 8-byte boundary. If the compiler isn't informed about the actual
# alignment it will generate extra alignment instructions for the
# default alignment which keep the stack *mis*aligned.
# Furthermore an alignment to the register width reduces stack usage
# and the number of alignment instructions.
KBUILD_CFLAGS += $(call cc-option,$(cc_stack_align8))

> I cannot find evidence for
> what function start alignment should be.

There is no architecturally required alignment for functions, but
Intel's Optimization Manual
(<https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf>)
recommends in section 3.4.1.5, "Code Alignment":

| Assembly/Compiler Coding Rule 12. (M impact, H generality)
| All branch targets should be 16-byte aligned.

AFAIK this is recommended because, as documented in section 2.3.2.1,
"Legacy Decode Pipeline" (describing the frontend of Sandy Bridge, and
used as the base for newer microarchitectures):

| An instruction fetch is a 16-byte aligned lookup through the ITLB
and into the instruction cache.
| The instruction cache can deliver every cycle 16 bytes to the
instruction pre-decoder.

AFAIK this means that if a branch ends close to the end of a 16-byte
block, the frontend is less efficient because it may have to run two
instruction fetches before the first instruction can even be decoded.