Re: [PATCH] x86: Align jump targets to 1 byte boundaries

From: Denys Vlasenko
Date: Fri Apr 10 2015 - 10:54:42 EST


On 04/10/2015 04:01 PM, Borislav Petkov wrote:
> On Fri, Apr 10, 2015 at 03:54:57PM +0200, Denys Vlasenko wrote:
>> On 04/10/2015 03:19 PM, Borislav Petkov wrote:
>>> On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
>>>> Now, the usual justification for jump target alignment is the
>>>> following: with 16 byte instruction-cache cacheline sizes, if a
>>>
>>> You mean 64 bytes?
>>>
>>> Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably
>>> some branch predictor stride thing.
>>
>> IIRC it's a maximum decode bandwidth. Decoders on the most powerful
>> x86 CPUs, both Intel and AMD, attempt to decode in one cycle
>> up to four instructions. For this they fetch up to 16 bytes.
>
> 32 bytes fetch window per cycle for AMD F15h and F16h, see my other
> mail. And Intel probably do the same.

There are people who experimentally researched this.
According to this guy:

http://www.agner.org/optimize/microarchitecture.pdf

Intel CPUs can decode only up to 16 bytes at a time
(but they have loop buffers, and some have a uop cache,
which can skip decoding entirely).
AMD CPUs can decode 21 bytes at best. With two cores active,
only 16 bytes.
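
To make the alignment argument concrete, here is a rough sketch in C of
the arithmetic involved. It assumes fetch happens in aligned windows of
a fixed power-of-two size (16 bytes here), which is a simplification of
what real hardware does. The helper name is made up for illustration;
it is not from the kernel or from the manual. A jump target that lands
k bytes into a window leaves only 16 - k bytes for the decoders in the
first cycle after the jump.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes between 'target' and the end of
 * the aligned fetch window that contains it.
 * Assumes 'window' is a power of two (e.g. 16).
 */
static inline unsigned int bytes_in_first_fetch(uintptr_t target,
						unsigned int window)
{
	return window - (unsigned int)(target & (window - 1));
}
```

Under that assumption, a target at 0x1000 gets the full 16 bytes, while
a target at 0x100d gets only 3.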


"""
10 Haswell pipeline
...
10.1 Pipeline
The pipeline is similar to previous designs, but improved with more of everything. It is
designed for a throughput of four instructions per clock cycle.
Each core has a reorder buffer with 192 entries, the reservation station has 60 entries, and
the register file has 168 integer registers and 168 vector registers, according to the literature
listed on page 145 below.
All parts of the pipeline are shared between two threads in those CPU models that can run
two threads in each core. Each thread gets half of the total throughput when two threads are
running in the same core.

10.2 Instruction fetch and decoding
The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle in single
threaded applications.
There are four decoders, which can handle instructions generating up to four μops per clock
cycle in the way described on page 120 for Sandy Bridge.
Instructions with any number of prefixes are decoded in a single clock cycle. There is no
penalty for redundant prefixes.

...
...

15 AMD Bulldozer, Piledriver and Steamroller pipeline
15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
...
15.2 Instruction fetch
The instruction fetcher is shared between the two cores of an execution unit. The instruction
fetcher can fetch 32 aligned bytes of code per clock cycle from the level-1 code cache. The
measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
lower than these maximum values when instructions are misaligned.
Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
the first four instructions after a critical label.
"""
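
The last rule quoted above ("no 16-bytes boundary in the first four
instructions after a critical label") can be written down as a small
check. This is only a sketch: the function name and the insn_len array
are hypothetical, and the instruction lengths would have to come from a
disassembler or from the assembler's listing. It reads the rule as "the
first n instructions all sit inside one aligned block".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: returns true if the first 'n' instructions
 * starting at 'label' all lie within a single aligned block of
 * 'boundary' bytes, i.e. no boundary falls inside them.
 * 'insn_len[i]' is the length in bytes of the i-th instruction.
 */
static bool first_insns_avoid_boundary(uintptr_t label,
				       const uint8_t *insn_len,
				       size_t n, unsigned int boundary)
{
	uintptr_t end = label;
	size_t i;

	for (i = 0; i < n; i++)
		end += insn_len[i];

	if (end == label)
		return true;	/* nothing to check */

	/* First and last byte of the span must share one block. */
	return (label / boundary) == ((end - 1) / boundary);
}
```

For example, four instructions of 3, 2, 5 and 4 bytes (14 bytes total)
pass the check at 0x1000 but fail it at 0x1008, where the span runs
past the boundary at 0x1010.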