Re: <linux/linkage.h> generates incorrect cache alignments for 486 and above

From: Chris Sears (cbsears@ix.netcom.com)
Date: Thu Jan 27 2000 - 14:45:58 EST


Mike,

>> The 16 should be 32 as cache.h points out.
>
> I don't think so. (but am not positive)

The value should be 4, 16 or the cacheline size. From Agner Fog's
How To Optimize For The Pentium:

        Instruction Fetch (PPro and PII)

        The code is fetched in aligned 16-bytes chunks from the code
        cache and placed in the double buffer, which is called so
        because it can contain two such chunks. The code is then taken
        from the double buffer and fed to the decoders in blocks which
        are usually 16 bytes long, but not necessarily aligned by 16.
        I will call these blocks ifetch blocks (instruction fetch
        blocks). If an ifetch block crosses a 16 byte boundary in the
        code then it needs to take from both chunks in the double
        buffer. So the purpose of the double buffer is to allow
        instruction fetching across 16 byte boundaries.
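
To make the boundary-crossing condition in that passage concrete, here
is a minimal C sketch. It is only an illustration of the arithmetic,
not anything from the kernel or from Agner Fog's text: the function name
crosses_chunk and the CHUNK_SIZE macro are made up for this example, and
the only fact taken from the quote is the 16-byte chunk size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the aligned chunks fetched from the code cache,
 * as given in the quoted passage. */
#define CHUNK_SIZE 16u

/* Hypothetical helper: returns true if a block of len bytes starting
 * at addr spans more than one aligned 16-byte chunk, i.e. it would
 * need both halves of the double buffer. */
bool crosses_chunk(uint32_t addr, uint32_t len)
{
    assert(len > 0);
    uint32_t first = addr / CHUNK_SIZE;             /* chunk holding the first byte */
    uint32_t last  = (addr + len - 1) / CHUNK_SIZE; /* chunk holding the last byte */
    return first != last;
}
```

A 16-byte block starting at 0x100 stays inside one chunk, while the
same block starting at 0xf8 straddles the boundary at 0x100.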

So this is where the 16 comes from. These ifetch blocks still get
fetched from the cache. Ya wanna see a bad example? This is from entry.S:

ENTRY(system_call)
        ...
        movl %eax,EAX(%esp) # save the return value
        ALIGN
        .globl ret_from_sys_call
        .globl ret_from_intr
ret_from_sys_call:
        movl SYMBOL_NAME(bh_mask),%eax
        ...

This is the output from readelf -i 1 entry.o

        0x000000f4 movl %eax,0x18(%esp)
        0x000000f8 nop
        0x000000f9 leal 0x0(%esi),%esi
        0x00000100 movl 0x00000000,%eax
        0x00000105 andl 0x00000000,%eax

What is this "leal" junk? That there is one very large nop.
The price of the alignment is two nop instructions.
If ALIGN were set to the cache line size the result would be the same,
because the distance from system_call to the ALIGN is already very
nearly two cache lines.
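
The padding arithmetic behind that listing can be sketched in C. This
is an illustration only: align_padding is a hypothetical helper, not
something the assembler or the kernel provides, and it assumes the
alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of padding bytes needed to bring addr
 * up to the next multiple of align. Assumes align is a power of two. */
uint32_t align_padding(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* addr & (align - 1) is the offset into the current aligned unit;
     * the outer mask turns a full unit of padding into zero. */
    return (align - (addr & (align - 1))) & (align - 1);
}
```

The movl at 0xf4 is 4 bytes long, so the ALIGN starts at 0xf8. Padding
to 16 takes 8 bytes (the 1-byte nop plus the 7-byte leal), padding to
32 also takes 8 bytes, and padding to 4 takes none.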

This one is a tough call: two nops in the straight-line code versus
misalignment in the shared code. OK, effectively one nop, since the two
would be paired in the U and V pipelines. Probably leave it be.

> I don't know what the pIII prefers, but I did 5 test runs with lmbench
> today on my p5 with 0,4,16 and 2 additionals with different alignment
> CFLAGS. The difference in lmbench numbers was ~nil for all 5 runs.
>
> The only difference at all (gcc-2.95-2) was that -malign-[foo]=0, gave
> me a 48k ram rebate. My p5 doesn't care about __ALIGN.. and I get free
> ram by generally not aligning. :-)
>
> I haven't tested my pIII yet to see if it cares, and if so, what it
> prefers. Someone else probably has.

You are talking about gcc alignment, which is different.
gcc alignment probably isn't worthwhile: your tests seem to prove this.

If you look at the assembly output you will see .align 32 everywhere.
I can definitely argue for alignment of entry points, but that doesn't
seem to be a gcc option. I wish it were. Alignment everywhere
for everything can oversubscribe the cache.

But that is alignment of an entire program (linux) versus alignment
of a few frequently used assembly routines and data. Some of these
routines run with interrupts disabled. This isn't the biggest thing
in the world but it is worth doing right.

Chris Sears
cbsears@ix.netcom.com
