Re: [PATCH 1/3] x86-64: Convert the PDA to percpu.

From: Brian Gerst
Date: Sat Dec 27 2008 - 12:16:49 EST


On Sat, Dec 27, 2008 at 10:53 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Brian Gerst <brgerst@xxxxxxxxx> wrote:
>
>> On Sat, Dec 27, 2008 at 5:41 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
>> >
>> > (Cc:-ed a few more people who might be interested in this)
>> >
>> > * Brian Gerst <brgerst@xxxxxxxxx> wrote:
>> >
>> >> This patch makes the PDA a normal per-cpu variable, allowing the
>> >> removal of the special allocator code. %gs still points to the
>> >> base of the PDA.
>> >>
>> >> Tested on a dual-core AMD64 system.
>> >>
>> >> Signed-off-by: Brian Gerst <brgerst@xxxxxxxxx>
>> >> ---
>> >> arch/x86/include/asm/pda.h | 3 --
>> >> arch/x86/include/asm/percpu.h | 3 --
>> >> arch/x86/include/asm/setup.h | 1 -
>> >> arch/x86/kernel/cpu/common.c | 6 ++--
>> >> arch/x86/kernel/dumpstack_64.c | 8 ++--
>> >> arch/x86/kernel/head64.c | 23 +------------
>> >> arch/x86/kernel/irq.c | 2 +-
>> >> arch/x86/kernel/nmi.c | 2 +-
>> >> arch/x86/kernel/setup_percpu.c | 70 ++++++++--------------------------------
>> >> arch/x86/kernel/smpboot.c | 58 +--------------------------------
>> >> arch/x86/xen/enlighten.c | 2 +-
>> >> arch/x86/xen/smp.c | 12 +------
>> >> 12 files changed, 27 insertions(+), 163 deletions(-)
>> >
>> > the simplification factor is significant. I'm wondering, have you measured
>> > the code size impact of this on say the defconfig x86 kernel? That will
>> > generally tell us how much worse optimizations the compiler does under
>> > this scheme.
>> >
>> > Ingo
>> >
>>
>> Patch #1 by itself doesn't change how the PDA is accessed, only how it
>> is allocated. The text size goes down significantly with patch #1,
>> but data goes up. Changing the PDA to cacheline-aligned (1a) brings
>> it back in line.
>>
>> text data bss dec hex filename
>> 7033648 1754476 758508 9546632 91ab88 vmlinux.0 (vanilla 2.6.28)
>> 7029563 1758428 758508 9546499 91ab03 vmlinux.1 (with patch #1)
>> 7029563 1754460 758508 9542531 919b83 vmlinux.1a (with patch #1 cache align)
>> 7036694 1758428 758508 9553630 91c6de vmlinux.3 (with all three patches)
>>
>> I think the first patch (with the alignment fix) is a clear win. As for
>> the other patches, they add about 8 bytes per use of a PDA variable.
>> cpu_number is used 903 times in this compile, so this is likely the most
>> extreme example. I have an idea to optimize this particular case
>> further that I'd like to look at which would lessen the impact.
>
> curious, what idea is that?
>
> Ingo
>

Something like this:
+#define raw_smp_processor_id() \
+({ \
+ extern int gsoff__cpu_number; \
+ int cpu; \
+ __asm__("movl %%gs:%1, %0" : "=r" (cpu) \
+ : "m" (gsoff__cpu_number); \
+ cpu; \
+})

And add this to vmlinux_64.lds.S:
+#define GSOFF(x) gsoff__##x = per_cpu__##x - per_cpu__pda
+ GSOFF(cpu_number);

The trick is that the linker can calculate against multiple symbols,
but it must be done in the final link. The problem with this approach
is that only a limited set of symbols can be used. There isn't a
simple solution for all per-cpu variables. Some post-processing would
have to be done, similar to kallsyms.

Looking some more at the usage statistics of the PDA members, there
are four heavy hitters:
pda->pcurrent (2719)
pda->kernelstack (1055)
pda->cpunumber (933)
pda->data_offset (327)

The rest of the PDA members have an insignificant number of accesses.
I think for now I'll avoid converting the above four fields until an
optimal solution can be agreed on, but the others (primarily the TLB
and irqstat fields) can be converted without bloating the kernel code
alot.

--
Brian Gerst
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/