Re: [patch] x86: optimize __pa() to be linear again on 64-bit x86

From: Nick Piggin
Date: Mon Feb 23 2009 - 09:09:52 EST


On Tuesday 24 February 2009 00:38:04 Ingo Molnar wrote:
> * Ingo Molnar <mingo@xxxxxxx> wrote:
> > > Are __pa()/__va() that hot paths? Or am I over-estimating
> > > the cost of 2MB dTLB?
> >
> > yes, __pa()/__va() is a very hot path - in a defconfig they
> > are used in about a thousand different places.
> >
> > In fact it would be nice to get rid of the __phys_addr()
> > redirection on the 64-bit side (which is non-linear and a
> > function there, and all __pa()s go through it) and make it a
> > constant offset again.
> >
> > This isn't trivial/possible to do though as .data/.bss is in
> > the high alias. (high .text aliases alone wouldn't be a big
> > issue to fix, but the data aliases are an issue.)
> >
> > Moving .data/.bss into the linear space isn't feasible as we'd
> > lose RIP-relative addressing shortcuts.
> >
> > Maybe we could figure out the places that do __pa() on a high
> > alias and gradually eliminate them. __pa() on .data/.bss is a
> > rare and unusual thing to do, and CONFIG_DEBUG_VIRTUAL could
> > warn about them without crashing the kernel.
> >
> > Later on we could make this check unconditional, and then
> > switch over __pa() to addr-PAGE_OFFSET in the
> > !CONFIG_DEBUG_VIRTUAL case (which is the default).
>
> Ok, i couldn't resist: using ftrace_printk() (a regular printk
> in __pa() would hang during bootup) i came up with the patch
> below - which allows the second patch below that does:
>
> -#define __pa(x) __phys_addr((unsigned long)(x))
> +#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
>
> It cuts a nice (and hotly executed) ~650 bytes chunk out of the
> x86 64-bit defconfig kernel text:
>
>    text    data     bss     dec     hex filename
> 7999071 1137780  843672 9980523  984a6b vmlinux.before
> 7998414 1137780  843672 9979866  9847da vmlinux.after
>
> And it even boots.
>
> (the load_cr3() hack needs to be changed, by setting the init
> pgdir from init_level4_pgt to __va(__pa_symbol(init_level4_pgt)).)
>
> (32-bit is untested and likely won't even build.)
>
> It's not even that bad and looks quite maintainable as a
> concept.
>
> This also means that __va() and __pa() will both be the same
> kind of simple arithmetic again on both 32-bit and 64-bit
> kernels.
>
> Ingo
>
> ---
> arch/x86/include/asm/page.h | 4 +++-
> arch/x86/include/asm/page_64_types.h | 1 +
> arch/x86/include/asm/pgalloc.h | 4 ++--
> arch/x86/include/asm/pgtable.h | 2 +-
> arch/x86/include/asm/processor.h | 7 ++++++-
> arch/x86/kernel/setup.c | 12 ++++++------
> arch/x86/mm/init_64.c | 6 +++---
> arch/x86/mm/ioremap.c | 12 +++++++++++-
> arch/x86/mm/pageattr.c | 28 ++++++++++++++--------------
> arch/x86/mm/pgtable.c | 2 +-
> 10 files changed, 48 insertions(+), 30 deletions(-)
>
> Index: linux/arch/x86/include/asm/page.h
> ===================================================================
> --- linux.orig/arch/x86/include/asm/page.h
> +++ linux/arch/x86/include/asm/page.h
> @@ -34,10 +34,11 @@ static inline void copy_user_page(void *
> #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
>
> #define __pa(x) __phys_addr((unsigned long)(x))
> +#define __pa_slow(x) __phys_addr_slow((unsigned long)(x))
> #define __pa_nodebug(x) __phys_addr_nodebug((unsigned long)(x))
> /* __pa_symbol should be used for C visible symbols.
> This seems to be the official gcc blessed way to do such arithmetic. */
> -#define __pa_symbol(x) __pa(__phys_reloc_hide((unsigned long)(x)))
> +#define __pa_symbol(x) __pa_slow(__phys_reloc_hide((unsigned long)(x)))
>
> #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
>
> @@ -49,6 +50,7 @@ static inline void copy_user_page(void *
> * virt_addr_valid(kaddr) returns true.
> */
> #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
> +#define virt_to_page_slow(kaddr) pfn_to_page(__pa_slow(kaddr) >>

Heh. I have almost the exact opposite patch, which adds a virt_to_page_fast
and uses it in critical places (in the slab allocator).

But if you can do this more complete conversion, cool. Yes, __pa is very
performance critical (not just for code size). The time to alloc+free an
object in the slab allocator is on the order of 100 cycles, so saving a few
cycles here == saving a few %. (Although, having said that, you hardly ever
see a workload where the slab allocator is that prominent.)
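
To make the difference concrete, here is a minimal user-space sketch (not
kernel code) of the two translation styles discussed above: a branchy,
non-linear lookup in the spirit of __phys_addr(), and the constant-offset
form that the patch switches __pa() to. The names pa_nonlinear(),
pa_linear(), va_linear() and sketch_phys_base are made up for illustration,
and the two address constants are assumed example values modelled on the
64-bit layout, not taken from the patch. The real __phys_addr() also has
debug checks that are left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed, illustrative layout constants (not taken from the patch). */
#define SKETCH_PAGE_OFFSET      0xffff880000000000ULL /* start of the linear map */
#define SKETCH_START_KERNEL_MAP 0xffffffff80000000ULL /* start of the high alias */

/* Hypothetical load offset of the kernel image in physical memory. */
static const uint64_t sketch_phys_base = 0;

/*
 * Non-linear translation: one compare and branch per call, because
 * addresses in the high kernel-image alias use a different offset
 * than addresses in the linear map.
 */
static uint64_t pa_nonlinear(uint64_t vaddr)
{
	if (vaddr >= SKETCH_START_KERNEL_MAP)
		return vaddr - SKETCH_START_KERNEL_MAP + sketch_phys_base;
	return vaddr - SKETCH_PAGE_OFFSET;
}

/*
 * Linear translation: a single subtraction. Only valid for addresses
 * in the linear map; a high-alias address gives a wrong result, which
 * is why such callers need a separate slow path.
 */
static uint64_t pa_linear(uint64_t vaddr)
{
	return vaddr - SKETCH_PAGE_OFFSET;
}

/* Inverse of pa_linear(): a single addition. */
static uint64_t va_linear(uint64_t paddr)
{
	return paddr + SKETCH_PAGE_OFFSET;
}

/* Small self-check tying the three helpers together. */
static void sketch_self_check(void)
{
	uint64_t v = SKETCH_PAGE_OFFSET + 0x1000;

	assert(pa_linear(v) == pa_nonlinear(v));
	assert(va_linear(pa_linear(v)) == v);
}
```

For a linear-map address both helpers agree, so the fast form is safe
there. For a high-alias address (a symbol in .data/.bss, say) only the
non-linear form gives the right answer, which matches the split between
__pa() and __pa_slow()/__pa_symbol() in the quoted patch.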

