Re: [PATCH] mm/sparse: Optimize section number calculations using bit shifts
From: David Hildenbrand (Arm)
Date: Tue Jun 16 2026 - 04:07:21 EST
On 6/16/26 04:59, Zhen Ni wrote:
> Add SECTIONS_PER_ROOT_SHIFT = ilog2(SECTIONS_PER_ROOT) with correctness
> guaranteed by BUILD_BUG_ON in sparse_init(). Convert SECTION_NR_TO_ROOT
> to use right shift instead of division for better performance. Add
> SECTION_NR_IN_ROOT() macro to improve code readability.
>
> This improves code efficiency in hot paths where __nr_to_section() is
> frequently called, such as sparse_init() and memory section management
> operations.
>
> Performance verification in sparse_init() on ARM (8GB RAM, 4 NUMA nodes):
>
> sparse_init()
> |
> +----> memblocks_present()
> |
> +----> section initialization (sparse_init_nid loop)
>
> Time measurement points:
>
> [T1] sparse_init start
> |
> v
> [T2] memblocks_present() complete
> |
> v
> [T3] sparse_init_nid() loop complete / sparse_init end
>
> Measurement values:
> memblocks_present_cycles = T2 - T1
> section_initialization_cycles = T3 - T2
> total_cycles = T3 - T1
>
> Before (division):
> [ 0.000000] sparse_init: total 7538 cycles
> [ 0.000000] memblocks_present: 4232 cycles
> [ 0.000000] section initialization: 3261 cycles
>
> After (bit shift):
> [ 0.000000] sparse_init: total 5641 cycles
> [ 0.000000] memblocks_present: 3562 cycles
> [ 0.000000] section initialization: 2057 cycles
>
> Performance improvement:
> Total: (7538-5641)/7538 = 25.2% faster
> memblocks_present: (4232-3562)/4232 = 15.8% faster
> section initialization: (3261-2057)/3261 = 36.9% faster
>
> Signed-off-by: Zhen Ni <zhen.ni@xxxxxxxxxxxx>
> ---
> include/linux/mmzone.h | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9adb2ad21da5..5daf471f6823 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -2035,11 +2035,14 @@ struct mem_section {
>
> #ifdef CONFIG_SPARSEMEM_EXTREME
> #define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof (struct mem_section))
> +#define SECTIONS_PER_ROOT_SHIFT ilog2(SECTIONS_PER_ROOT)
> #else
> #define SECTIONS_PER_ROOT 1
> +#define SECTIONS_PER_ROOT_SHIFT 0
> #endif
>
> -#define SECTION_NR_TO_ROOT(sec) ((sec) / SECTIONS_PER_ROOT)
> +#define SECTION_NR_TO_ROOT(sec) ((sec) >> SECTIONS_PER_ROOT_SHIFT)
> +#define SECTION_NR_IN_ROOT(sec) ((sec) & SECTION_ROOT_MASK)
> #define NR_SECTION_ROOTS DIV_ROUND_UP(NR_MEM_SECTIONS, SECTIONS_PER_ROOT)
> #define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1)
>
>From a compiler POV, "/ SECTIONS_PER_ROOT" is exactly the same as >>
ilog2(SECTIONS_PER_ROOT) *as long as* the variable we are processing is
"unsigned long".
The compiler should be smart enough to figure out that out with
SECTIONS_PER_ROOT being known at compiletime.
Can you compare the generated code __nr_to_section() to see if and why the
compiler fails to optimize that?
--
Cheers,
David