Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests

From: Jesper Dangaard Brouer
Date: Wed Jan 11 2017 - 08:27:23 EST


On Wed, 11 Jan 2017 13:44:20 +0100
Jesper Dangaard Brouer <brouer@xxxxxxxxxx> wrote:

> On Mon, 9 Jan 2017 16:35:17 +0000 Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > The following is results from a page allocator micro-benchmark. Only
> > order-0 is interesting as higher orders do not use the per-cpu allocator
>
> Micro-benchmarked with [1] page_bench02:
> modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
> rmmod page_bench02 ; dmesg --notime | tail -n 4
>
> Compared to baseline: 213 cycles(tsc) 53.417 ns
> - against this : 184 cycles(tsc) 46.056 ns
> - Saving : -29 cycles
> - Very close to expected 27 cycles saving [see below [2]]

When profiling this benchmark with perf, I noticed that the summed
"children" overhead under alloc_pages_current() is 65.05%, compared to
a summed 28.28% for the free path, i.e. the calls under __free_pages().

This is caused by CONFIG_NUMA=y, as the call path is longer with NUMA
(and other helpers are also non-inlined calls):

alloc_pages
-> alloc_pages_current
-> __alloc_pages_nodemask
-> get_page_from_freelist

Without NUMA the call levels get compacted by inlining to:

__alloc_pages_nodemask
-> get_page_from_freelist

After disabling NUMA, the split between the alloc (48.80%) and free
(42.67%) sides is more balanced.

Saving from disabling CONFIG_NUMA:
- CONFIG_NUMA=y : 184 cycles(tsc) 46.056 ns
- CONFIG_NUMA=n : 143 cycles(tsc) 35.913 ns
- Saving : 41 cycles (approx 22%)
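
As a sanity check on the arithmetic, here is a minimal shell sketch that
recomputes the saving in cycles and as a percentage of the starting cost.
The saving() helper is hypothetical and not part of the benchmark modules;
the input numbers are the ones reported in this thread.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the benchmark repo):
# print the saving between two per-element costs given in cycles.
saving() {
    # $1 = cost before (cycles), $2 = cost after (cycles)
    awk -v before="$1" -v after="$2" 'BEGIN {
        delta = before - after
        # Percentage is relative to the "before" cost.
        printf "%d cycles (%.1f%%)\n", delta, 100 * delta / before
    }'
}

saving 184 143   # CONFIG_NUMA=y vs CONFIG_NUMA=n -> 41 cycles (22.3%)
saving 213 184   # baseline vs Mel's patch        -> 29 cycles (13.6%)
```

The percentage is taken relative to the slower configuration, which is
how the "approx 22%" figure is obtained.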

I would conclude that there is room for improvement in the CONFIG_NUMA
code path. Let's follow up on that in a later patch series...


> > Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> > Acked-by: Hillf Danton <hillf.zj@xxxxxxxxxxxxxxx>
>
> Acked-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
>
> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> -
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
>
> [2] Expected saving comes from Mel removing a local_irq_{save,restore}
> and adding a preempt_{disable,enable} instead.
>
> Micro benchmarking via time_bench_sample[3], we get the cost of these
> operations:
>
> time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.232 ns (step:0)
> time_bench: Type:spin_lock_unlock Per elem: 33 cycles(tsc) 8.334 ns (step:0)
> time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
> time_bench: Type:irqsave_before_lock Per elem: 57 cycles(tsc) 14.344 ns (step:0)
> time_bench: Type:spin_lock_unlock_irq Per elem: 34 cycles(tsc) 8.560 ns (step:0)
> time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
> time_bench: Type:local_BH_disable_enable Per elem: 19 cycles(tsc) 4.920 ns (step:0)
> time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
> time_bench: Type:local_irq_save_restore Per elem: 38 cycles(tsc) 9.665 ns (step:0)
> [Mel's patch removes a local_irq_save_restore: 38 cycles is the expected saving, minus the preempt cost]
> time_bench: Type:preempt_disable_enable Per elem: 11 cycles(tsc) 2.794 ns (step:0)
> [and adds a preempt_disable_enable: 11 cycles is the added cost]
> time_bench: Type:funcion_call_cost Per elem: 6 cycles(tsc) 1.689 ns (step:0)
> time_bench: Type:func_ptr_call_cost Per elem: 11 cycles(tsc) 2.767 ns (step:0)
> time_bench: Type:page_alloc_put Per elem: 211 cycles(tsc) 52.803 ns (step:0)
>
> Thus, expected improvement is: 38-11 = 27 cycles.
>
> [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
>
> CPU used: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>
> Config options of interest:
> CONFIG_NUMA=y
> CONFIG_DEBUG_LIST=n
> CONFIG_VM_EVENT_COUNTERS=y



--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer