Re: [RFC 0/6] the big khugepaged redesign
From: Andres Freund
Date: Thu Mar 05 2015 - 19:21:25 EST
Long mail ahead, sorry for that.
TL;DR: THP overhead is still noticeable, but not nearly as bad as it used to be.
On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
> That however means the workload is based on hugetlbfs and shouldn't trigger THP
> page fault activity, which is the aim of this patchset. Some more googling made
> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
> patchset should help, but I obviously won't be able to measure this before LSF/MM...
Just as a reference, this is what some of the more extreme profiles looked
like in the past:
> 96.50% postmaster [kernel.kallsyms] [k] _spin_lock_irq
> |
> --- _spin_lock_irq
> |
> |--99.87%-- compact_zone
> | compact_zone_order
> | try_to_compact_pages
> | __alloc_pages_nodemask
> | alloc_pages_vma
> | do_huge_pmd_anonymous_page
> | handle_mm_fault
> | __do_page_fault
> | do_page_fault
> | page_fault
> | 0x631d98
> --0.13%-- [...]
That specific profile is from a rather old kernel as you probably
recognize.
> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
> psql performance with THPs enabled/disabled on recent kernels, where e.g.
> compaction is no longer synchronous for THP page faults?
So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
RAM.
First off: it's noticeably harder to trigger problems than it used to
be. But I can still trigger various problems that are much worse with
THP enabled than without.
There seem to be several different bottlenecks; depending on the run I
get somewhat different profiles.
In a somewhat artificial workload that tries to simulate what I've seen
trigger the problem at a customer, I can quite easily trigger large
differences between THP=enable and THP=never. There are two types of
tasks running: one purely OLTP, the other doing somewhat more complex
statements that require a fair amount of process-local memory.
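To give a rough idea of the second task's memory behaviour, here's a
stripped-down sketch (hypothetical, not the actual benchmark; the numbers
below are from postgres itself): repeatedly fault in a sizeable chunk of
short-lived anonymous memory, the way a sort's or hash aggregate's work
memory would be. With THP=enable each pass can take the
do_huge_pmd_anonymous_page() -> try_to_compact_pages() path visible in the
profiles further down. The 64MB size is made up.

#include <stdlib.h>
#include <string.h>

#define WORK_MEM (64UL * 1024 * 1024)   /* made-up per-iteration allocation */

int main(void)
{
        for (;;) {
                char *buf = malloc(WORK_MEM);

                if (!buf)
                        return 1;
                memset(buf, 0xaa, WORK_MEM);    /* touch every page */
                free(buf);                      /* short-lived, gone again right away */
        }
}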
(ignore the absolute numbers for progress; I just waited for somewhat
stable results while doing other stuff)
THP off:
Task 1 solo:
progress: 200.0 s, 391442.0 tps, 0.654 ms lat
progress: 201.0 s, 394816.1 tps, 0.683 ms lat
progress: 202.0 s, 409722.5 tps, 0.625 ms lat
progress: 203.0 s, 384794.9 tps, 0.665 ms lat
combined:
Task 1:
progress: 144.0 s, 25430.4 tps, 10.067 ms lat
progress: 145.0 s, 22260.3 tps, 11.500 ms lat
progress: 146.0 s, 24089.9 tps, 10.627 ms lat
progress: 147.0 s, 25888.8 tps, 9.888 ms lat
Task 2:
progress: 24.4 s, 30.0 tps, 2134.043 ms lat
progress: 26.5 s, 29.8 tps, 2150.487 ms lat
progress: 28.4 s, 29.7 tps, 2151.557 ms lat
progress: 30.4 s, 28.5 tps, 2245.304 ms lat
flat profile:
6.07% postgres postgres [.] heap_form_minimal_tuple
4.36% postgres postgres [.] heap_fill_tuple
4.22% postgres postgres [.] ExecStoreMinimalTuple
4.11% postgres postgres [.] AllocSetAlloc
3.97% postgres postgres [.] advance_aggregates
3.94% postgres postgres [.] advance_transition_function
3.94% postgres postgres [.] ExecMakeTableFunctionResult
3.33% postgres postgres [.] heap_compute_data_size
3.30% postgres postgres [.] MemoryContextReset
3.28% postgres postgres [.] ExecScan
3.04% postgres postgres [.] ExecProject
2.96% postgres postgres [.] generate_series_step_int4
2.94% postgres [kernel.kallsyms] [k] clear_page_c
(i.e. most of the time is spent in postgres itself, and it's cache-miss
bound)
THP on:
Task 1 solo:
progress: 140.0 s, 390458.1 tps, 0.656 ms lat
progress: 141.0 s, 391174.2 tps, 0.654 ms lat
progress: 142.0 s, 394828.8 tps, 0.648 ms lat
progress: 143.0 s, 398156.2 tps, 0.643 ms lat
combined:
Task 1:
progress: 179.0 s, 23963.1 tps, 10.683 ms lat
progress: 180.0 s, 22712.9 tps, 11.271 ms lat
progress: 181.0 s, 21211.4 tps, 12.069 ms lat
progress: 182.0 s, 23207.8 tps, 11.031 ms lat
Task 2:
progress: 28.2 s, 19.1 tps, 3349.747 ms lat
progress: 31.0 s, 19.8 tps, 3230.589 ms lat
progress: 34.3 s, 21.5 tps, 2979.113 ms lat
progress: 37.4 s, 20.9 tps, 3055.143 ms lat
flat profile:
21.36% postgres [kernel.kallsyms] [k] pageblock_pfn_to_page
4.93% postgres postgres [.] ExecStoreMinimalTuple
4.02% postgres postgres [.] heap_form_minimal_tuple
3.55% postgres [kernel.kallsyms] [k] clear_page_c
2.85% postgres postgres [.] heap_fill_tuple
2.60% postgres postgres [.] ExecMakeTableFunctionResult
2.57% postgres postgres [.] AllocSetAlloc
2.44% postgres postgres [.] advance_transition_function
2.43% postgres postgres [.] generate_series_step_int4
callgraph:
18.23% postgres [kernel.kallsyms] [k] pageblock_pfn_to_page
|
--- pageblock_pfn_to_page
|
|--99.05%-- isolate_migratepages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
....
|
--0.95%-- compact_zone
compact_zone_order
try_to_compact_pages
__alloc_pages_direct_compact
__alloc_pages_nodemask
alloc_pages_vma
do_huge_pmd_anonymous_page
__handle_mm_fault
handle_mm_fault
__do_page_fault
4.98% postgres postgres [.] ExecStoreMinimalTuple
|
4.20% postgres postgres [.] heap_form_minimal_tuple
|
3.69% postgres [kernel.kallsyms] [k] clear_page_c
|
--- clear_page_c
|
|--58.89%-- __do_huge_pmd_anonymous_page
| do_huge_pmd_anonymous_page
| __handle_mm_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
As you can see, THP on/off makes a noticeable difference, especially for
Task 2; compaction suddenly takes a significant amount of time. But it's
a relatively gradual slowdown, at pretty extreme concurrency, so I'm
pretty happy already.
In the workload tested here, most non-shared allocations are
short-lived, so it's not surprising that compacting pages for them isn't
worth it. I do wonder whether it'd be possible to keep some running
statistics about whether THP is actually worthwhile.
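FWIW, a userspace-side band-aid (orthogonal to this patchset, and only a
sketch with a hypothetical helper name) is to flag regions known to be
short-lived with madvise(MADV_NOHUGEPAGE), so their faults never enter
the THP/compaction path:

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* hypothetical helper: an mmap'd arena for short-lived allocations */
static void *alloc_short_lived_arena(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        /* ask the kernel not to back this region with huge pages */
        madvise(p, len, MADV_NOHUGEPAGE);
        return p;
}

Of course that requires the application to know up front which
allocations are short-lived, which is exactly the kind of information
the kernel could conceivably track on its own.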
This is just one workload, and I saw some different profiles while
playing around. But I've already invested more time in this today than I
should have... :)
BTW, parallel process exits with large shared mappings aren't
particularly fun (a rough reproducer sketch follows after the profile):
80.09% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|
--- _raw_spin_lock_irqsave
|
|--99.97%-- pagevec_lru_move_fn
| |
| |--65.51%-- activate_page
| | mark_page_accessed.part.23
| | mark_page_accessed
| | zap_pte_range
| | unmap_page_range
| | unmap_single_vma
| | unmap_vmas
| | exit_mmap
| | mmput.part.27
| | mmput
| | exit_mm
| | do_exit
| | do_group_exit
| | sys_exit_group
| | system_call_fastpath
| |
| --34.49%-- lru_add_drain_cpu
| lru_add_drain
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| zap_pte_range
| unmap_page_range
| unmap_single_vma
| unmap_vmas
| exit_mmap
| mmput.part.27
| mmput
| exit_mm
| do_exit
| do_group_exit
| sys_exit_group
| system_call_fastpath
--0.03%-- [...]
9.75% postgres [kernel.kallsyms] [k] zap_pte_range
|
--- zap_pte_range
unmap_page_range
unmap_single_vma
unmap_vmas
exit_mmap
mmput.part.27
mmput
exit_mm
do_exit
do_group_exit
sys_exit_group
system_call_fastpath
1.93% postgres [kernel.kallsyms] [k] release_pages
|
--- release_pages
|
|--77.09%-- free_pages_and_swap_cache
| tlb_flush_mmu_free
| zap_pte_range
| unmap_page_range
| unmap_single_vma
| unmap_vmas
| exit_mmap
| mmput.part.27
| mmput
| exit_mm
| do_exit
| do_group_exit
| sys_exit_group
| system_call_fastpath
|
|--22.64%-- pagevec_lru_move_fn
| |
| |--63.88%-- activate_page
| | mark_page_accessed.part.23
| | mark_page_accessed
| | zap_pte_range
| | unmap_page_range
| | unmap_single_vma
| | unmap_vmas
| | exit_mmap
| | mmput.part.27
| | mmput
| | exit_mm
| | do_exit
| | do_group_exit
| | sys_exit_group
| | system_call_fastpath
| |
| --36.12%-- lru_add_drain_cpu
| lru_add_drain
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| zap_pte_range
| unmap_page_range
| unmap_single_vma
| unmap_vmas
| exit_mmap
| mmput.part.27
| mmput
| exit_mm
| do_exit
| do_group_exit
| sys_exit_group
| system_call_fastpath
--0.27%-- [...]
1.91% postgres [kernel.kallsyms] [k] page_remove_file_rmap
|
--- page_remove_file_rmap
|
|--98.18%-- page_remove_rmap
| zap_pte_range
| unmap_page_range
| unmap_single_vma
| unmap_vmas
| exit_mmap
| mmput.part.27
| mmput
| exit_mm
| do_exit
| do_group_exit
| sys_exit_group
| system_call_fastpath
|
--1.82%-- zap_pte_range
unmap_page_range
unmap_single_vma
unmap_vmas
exit_mmap
mmput.part.27
mmput
exit_mm
do_exit
do_group_exit
sys_exit_group
system_call_fastpath
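The kind of pattern behind that profile can be sketched roughly like
this (hypothetical reproducer, not the actual test - the real numbers
are from postgres backends tearing down their mappings of a large shared
buffer pool; sizes and counts are made up): many children map in the
same shared region and then exit at about the same time, so each exit
zaps its PTEs for the shared pages and they all pile up on the LRU
locks:

#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

#define SHMEM_SIZE      (8UL << 30)     /* "shared_buffers"-ish */
#define NCHILDREN       64

int main(void)
{
        char *shm = mmap(NULL, SHMEM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        size_t off;
        int i;

        if (shm == MAP_FAILED)
                return 1;
        memset(shm, 0, SHMEM_SIZE);     /* populate once, in the parent */

        for (i = 0; i < NCHILDREN; i++) {
                if (fork() == 0) {
                        volatile char sum = 0;

                        /* fault the shared pages into this child's page tables */
                        for (off = 0; off < SHMEM_SIZE; off += 4096)
                                sum += shm[off];
                        sleep(5);       /* crude barrier so the exits overlap */
                        _exit(sum);
                }
        }
        while (wait(NULL) > 0)
                ;
        return 0;
}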
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services