[PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch

From: Alexey Budankov
Date: Fri Oct 25 2019 - 04:35:52 EST

Restore the Intel LBR call stack from a cloned inactive task perf context
on a context switch. This change addresses the inconsistency in the LBR
call stack data provided with a sample in record profiling mode:

$ perf record -N -B -T -R --call-graph lbr \
-e cpu/period=0xcdfe60,event=0x3c,name=\'CPU_CLK_UNHALTED.THREAD\'/Duk \
--clockid=monotonic_raw -- ./miniFE.x nx 25 ny 25 nz 25

Assume threads A, B and C belong to the same process.
B and C are siblings of A and their perf contexts are treated as equivalent.
At some point B blocks on a futex (a non-preemptive context switch).
B's LBRs are preserved in B's perf context task_ctx_data, and B's events
are removed from the PMU and disabled. B's perf context becomes inactive.

Later C gets on a CPU, runs, gets profiled and eventually switches to
the awakened but not yet running B. The optimized context switch path is
executed, swapping B's and C's task_ctx_data pointers in the perf event
contexts. So C's task_ctx_data will refer to B's preserved LBRs on the
following switch-in event.

However, since B's perf context is inactive, there are no enabled events
in it and B's task_ctx_data->lbr_callstack_users is equal to 0.
When B gets on the CPU, re-enabling of B's events is skipped on
the optimized context switch path and B's task_ctx_data->lbr_callstack_users
remains 0. Thus B's LBRs are not restored by the pmu sched_task() code
called at the end of the perf context switch-in callback for B.
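
The sequence above can be replayed in a small userspace model. This is
only a sketch under stated assumptions, not the kernel code: the names
task_ctx_data and lbr_callstack_users come from the description, while
struct perf_ctx, lbr_saved and all helper functions are hypothetical.
It also assumes that B continues on the context that still has events
scheduled, and that exchanging the lbr_callstack_users counters together
with the pointers is enough to let the restore proceed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task LBR state kept in task_ctx_data. */
struct task_ctx_data {
	int  lbr_callstack_users; /* enabled events that need the LBR call stack */
	bool lbr_saved;           /* LBRs were saved at the last switch-out */
};

/* Hypothetical stand-in for a perf event context. */
struct perf_ctx {
	struct task_ctx_data *task_ctx_data;
};

/* Models the switch-in decision: restore only if saved data has users. */
static bool lbr_would_restore(const struct perf_ctx *ctx)
{
	const struct task_ctx_data *d = ctx->task_ctx_data;

	return d && d->lbr_saved && d->lbr_callstack_users > 0;
}

/* Behaviour described above: only the data pointers change places. */
static void swap_ptrs_only(struct perf_ctx *prev, struct perf_ctx *next)
{
	struct task_ctx_data *tmp = prev->task_ctx_data;

	prev->task_ctx_data = next->task_ctx_data;
	next->task_ctx_data = tmp;
}

/* Assumed synchronization: additionally exchange the user counters. */
static void swap_ptrs_and_users(struct perf_ctx *prev, struct perf_ctx *next)
{
	int users;

	swap_ptrs_only(prev, next);
	if (!prev->task_ctx_data || !next->task_ctx_data)
		return;

	users = prev->task_ctx_data->lbr_callstack_users;
	prev->task_ctx_data->lbr_callstack_users =
		next->task_ctx_data->lbr_callstack_users;
	next->task_ctx_data->lbr_callstack_users = users;
}

/* Replays the C -> B switch; returns whether B's LBRs would be restored. */
static bool b_lbrs_restored(bool with_fix)
{
	struct task_ctx_data b_data = { .lbr_callstack_users = 0, .lbr_saved = true };
	struct task_ctx_data c_data = { .lbr_callstack_users = 1, .lbr_saved = false };
	struct perf_ctx active = { .task_ctx_data = &c_data };   /* events scheduled */
	struct perf_ctx inactive = { .task_ctx_data = &b_data }; /* events removed */

	if (with_fix)
		swap_ptrs_and_users(&active, &inactive);
	else
		swap_ptrs_only(&active, &inactive);

	/* B continues on the context that still has events scheduled. */
	assert(active.task_ctx_data == &b_data);
	return lbr_would_restore(&active);
}
```

In this model b_lbrs_restored(false) reports that the restore is skipped,
because B's data still carries a zero user count, while
b_lbrs_restored(true) reports that the restore proceeds.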

In the report this manifests as short fragments of B's call stack,
still tracked by the LBR hardware between adjacent samples,
while the whole thread call tree does not aggregate.

The fix has been evaluated by profiling the miniFE [1] (C++, OpenMP)
workload running 64 threads on Intel Skylake EP (64 cores, 2 sockets):

$ perf report --call-graph callee,flat

5.3.0-rc6+ (tip perf/core) - fixed

- 92.66% 82.64% miniFE.x libiomp5.so [.] _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
- 69.14% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_fork_barrier
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
start_thread
__clone
- 21.89% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_barrier
__kmpc_reduce_nowait
miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
__kmp_invoke_microtask
__kmp_invoke_task_func
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
start_thread
__clone
- 1.63% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_barrier
__kmpc_reduce_nowait
main
__kmp_invoke_microtask
__kmp_invoke_task_func
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
start_thread
__clone

5.0.13-300.fc30.x86_64 - no fix

- 90.29% 81.01% miniFE.x libiomp5.so [.] _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
- 33.45% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_fork_barrier
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
start_thread
__clone
87.63% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
- 54.79% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_fork_barrier
__kmp_launch_thread
- 9.18% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_barrier
__kmpc_reduce_nowait
miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
__kmp_invoke_microtask
__kmp_invoke_task_func
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
start_thread
__clone
- 41.28% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_fork_barrier
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
- 15.77% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_barrier
__kmpc_reduce_nowait
miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
__kmp_invoke_microtask
__kmp_invoke_task_func
__kmp_launch_thread
- 11.56% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_barrier
__kmpc_reduce_nowait
miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
__kmp_invoke_microtask
__kmp_invoke_task_func
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
- 2.33% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
__kmp_barrier
__kmpc_reduce_nowait
main
__kmp_invoke_microtask
__kmp_invoke_task_func
__kmp_launch_thread
_INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
start_thread
__clone
0.67% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_gather
0.57% __kmp_hardware_timestamp

[1] https://www.hpcadvisorycouncil.com/pdf/miniFE_Analysis_and_Profiling.pdf

---
Alexey Budankov (4):
perf/core,x86: introduce swap_task_ctx() method at struct pmu
perf/x86: install platform specific swap_task_ctx adapter
perf/x86/intel: implement LBR callstacks context synchronization
perf/core,x86: synchronize PMU task contexts on optimized context switches

arch/x86/events/core.c | 8 ++++++++
arch/x86/events/intel/core.c | 7 +++++++
arch/x86/events/intel/lbr.c | 23 +++++++++++++++++++++++
arch/x86/events/perf_event.h | 11 +++++++++++
include/linux/perf_event.h | 9 +++++++++
kernel/events/core.c | 13 ++++++++++++-
6 files changed, 70 insertions(+), 1 deletion(-)

---
Changes in v5:
- renamed sync_task_ctx to swap_task_ctx;
- converted type of prev and next swap_task_ctx() params to struct perf_event_context;
- implemented a check for availability of the swap_task_ctx() pointer
in the pmu type in the perf core implementation;
- moved swap of ctx->task_ctx_data pointers to architecture specific
intel_pmu_lbr_swap_task_ctx() implementation;
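
The v5 items above describe an optional per-PMU method that the core
checks before calling. A minimal userspace sketch of that dispatch
pattern follows. struct pmu_model, struct perf_ctx and every function
name here are hypothetical, and the fallback to a plain pointer swap
when the method is absent is an assumption of this sketch rather than
something stated in the changelog.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a perf event context. */
struct perf_ctx {
	void *task_ctx_data;
};

/* Hypothetical pmu type with an optional swap_task_ctx() method. */
struct pmu_model {
	void (*swap_task_ctx)(struct perf_ctx *prev, struct perf_ctx *next);
};

/* Generic behaviour: exchange only the task_ctx_data pointers. */
static void generic_swap(struct perf_ctx *prev, struct perf_ctx *next)
{
	void *tmp = prev->task_ctx_data;

	prev->task_ctx_data = next->task_ctx_data;
	next->task_ctx_data = tmp;
}

/* Counts invocations so that the dispatch can be observed. */
static int arch_swap_calls;

/* Example architecture specific method: it owns the pointer swap. */
static void arch_swap_task_ctx(struct perf_ctx *prev, struct perf_ctx *next)
{
	arch_swap_calls++;
	generic_swap(prev, next);
	/* Architecture specific state synchronization would follow here. */
}

/* Core side: use the method when provided, otherwise swap generically. */
static void core_swap_task_ctx(const struct pmu_model *pmu,
			       struct perf_ctx *prev, struct perf_ctx *next)
{
	assert(pmu && prev && next);

	if (pmu->swap_task_ctx)
		pmu->swap_task_ctx(prev, next);
	else
		generic_swap(prev, next);
}
```

A pmu_model with a NULL swap_task_ctx takes the generic path, and one
that sets swap_task_ctx to arch_swap_task_ctx takes the architecture
specific path. In both cases the two contexts end up with exchanged
task_ctx_data pointers.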

Changes in v4:
- moved check on simultaneous task_ctx_data objects availability
to the perf/core layer;
- marked sync_task_ctx() as optional in code comments;
- renamed params of sync_task_ctx() to prev and next;

Changes in v3:
- replaced assignment with swap at intel_pmu_lbr_sync_task_ctx()

Changes in v2:
- implemented sync_task_ctx() method at perf,x86,intel pmu types;
- employed the method on the optimized context switch path between
equivalent perf event contexts;

--
2.20.1
