Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
From: Alexey Budankov
Date: Mon Sep 10 2018 - 06:40:23 EST
On 10.09.2018 12:18, Ingo Molnar wrote:
> * Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx> wrote:
>> Currently in record mode the tool writes the trace serially.
>> The algorithm loops over the mapped per-cpu data buffers and stores
>> ready data chunks into a trace file using the write() system call.
>> Under some circumstances the kernel may lack free space in a buffer
>> because the other half of that buffer has not yet been written to
>> disk: the tool is busy writing another buffer's data at that moment.
>> Thus the serial trace writing implementation may cause the kernel
>> to lose profiling data, and that is what is observed when profiling
>> highly parallel CPU bound workloads on machines with a large number
>> of cores.
> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
>> The data loss metric is the ratio lost_time/elapsed_time, where
>> lost_time is the sum of the time intervals containing
>> PERF_RECORD_LOST records and elapsed_time is the elapsed run time
>> of the application under profiling.
>> Applying asynchronous trace streaming through the POSIX AIO API
>> lowers the data loss metric value, providing a 2x improvement -
>> lowering 98% loss to almost 0%.
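For reference, the metric quoted above can be computed roughly as in the sketch below. This is only an illustration, not code from the patch set: struct interval and data_loss_ratio() are hypothetical names, and it assumes the intervals containing PERF_RECORD_LOST records have already been extracted from the trace as start/end timestamps in seconds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one time interval (in seconds) that contains
 * PERF_RECORD_LOST records. Not a perf data structure. */
struct interval {
	double start;
	double end;
};

/* Hypothetical helper: lost_time / elapsed_time, where lost_time is
 * the sum of the lengths of the given intervals. */
static double data_loss_ratio(const struct interval *lost, size_t n,
			      double elapsed_time)
{
	double lost_time = 0.0;
	size_t i;

	assert(elapsed_time > 0.0);

	for (i = 0; i < n; i++)
		lost_time += lost[i].end - lost[i].start;

	return lost_time / elapsed_time;
}
```

With two lost intervals of 1.0 s and 0.5 s in a 3.0 s run, the ratio comes out as 0.5, i.e. 50% loss.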
> Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back
> to threads anyway when there's no kernel AIO support (which there probably isn't for perf
Explicit threading is surely an option but having more threads
in the tool that stream performance data is a considerable
Luckily, the glibc AIO implementation is already based on pthreads,
but it has a writing thread only for each distinct fd.
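To make the AIO approach concrete, here is a minimal sketch of queueing one data chunk with aio_write() and waiting for its completion. It is not the code from the patches: chunk_write_start(), chunk_write_finish() and demo_roundtrip() are hypothetical names used only for illustration, and the temporary file stands in for the trace file. On older glibc versions this may need linking with -lrt.

```c
#define _POSIX_C_SOURCE 200809L

#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: queue one chunk for asynchronous writing at
 * offset 'off'. The buffer must stay valid and unmodified until the
 * request completes. Returns 0 on success, -1 with errno set. */
static int chunk_write_start(struct aiocb *cb, int fd, const void *buf,
			     size_t len, off_t off)
{
	assert(cb != NULL && buf != NULL);

	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = fd;
	cb->aio_buf = (void *)buf;
	cb->aio_nbytes = len;
	cb->aio_offset = off;
	cb->aio_sigevent.sigev_notify = SIGEV_NONE; /* completion is polled */

	return aio_write(cb);
}

/* Hypothetical helper: block until the request finishes and return
 * the number of bytes written, or -1 with errno set. */
static ssize_t chunk_write_finish(struct aiocb *cb)
{
	const struct aiocb *list[1] = { cb };
	int err;

	while ((err = aio_error(cb)) == EINPROGRESS) {
		if (aio_suspend(list, 1, NULL) == -1 && errno != EINTR)
			return -1;
	}
	if (err != 0) {
		if (err > 0)
			errno = err; /* error code of the failed write */
		return -1;
	}
	return aio_return(cb);
}

/* Usage example: write a chunk to a temporary file and read it back.
 * Returns 0 if the data round-trips intact, -1 otherwise. */
static int demo_roundtrip(void)
{
	char path[] = "/tmp/aio_sketch_XXXXXX";
	const char chunk[] = "sample trace chunk";
	char back[sizeof(chunk)] = { 0 };
	struct aiocb cb;
	int fd = mkstemp(path);
	int ok;

	if (fd < 0)
		return -1;
	unlink(path); /* the file goes away once fd is closed */

	ok = chunk_write_start(&cb, fd, chunk, sizeof(chunk), 0) == 0 &&
	     chunk_write_finish(&cb) == (ssize_t)sizeof(chunk) &&
	     pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	     memcmp(chunk, back, sizeof(chunk)) == 0;

	close(fd);
	return ok ? 0 : -1;
}
```

In a real record loop the tool would presumably start a write for one buffer and move on to the next buffer instead of waiting immediately; the sketch waits right away only to keep the example short.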
> Per-CPU threading the record session would have so many other advantages as well (scalability,
> Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the
Tool threads may contend, and actually do contend, with application
threads under heavy load when all CPU cores are utilized,
and this may alter the performance profile.