Re: [PATCH v1 2/3] perf record: apply affinity masks when reading mmap buffers

From: Alexey Budankov
Date: Thu Dec 13 2018 - 02:08:14 EST



Hi,
On 12.12.2018 15:15, Jiri Olsa wrote:
> On Wed, Dec 12, 2018 at 10:40:22AM +0300, Alexey Budankov wrote:
>>
>> Build node cpu masks for mmap data buffers. Bind AIO data buffers
>> to nodes according to kernel data buffers location. Apply node cpu
>> masks to trace reading thread every time it references memory cross
>> node or cross cpu.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx>
>> ---
>> tools/perf/builtin-record.c | 9 +++++++++
>> tools/perf/util/evlist.c | 6 +++++-
>> tools/perf/util/mmap.c | 38 ++++++++++++++++++++++++++++++++++++-
>> tools/perf/util/mmap.h | 1 +
>> 4 files changed, 52 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 4979719e54ae..1a1438c73f96 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -532,6 +532,9 @@ static int record__mmap_evlist(struct record *rec,
>> struct record_opts *opts = &rec->opts;
>> char msg[512];
>>
>> + if (opts->affinity != PERF_AFFINITY_SYS)
>> + cpu__setup_cpunode_map();
>> +
>> if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
>> opts->auxtrace_mmap_pages,
>> opts->auxtrace_snapshot_mode,
>> @@ -751,6 +754,12 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>> struct perf_mmap *map = &maps[i];
>>
>> if (map->base) {
>> + if (rec->opts.affinity != PERF_AFFINITY_SYS &&
>> + !CPU_EQUAL(&rec->affinity_mask, &map->affinity_mask)) {
>> + CPU_ZERO(&rec->affinity_mask);
>> + CPU_OR(&rec->affinity_mask, &rec->affinity_mask, &map->affinity_mask);
>> + sched_setaffinity(0, sizeof(rec->affinity_mask), &rec->affinity_mask);
>> + }
>
> hum, so you change affinity every time you read different map?

That is exactly what happens with --affinity=cpu. With --affinity=node,
the thread affinity changes only when the thread picks up an mmap buffer
allocated on a remote node. On a dual-socket machine that is at most
twice per loop iteration.
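
To make the switch counts concrete, here is a small standalone model of
the read loop (not perf code). It assumes one mmap buffer per cpu visited
in cpu order, cpus numbered contiguously within each node, and a plain
64-bit word standing in for cpu_set_t. The enum and function names are
made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the --affinity modes discussed here. */
enum affinity_mode { AFFINITY_SYS, AFFINITY_NODE, AFFINITY_CPU };

/* One bit per cpu; a simplified stand-in for cpu_set_t (max 64 cpus). */
typedef uint64_t cpu_mask_t;

/* Mask attached to the buffer of `cpu` under the given mode. */
static cpu_mask_t buffer_mask(enum affinity_mode mode, int cpu, int cpus_per_node)
{
	cpu_mask_t node_bits;
	int node;

	assert(cpu >= 0 && cpu < 64);
	assert(cpus_per_node > 0 && cpus_per_node <= 64);

	if (mode == AFFINITY_CPU)
		return (cpu_mask_t)1 << cpu;

	/* AFFINITY_NODE: every cpu of the node that owns `cpu`. */
	node = cpu / cpus_per_node;
	node_bits = cpus_per_node == 64 ? ~(cpu_mask_t)0
					: ((cpu_mask_t)1 << cpus_per_node) - 1;
	return node_bits << (node * cpus_per_node);
}

/*
 * Walk buffers 0..nr_cpus-1 once and count how often the reading thread
 * would have to change its mask. *thread_mask carries over between passes.
 */
static int count_switches(enum affinity_mode mode, int nr_cpus,
			  int cpus_per_node, cpu_mask_t *thread_mask)
{
	int switches = 0;
	int cpu;

	if (mode == AFFINITY_SYS)
		return 0;	/* affinity is left to the scheduler */

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		cpu_mask_t want = buffer_mask(mode, cpu, cpus_per_node);

		if (*thread_mask != want) {
			/* The patch calls sched_setaffinity() at this point. */
			*thread_mask = want;
			switches++;
		}
	}
	return switches;
}
```

In this model, with 8 cpus split 4+4 across two nodes and an initially
empty thread mask, one pass costs 8 switches in cpu mode and 2 in node
mode.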

> I'm surprised this is actually faster..

Imagine that an app thread running on cpu 0 of node 1 generates samples
into a kernel buffer that is also allocated on node 1. The tool thread
running on cpu 0 of node 0 takes the buffer and passes part of it to the
write syscall. That can cause a cross-node memory move (from the kernel
buffer into fs cache buffers, with some portion of the write syscall code
executing on cpu 0 of node 0) and so adds collection overhead.
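
Whether a given read is cross-node can be decided from a cpu-to-node
table. Below is a rough standalone sketch of building one; it is not the
code behind cpu__setup_cpunode_map(). It assumes the topology is given as
cpu list strings in the "0-3,8-11" format that the kernel exposes in
/sys/devices/system/node/nodeN/cpulist. MAX_CPUS and both function names
are made up for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CPUS 64	/* arbitrary limit for this sketch */

/*
 * Record `node` for every cpu named in a cpu list such as "0-3,8-11".
 * A trailing newline is accepted. Returns 0 on success, -1 on bad input.
 */
static int map_cpulist(const char *list, int node, int *cpu_to_node)
{
	const char *p = list;

	assert(list && cpu_to_node);

	while (*p && *p != '\n') {
		char *end;
		long first = strtol(p, &end, 10);
		long last = first;
		long cpu;

		if (end == p)
			return -1;	/* no digits where a cpu was expected */
		if (*end == '-') {
			p = end + 1;
			last = strtol(p, &end, 10);
			if (end == p)
				return -1;
		}
		if (first < 0 || last < first || last >= MAX_CPUS)
			return -1;

		for (cpu = first; cpu <= last; cpu++)
			cpu_to_node[cpu] = node;

		if (*end != ',')
			return (*end == '\0' || *end == '\n') ? 0 : -1;
		p = end + 1;
	}
	return 0;	/* empty list: node without cpus */
}

/* True when the reading cpu and the buffer's cpu sit on different nodes. */
static int is_cross_node(const int *cpu_to_node, int reader_cpu, int buffer_cpu)
{
	return cpu_to_node[reader_cpu] != cpu_to_node[buffer_cpu];
}
```

With node 0 described as "0-3,8-11" and node 1 as "4-7,12-15", a reader
on cpu 0 touching the buffer of cpu 4 is cross-node, while touching the
buffer of cpu 8 is not.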

>
> anyway this patch is doing 2 things.. binding the memory allocation
> to nodes and setting the process affinity, please separate those and
> explain the logic behind

Separated in v2. Binding is implemented for AIO user space buffers only,
to map them to the same nodes the kernel buffers are mapped to. Tool
thread affinity mask bouncing is implemented for, and applies to, both
serial and AIO streaming. AIO streaming without binding can result in
cross-node memory moves from kernel buffers to AIO ones.

Thanks,
Alexey

>
> thanks,
> jirka
>