Re: [PATCH] perf intel-pt: don't zero the whole perf_sample

From: Ian Rogers
Date: Mon Jan 13 2025 - 14:46:01 EST


On Mon, Jan 13, 2025 at 9:30 AM Ian Rogers <irogers@xxxxxxxxxx> wrote:
>
> On Mon, Jan 13, 2025 at 12:15 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> >
> > On 11/01/25 19:56, Tavian Barnes wrote:
> > > C designated initializers like
> > >
> > > struct perf_sample sample = { .ip = 0, };
> > >
> > > set every unmentioned field of the struct to zero. But since
> > > sizeof(struct perf_sample) == 1384, this takes a long time.
> > >
> > > struct perf_sample does not need to be fully initialized, and even
> >
> > Yes it does need to be fully initialized. Leaving members
> > uninitialized in the hope that they never get used adds to
> > code complexity e.g. how do you know they never are used,
> > or future members never will be used.
>
> First thought, perhaps we can use sanitizers to catch mistakes but
> leave the default optimized code to not have any overhead.
>
> Second thought, the perf_sample looks like:
> ```
> $ pahole perf
> ...
> struct perf_sample {
> u64 ip; /* 0 8 */
> u32 pid; /* 8 4 */
> u32 tid; /* 12 4 */
> u64 time; /* 16 8 */
> u64 addr; /* 24 8 */
> u64 id; /* 32 8 */
> u64 stream_id; /* 40 8 */
> u64 period; /* 48 8 */
> u64 weight; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> u64 transaction; /* 64 8 */
> u64 insn_cnt; /* 72 8 */
> u64 cyc_cnt; /* 80 8 */
> u32 cpu; /* 88 4 */
> u32 raw_size; /* 92 4 */
> u64 data_src; /* 96 8 */
> u64 phys_addr; /* 104 8 */
> u64 data_page_size; /* 112 8 */
> u64 code_page_size; /* 120 8 */
> /* --- cacheline 2 boundary (128 bytes) --- */
> u64 cgroup; /* 128 8 */
> u32 flags; /* 136 4 */
> u32 machine_pid; /* 140 4 */
> u32 vcpu; /* 144 4 */
> u16 insn_len; /* 148 2 */
> u8 cpumode; /* 150 1 */
>
> /* XXX 1 byte hole, try to pack */
>
> u16 misc; /* 152 2 */
> u16 ins_lat; /* 154 2 */
> union {
> u16 p_stage_cyc; /* 156 2 */
> u16 retire_lat; /* 156 2 */
> }; /* 156 2 */
> union {
> u16 p_stage_cyc; /*
> 0 2 */
> u16 retire_lat; /*
> 0 2 */
> };
>
> _Bool no_hw_idx; /* 158 1 */
> char insn[16]; /* 159 16 */
>
> /* XXX 1 byte hole, try to pack */
>
> void * raw_data; /* 176 8 */
> struct ip_callchain * callchain; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> struct branch_stack * branch_stack; /* 192 8 */
> u64 * branch_stack_cntr; /* 200 8 */
> struct regs_dump user_regs; /* 208 544 */
> /* --- cacheline 11 boundary (704 bytes) was 48 bytes ago --- */
> struct regs_dump intr_regs; /* 752 544 */
> /* --- cacheline 20 boundary (1280 bytes) was 16 bytes ago --- */
> struct stack_dump user_stack; /* 1296 24 */
> struct sample_read read; /* 1320 40 */
> /* --- cacheline 21 boundary (1344 bytes) was 16 bytes ago --- */
> struct aux_sample aux_sample; /* 1360 16 */
> struct simd_flags simd_flags; /* 1376 8 */
>
> /* size: 1384, cachelines: 22, members: 39 */
> /* sum members: 1382, holes: 2, sum holes: 2 */
> /* last cacheline: 40 bytes */
> };
> ```
> The two regs_dump structs are 79% of the size. They comprise of:
> ```
> struct regs_dump {
> u64 abi; /* 0 8 */
> u64 mask; /* 8 8 */
> u64 * regs; /* 16 8 */
> u64 cache_regs[64]; /* 24 512 */
> /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
> u64 cache_mask; /* 536 8 */
>
> /* size: 544, cachelines: 9, members: 5 */
> /* last cacheline: 32 bytes */
> };
> ```
> So the cache_regs are taking up the majority of the space. I have a
> WIP patch set to make user_regs and intr_regs optional/lazily
> allocated. I'll send it out shortly.

Sent as:
https://lore.kernel.org/lkml/20250113194345.1537821-1-irogers@xxxxxxxxxx/

Thanks,
Ian