Re: [PATCH v2 00/16] perf arm64: Support data type profiling

From: James Clark

Date: Mon Apr 20 2026 - 05:31:56 EST

On 17/04/2026 02:53, Tengda Wu wrote:
> On 2026/4/16 23:31, James Clark wrote:
>> On 03/04/2026 10:47, Tengda Wu wrote:
>>> This patch series implements data type profiling support for arm64,
>>> building upon the foundational work previously contributed by Huafei [1].
>>> While the initial version laid the groundwork for arm64 data type analysis,
>>> this series iterates on that work by refining instruction parsing and
>>> extending support for core architectural features.
>>>
>>> The series is organized as follows:
>>>
>>> 1. Fix disassembly mismatches (Patches 01-02)
>>>     perf annotate currently supports three disassembly backends: llvm,
>>>     capstone, and objdump. On arm64, inconsistencies between the output
>>>     of these backends (specifically llvm/capstone vs. objdump) often
>>>     prevent the tracker from correctly identifying registers and offsets.
>>>     These patches resolve the mismatches, ensuring consistent instruction
>>>     parsing across all supported backends.
>>>
>>> 2. Infrastructure for arm64 operand parsing (Patches 03-07)
>>>     These patches establish the necessary infrastructure for arm64-specific
>>>     operand handling. This includes implementing new callbacks and data
>>>     structures to manage arm64's unique addressing modes and register sets.
>>>     This foundation is essential for the subsequent type-tracking logic.
>>>
>>> 3. Core instruction tracking (Patches 08-16)
>>>     These patches implement the core logic for type tracking on arm64,
>>>     covering a wide range of instructions, including:
>>>
>>>     * Memory Access: ldr/str variants (including stack-based access).
>>>     * Arithmetic & Data Processing: mov, add, and adrp.
>>>     * Special Access: System register access (mrs) and per-cpu variable
>>>       tracking.
>>> The implementation draws inspiration from the existing x86 logic while
>>> adapting it to the nuances of the AArch64 ISA [2][3]. With these changes,
>>> perf annotate can successfully resolve memory locations and register
>>> types, enabling comprehensive data type profiling on arm64 platforms.

>>> Example Result
>>> ==============
>>>
>>> # perf mem record -a -K -- sleep 1
>>> # perf annotate --data-type --type-stat --stdio

>> Hi Tengda,
>>
>> Did you run this with any itrace options? If I run your command I get repeated blocks of duplicate stats and types, which is very confusing: one block for each sample type that we generate when decoding SPE.
>>
>> For example, the default perf report output has all these groups:
>>
>>   Available samples
>>   0 arm_spe_0/
>>     ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=30/
>>   0 dummy:u
>>   3 l1d-miss
>>   18 l1d-access
>>   0 llc-miss
>>   0 llc-access
>>   0 tlb-miss
>>   22 tlb-access
>>   0 branch
>>   0 remote-access
>>   22 memory
>>   22 instructions
>>
>> Obviously there are 22 samples total (instructions) and they get duplicated into whatever other categories they happen to have flags for.

> Yes, I agree. The duplication makes the type stats misleading.
> De-duplication is definitely necessary here.

>> To remove the duplicates you have to use --itrace=i1i. Should that be the default for perf annotate with SPE?

> I'll look into making this the default behavior for SPE data-type
> annotation.

>>> Annotate data type stats:
>>> total 6204, ok 5091 (82.1%), bad 1113 (17.9%)
>>> -----------------------------------------------------------
>>>          29 : no_sym
>>>         196 : no_var
>>>         806 : no_typeinfo
>>>          82 : bad_offset
>>>        1370 : insn_track

> Here are the results with --itrace=i1i (a slight decrease in accuracy):
>
> Annotate data type stats:
> total 1138, ok 877 (77.1%), bad 261 (22.9%)

I'm still a bit confused about why you seem to get a 'total' count that is the sum of all the sample groups, given that it went from 6204 to 1138 when you asked for only the instructions samples. In my case I get separate groups, and asking for only the instructions samples doesn't change the value of the last 'total'; it just removes the other outputs.

It shouldn't change the accuracy either, because the instructions group is the top-level one, which contains all of the samples.

> -----------------------------------------------------------
>           6 : no_sym
>          44 : no_var
>         197 : no_typeinfo
>          14 : bad_offset
>         238 : insn_track

> This will be the new baseline, and I will work on further optimizations
> from here.
>
> Best regards,
> Tengda

>>> Annotate type: 'struct page' in [kernel.kallsyms] (59208 samples):
>>> ============================================================================
>>>   Percent     offset       size  field
>>>    100.00          0       0x40  struct page      {
>>>      9.95          0        0x8      long unsigned int   flags;
>>>     52.83        0x8       0x28      union        {
>>>     52.83        0x8       0x28          struct   {
>>>     37.21        0x8       0x10              union        {
>>>     37.21        0x8       0x10                  struct list_head        lru {
>>>     37.21        0x8        0x8                      struct list_head*   next;
>>>      0.00       0x10        0x8                      struct list_head*   prev;
>>>                                                  };
>>>     37.21        0x8       0x10                  struct   {
>>>     37.21        0x8        0x8                      void*       __filler;
>>>      0.00       0x10        0x4                      unsigned int        mlock_count;
>>>     ...

>>> Changes since v1 (reworked from Huafei's series):
>>>
>>>   - Fix inconsistencies in arm64 instruction output across llvm, capstone,
>>>     and objdump disassembly backends.
>>>   - Support arm64-specific addressing modes and operand formats. (Leo Yan)
>>>   - Extend instruction tracking to support mov and add instructions,
>>>     along with per-cpu and stack variables.
>>>   - Include real-world examples in commit messages to demonstrate
>>>     practical effects. (Namhyung Kim)
>>>   - Improve type-tracking success rate (type stat) from 64.2% to 82.1%.
>>>     https://lore.kernel.org/all/20250314162137.528204-1-lihuafei1@xxxxxxxxxx/
>>>
>>> Please let me know if you have any feedback.
>>>
>>> Thanks,
>>> Tengda
>>>
>>> [1] https://lore.kernel.org/all/20250314162137.528204-1-lihuafei1@xxxxxxxxxx/
>>> [2] https://developer.arm.com/documentation/102374/0103
>>> [3] https://github.com/flynd/asmsheets/releases/tag/v8

>>> ---
>>>
>>> Tengda Wu (16):
>>>    perf llvm: Fix arm64 adrp instruction disassembly mismatch with
>>>      objdump
>>>    perf capstone: Fix arm64 jump/adrp disassembly mismatch with objdump
>>>    perf annotate-arm64: Generalize arm64_mov__parse to support standard
>>>      operands
>>>    perf annotate-arm64: Handle load and store instructions
>>>    perf annotate: Introduce extract_op_location callback for
>>>      arch-specific parsing
>>>    perf dwarf-regs: Adapt get_dwarf_regnum() for arm64
>>>    perf annotate-arm64: Implement extract_op_location() callback
>>>    perf annotate-arm64: Enable instruction tracking support
>>>    perf annotate-arm64: Support load instruction tracking
>>>    perf annotate-arm64: Support store instruction tracking
>>>    perf annotate-arm64: Support stack variable tracking
>>>    perf annotate-arm64: Support 'mov' instruction tracking
>>>    perf annotate-arm64: Support 'add' instruction tracking
>>>    perf annotate-arm64: Support 'adrp' instruction to track global
>>>      variables
>>>    perf annotate-arm64: Support per-cpu variable access tracking
>>>    perf annotate-arm64: Support 'mrs' instruction to track 'current'
>>>      pointer
>>>
>>>   .../perf/util/annotate-arch/annotate-arm64.c  | 642 +++++++++++++++++-
>>>   .../util/annotate-arch/annotate-powerpc.c     |  10 +
>>>   tools/perf/util/annotate-arch/annotate-x86.c  |  88 ++-
>>>   tools/perf/util/annotate-data.c               |  72 +-
>>>   tools/perf/util/annotate-data.h               |   7 +-
>>>   tools/perf/util/annotate.c                    | 108 +--
>>>   tools/perf/util/annotate.h                    |  12 +
>>>   tools/perf/util/capstone.c                    | 107 ++-
>>>   tools/perf/util/disasm.c                      |   5 +
>>>   tools/perf/util/disasm.h                      |   5 +
>>>   .../util/dwarf-regs-arch/dwarf-regs-arm64.c   |  20 +
>>>   tools/perf/util/dwarf-regs.c                  |   2 +-
>>>   tools/perf/util/include/dwarf-regs.h          |   1 +
>>>   tools/perf/util/llvm.c                        |  50 ++
>>>   14 files changed, 984 insertions(+), 145 deletions(-)
>>>
>>> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb