Re: [PATCH] perf scripts python arm-cs-trace-disasm.py: Skip disasm if address continuity is broken

From: James Clark
Date: Tue Aug 06 2024 - 05:57:43 EST




On 06/08/2024 10:47 am, James Clark wrote:


On 06/08/2024 8:02 am, Ganapatrao Kulkarni wrote:


On 05-08-2024 07:29 pm, James Clark wrote:


On 05/08/2024 1:22 pm, Ganapatrao Kulkarni wrote:


On 01-08-2024 03:30 pm, James Clark wrote:


On 24/07/2024 3:45 pm, James Clark wrote:


On 24/07/2024 7:38 am, Ganapatrao Kulkarni wrote:


On 23-07-2024 09:16 pm, James Clark wrote:


On 23/07/2024 4:26 pm, Ganapatrao Kulkarni wrote:


On 23-07-2024 06:40 pm, James Clark wrote:


On 22/07/2024 11:02 am, Ganapatrao Kulkarni wrote:

Hi James,

On 19-07-2024 08:09 pm, James Clark wrote:


On 19/07/2024 10:26 am, Ganapatrao Kulkarni wrote:
To generate the instruction tracing, script uses 2 contiguous packets
address range. If there a continuity brake due to discontiguous branch
address, it is required to reset the tracing and start tracing with the
new set of contiguous packets.

Adding change to identify the break and complete the remaining tracing
of current packets and restart tracing from new set of packets, if
continuity is established.


Hi Ganapatrao,

Can you add a before and after example of what's changed to the commit message? It wasn't immediately obvious to me if this is adding missing output, or it was correcting the tail end of the output that was previously wrong.

It is adding tail end of the trace as well avoiding the segfault of the perf application. With out this change the perf segfaults with as below log


./perf script --script=python:./scripts/python/arm-cs-trace-disasm.py -- -d objdump -k ../../vmlinux -v $* > dump
objdump: error: the stop address should be after the start address
Traceback (most recent call last):
   File "./scripts/python/arm-cs-trace-disasm.py", line 271, in process_event
     print_disam(dso_fname, dso_vm_start, start_addr, stop_addr)
   File "./scripts/python/arm-cs-trace-disasm.py", line 105, in print_disam
     for line in read_disam(dso_fname, dso_start, start_addr, stop_addr):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "./scripts/python/arm-cs-trace-disasm.py", line 99, in read_disam
     disasm_output = check_output(disasm).decode('utf-8').split('\n')
                     ^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/subprocess.py", line 466, in check_output
     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/subprocess.py", line 571, in run
     raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['objdump', '-d', '-z', '--start-address=0xffff80008125b758', '--stop-address=0xffff80008125a934', '../../vmlinux']' returned non-zero exit status 1.
Fatal Python error: handler_call_die: problem in Python trace event handler
Python runtime state: initialized

Current thread 0x0000ffffb05054e0 (most recent call first):
   <no Python frame>

Extension modules: perf_trace_context, systemd._journal, systemd._reader, systemd.id128, report._py3report, _dbus_bindings, problem._py3abrt (total: 7)
Aborted (core dumped)


Signed-off-by: Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx>
---
  tools/perf/scripts/python/arm-cs-trace-disasm.py | 10 ++++++++++
  1 file changed, 10 insertions(+)

diff --git a/tools/perf/scripts/python/arm-cs-trace-disasm.py b/tools/perf/scripts/python/arm-cs-trace-disasm.py
index d973c2baed1c..ad10cee2c35e 100755
--- a/tools/perf/scripts/python/arm-cs-trace-disasm.py
+++ b/tools/perf/scripts/python/arm-cs-trace-disasm.py
@@ -198,6 +198,10 @@ def process_event(param_dict):
          cpu_data[str(cpu) + 'addr'] = addr
          return
+    if (cpu_data.get(str(cpu) + 'ip') == None):
+        cpu_data[str(cpu) + 'ip'] = ip
+

Do you need to write into the global cpu_data here? Doesn't it get overwritten after you load it back into 'prev_ip'

No, the logic is same as holding the addr of previous packet.
Saving the previous packet saved ip in to prev_ip before overwriting with the current packet.

It's not exactly the same logic as holding the addr of the previous sample. For addr, we return on the first None, with your change we now "pretend" that the second one is also the previous one:

   if (cpu_data.get(str(cpu) + 'addr') == None):
     cpu_data[str(cpu) + 'addr'] = addr
     return  <----------------------------sample 0 return

   if (cpu_data.get(str(cpu) + 'ip') == None):
       cpu_data[str(cpu) + 'ip'] = ip <---- sample 1 save but no return

Then for sample 1 'prev_ip' is actually now the 'current' IP:

Yes, it is dummy for first packet. Added anticipating that we wont hit the discontinuity for the first packet itself.

Can this be changed to more intuitive like below?

diff --git a/tools/perf/scripts/python/arm-cs-trace-disasm.py b/tools/perf/scripts/python/arm-cs-trace-disasm.py
index d973c2baed1c..d49f5090059f 100755
--- a/tools/perf/scripts/python/arm-cs-trace-disasm.py
+++ b/tools/perf/scripts/python/arm-cs-trace-disasm.py
@@ -198,6 +198,8 @@ def process_event(param_dict):
                 cpu_data[str(cpu) + 'addr'] = addr
                 return

+       if (cpu_data.get(str(cpu) + 'ip') != None):
+               prev_ip = cpu_data[str(cpu) + 'ip']

         if (options.verbose == True):
                 print("Event type: %s" % name)
@@ -243,12 +245,18 @@ def process_event(param_dict):

         # Record for previous sample packet
         cpu_data[str(cpu) + 'addr'] = addr
+       cpu_data[str(cpu) + 'ip'] = stop_addr

         # Handle CS_ETM_TRACE_ON packet if start_addr=0 and stop_addr=4
         if (start_addr == 0 and stop_addr == 4):
                 print("CPU%d: CS_ETM_TRACE_ON packet is inserted" % cpu)
                 return

+       if (stop_addr < start_addr and prev_ip != 0):
+               # Continuity of the Packets broken, set start_addr to previous
+               # packet ip to complete the remaining tracing of the address range.
+               start_addr = prev_ip
+
         if (start_addr < int(dso_start) or start_addr > int(dso_end)):
                 print("Start address 0x%x is out of range [ 0x%x .. 0x%x ] for dso %s" % (start_addr, int(dso_start), int(dso_end), dso))
                 return

Without this patch below is the failure log(with segfault) for reference.

[root@sut01sys-r214 perf]# timeout 4s ./perf record -e cs_etm// -C 1 dd if=/dev/zero of=/dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.087 MB perf.data ]
[root@sut01sys-r214 perf]# ./perf script --script=python:./scripts/python/arm-cs-trace-disasm.py -- -d objdump -k ../../vmlinux -v $* > dump
objdump: error: the stop address should be after the start address
Traceback (most recent call last):
   File "./scripts/python/arm-cs-trace-disasm.py", line 271, in process_event
     print_disam(dso_fname, dso_vm_start, start_addr, stop_addr)
   File "./scripts/python/arm-cs-trace-disasm.py", line 105, in print_disam
     for line in read_disam(dso_fname, dso_start, start_addr, stop_addr):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "./scripts/python/arm-cs-trace-disasm.py", line 99, in read_disam
     disasm_output = check_output(disasm).decode('utf-8').split('\n')
                     ^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/subprocess.py", line 466, in check_output
     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/subprocess.py", line 571, in run
     raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['objdump', '-d', '-z', '--start-address=0xffff80008125b758', '--stop-address=0xffff80008125a934', '../../vmlinux']' returned non-zero exit status 1.
Fatal Python error: handler_call_die: problem in Python trace event handler
Python runtime state: initialized

Current thread 0x0000ffffb90d54e0 (most recent call first):
   <no Python frame>

Extension modules: perf_trace_context, systemd._journal, systemd._reader, systemd.id128, report._py3report, _dbus_bindings, problem._py3abrt (total: 7)
Aborted (core dumped)


dump snippet:
============
Event type: branches
Sample = { cpu: 0001 addr: 0xffff80008030cb00 phys_addr: 0x0000000000000000 ip: 0xffff800080313f0c pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff800080313f04 <__perf_event_header__init_id+0x4c>:
         ffff800080313f04:       36100094        tbz     w20, #2, ffff800080313f14 <__perf_event_header__init_id+0x5c>
         ffff800080313f08:       f941e6a0        ldr     x0, [x21, #968]
         ffff800080313f0c:       d63f0000        blr     x0
             perf 12720/12720 [0001]      5986.372298040 __perf_event_header__init_id+0x54 .../coresight/linux/kernel/events/core.c  586         return event->clock();
Event type: branches
Sample = { cpu: 0001 addr: 0xffff8000801bb4a8 phys_addr: 0x0000000000000000 ip: 0xffff80008030cb0c pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff80008030cb00 <local_clock>:
         ffff80008030cb00:       d503233f        paciasp
         ffff80008030cb04:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
         ffff80008030cb08:       910003fd        mov     x29, sp
         ffff80008030cb0c:       97faba67        bl ffff8000801bb4a8 <sched_clock>
             perf 12720/12720 [0001]      5986.372298040 local_clock+0xc ...t/linux/./include/linux/sched/clock.h   64 return sched_clock();
Event type: branches
Sample = { cpu: 0001 addr: 0xffff80008125a8a8 phys_addr: 0x0000000000000000 ip: 0xffff8000801bb4c8 pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff8000801bb4a8 <sched_clock>:
         ffff8000801bb4a8:       d503233f        paciasp
         ffff8000801bb4ac:       a9be7bfd        stp     x29, x30, [sp, #-32]!
         ffff8000801bb4b0:       910003fd        mov     x29, sp
         ffff8000801bb4b4:       a90153f3        stp     x19, x20, [sp, #16]
         ffff8000801bb4b8:       d5384113        mrs     x19, sp_el0
         ffff8000801bb4bc:       b9401260        ldr     w0, [x19, #16]
         ffff8000801bb4c0:       11000400        add     w0, w0, #0x1
         ffff8000801bb4c4:       b9001260        str     w0, [x19, #16]
         ffff8000801bb4c8:       94427cf8        bl ffff80008125a8a8 <sched_clock_noinstr>
             perf 12720/12720 [0001]      5986.372298040 sched_clock+0x20 ...sight/linux/kernel/time/sched_clock.c  105 ns = sched_clock_noinstr();
Event type: branches
Sample = { cpu: 0001 addr: 0xffff80008125b758 phys_addr: 0x0000000000000000 ip: 0xffff80008125a8e4 pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff80008125a8a8 <sched_clock_noinstr>:
         ffff80008125a8a8:       d503233f        paciasp
         ffff80008125a8ac:       a9bc7bfd        stp     x29, x30, [sp, #-64]!
         ffff80008125a8b0:       910003fd        mov     x29, sp
         ffff80008125a8b4:       a90153f3        stp     x19, x20, [sp, #16]
         ffff80008125a8b8:       b000e354        adrp    x20, ffff800082ec3000 <tick_bc_dev+0x140>
         ffff80008125a8bc:       910d0294        add     x20, x20, #0x340
         ffff80008125a8c0:       a90363f7        stp     x23, x24, [sp, #48]
         ffff80008125a8c4:       91002297        add     x23, x20, #0x8
         ffff80008125a8c8:       52800518        mov     w24, #0x28                  // #40
         ffff80008125a8cc:       a9025bf5        stp     x21, x22, [sp, #32]
         ffff80008125a8d0:       b9400296        ldr     w22, [x20]
         ffff80008125a8d4:       120002d5        and     w21, w22, #0x1
         ffff80008125a8d8:       9bb87eb5        umull   x21, w21, w24
         ffff80008125a8dc:       8b1502f3        add     x19, x23, x21
         ffff80008125a8e0:       f9400e60        ldr     x0, [x19, #24]
         ffff80008125a8e4:       d63f0000        blr     x0
             perf 12720/12720 [0001]      5986.372298040 sched_clock_noinstr+0x3c ...sight/linux/kernel/time/sched_clock.c 93                 cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
Event type: branches
Sample = { cpu: 0001 addr: 0xffff8000801bb4cc phys_addr: 0x0000000000000000 ip: 0xffff80008125a930 pid: 12720 tid: 12720 period: 1 time: 5986372298040 }


With fix:
=========

Event type: branches
Sample = { cpu: 0001 addr: 0xffff80008030cb00 phys_addr: 0x0000000000000000 ip: 0xffff800080313f0c pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff800080313f04 <__perf_event_header__init_id+0x4c>:
         ffff800080313f04:       36100094        tbz     w20, #2, ffff800080313f14 <__perf_event_header__init_id+0x5c>
         ffff800080313f08:       f941e6a0        ldr     x0, [x21, #968]
         ffff800080313f0c:       d63f0000        blr     x0
             perf 12720/12720 [0001]      5986.372298040 __perf_event_header__init_id+0x54 .../coresight/linux/kernel/events/core.c  586         return event->clock();
Event type: branches
Sample = { cpu: 0001 addr: 0xffff8000801bb4a8 phys_addr: 0x0000000000000000 ip: 0xffff80008030cb0c pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff80008030cb00 <local_clock>:
         ffff80008030cb00:       d503233f        paciasp
         ffff80008030cb04:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
         ffff80008030cb08:       910003fd        mov     x29, sp
         ffff80008030cb0c:       97faba67        bl ffff8000801bb4a8 <sched_clock>
             perf 12720/12720 [0001]      5986.372298040 local_clock+0xc ...t/linux/./include/linux/sched/clock.h   64 return sched_clock();
Event type: branches
Sample = { cpu: 0001 addr: 0xffff80008125a8a8 phys_addr: 0x0000000000000000 ip: 0xffff8000801bb4c8 pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff8000801bb4a8 <sched_clock>:
         ffff8000801bb4a8:       d503233f        paciasp
         ffff8000801bb4ac:       a9be7bfd        stp     x29, x30, [sp, #-32]!
         ffff8000801bb4b0:       910003fd        mov     x29, sp
         ffff8000801bb4b4:       a90153f3        stp     x19, x20, [sp, #16]
         ffff8000801bb4b8:       d5384113        mrs     x19, sp_el0
         ffff8000801bb4bc:       b9401260        ldr     w0, [x19, #16]
         ffff8000801bb4c0:       11000400        add     w0, w0, #0x1
         ffff8000801bb4c4:       b9001260        str     w0, [x19, #16]
         ffff8000801bb4c8:       94427cf8        bl ffff80008125a8a8 <sched_clock_noinstr>
             perf 12720/12720 [0001]      5986.372298040 sched_clock+0x20 ...sight/linux/kernel/time/sched_clock.c  105 ns = sched_clock_noinstr();
Event type: branches
Sample = { cpu: 0001 addr: 0xffff80008125b758 phys_addr: 0x0000000000000000 ip: 0xffff80008125a8e4 pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff80008125a8a8 <sched_clock_noinstr>:
         ffff80008125a8a8:       d503233f        paciasp
         ffff80008125a8ac:       a9bc7bfd        stp     x29, x30, [sp, #-64]!
         ffff80008125a8b0:       910003fd        mov     x29, sp
         ffff80008125a8b4:       a90153f3        stp     x19, x20, [sp, #16]
         ffff80008125a8b8:       b000e354        adrp    x20, ffff800082ec3000 <tick_bc_dev+0x140>
         ffff80008125a8bc:       910d0294        add     x20, x20, #0x340
         ffff80008125a8c0:       a90363f7        stp     x23, x24, [sp, #48]
         ffff80008125a8c4:       91002297        add     x23, x20, #0x8
         ffff80008125a8c8:       52800518        mov     w24, #0x28                  // #40
         ffff80008125a8cc:       a9025bf5        stp     x21, x22, [sp, #32]
         ffff80008125a8d0:       b9400296        ldr     w22, [x20]
         ffff80008125a8d4:       120002d5        and     w21, w22, #0x1
         ffff80008125a8d8:       9bb87eb5        umull   x21, w21, w24
         ffff80008125a8dc:       8b1502f3        add     x19, x23, x21
         ffff80008125a8e0:       f9400e60        ldr     x0, [x19, #24]
         ffff80008125a8e4:       d63f0000        blr     x0

It looks like the disassembly now assumes this BLR wasn't taken. We go from ffff80008125a8e4 straight through to ...

             perf 12720/12720 [0001]      5986.372298040 sched_clock_noinstr+0x3c ...sight/linux/kernel/time/sched_clock.c 93                 cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
Event type: branches
Sample = { cpu: 0001 addr: 0xffff8000801bb4cc phys_addr: 0x0000000000000000 ip: 0xffff80008125a930 pid: 12720 tid: 12720 period: 1 time: 5986372298040 }
         ffff80008125a8e8 <sched_clock_noinstr+0x40>:
         ffff80008125a8e8:       f8756ae3        ldr     x3, [x23, x21]

ffff80008125a8e4 which is just the previous one +4. Isn't your issue actually a decode issue in Perf itself? Why is there a discontinuity without branch samples being generated where either the source or destination address is 0?

What are your record options to create this issue? As I mentioned in the previous reply I haven't been able to reproduce it.

I am using below perf record command.

timeout 4s ./perf record -e cs_etm// -C 1 dd if=/dev/zero of=/dev/null


Thanks I managed to reproduce it. I'll take a look to see if I think the issue is somewhere else.


At least for the failures I encountered, the issue is due to the alternatives runtime instruction patching mechanism. vmlinux ends up being the wrong image to decode with because a load of branches are actually turned into nops.

Can you confirm if you use --kcore instead of vmlinux that you still get failures:

   sudo perf record -e cs_etm// -C 1 --kcore -o <output-folder.data> -- \
     dd if=/dev/zero of=/dev/null

    perf script -i <output-folder.data> \
     tools/perf/scripts/python/arm-cs-trace-disasm.py -d llvm-objdump \
     -k <output-folder.data>/kcore_dir/kcore


With below command combination with kcore also the issue is seen, as reported in this email chain.

timeout 8s ./perf record -e cs_etm// -C 1 --kcore -o kcore \
dd if=/dev/zero of=/dev/null

./perf script -i kcore/data \
--script=python:./scripts/python/arm-cs-trace-disasm.py -- \
-d objdump -k kcore/kcore_dir/kcore


However, with below sequence(same as your command) the issue is *not* seen.

timeout 8s ./perf record -e cs_etm// -C 1 --kcore -o kcore \
dd if=/dev/zero of=/dev/null

./perf script -i kcore/data ./scripts/python/arm-cs-trace-disasm.py \
-- -d objdump -k kcore/kcore_dir/kcore

Do you see any issue with the command, which is showing the problem?
Also the output log produced by these both commands is different.


BTW are you running this on the target or somewhere else? It's suspicious that "-i kcore/data" works at all because there is no kernel image given to Perf. Unless you are running on the target and then I think it will just open the one from /proc. Or maybe it uses /boot/vmlinux by default which also wouldn't work.

Also the difference between "--script=python:" and just giving the script name is in the parsing of the arguments following " -- ". Sometimes they're also parsed as Perf arguments (like the -v becomes perf verbose and -k becomes the Perf vmlinux rather than the script).

I _think_ you want the " -- " when "--script" is used, and no "--" when it's not. But there are some other combinations and you'll have to debug it to compare your two exact scenarios to see why they're different.

But ignoring that issue with the argument format, you mentioned you didn't see the issue any more with one version of --kcore. So I'm assuming that confirms the issue is just a decode image issue, so we shouldn't try to patch this script?


Although one change we should make to the script is change the example to use kcore. We can leave in one vmlinux one as an example if kcore isn't available, but add a note that it will fail if any patched code is traced (which is almost always).

And make the other fixes to OpenCSD to stop it from making samples that go backwards. That will fix the hard exit on the script and turn it into a regular descynchronise.


Double check the command I gave. "-i" needs to be the same as "-o" (it's the folder, not the data file). I think this could be causing your issue. Unless you give it the folder it doesn't open kcore along with the data file.


As per 'perf script --help'

        -i, --input=
            Input file name. (default: perf.data unless stdin is a fifo)


That could probably say "file name, or folder when --kcore is used", if you mean that you think it's not accurate?

But when you use --kcore the default folder (not file) name is still perf.data, so the default argument gives you a clue that you're not supposed to descend into the folder.

Also tried just giving dir as you suggested and still the same.

./perf script -i kcore --script=python:./scripts/python/arm-cs-trace-disasm.py -- -d objdump -k kcore/kcore_dir/kcore

The below diff that you have shared has no effect on the failing case.

But I still think bad decode detection should be moved as much as possible into OpenCSD and Perf rather than this script. Otherwise every tool will have to re-implement it, and OpenCSD has a lot more info to make decisions with.

One change we can make is to desynchronize when an N atom is an unconditional branch:

  diff --git a/decoder/source/etmv4/trc_pkt_decode_etmv4i.cpp b/decoder/source/etmv4/trc_pkt_decode_etmv4i.cpp
index c557998..3eefd5d 100644
--- a/decoder/source/etmv4/trc_pkt_decode_etmv4i.cpp
+++ b/decoder/source/etmv4/trc_pkt_decode_etmv4i.cpp
@@ -1341,6 +1341,14 @@ ocsd_err_t TrcPktDecodeEtmV4I::processAtom(const ocsd_atm_val atom)
          //  save recorded next instuction address
          ocsd_vaddr_t nextAddr = m_instr_info.instr_addr;

+        // must have lost sync if an unconditional branch wasn't taken
+        if (atom == ATOM_N && !m_instr_info.is_conditional) {
+             m_need_addr = true;
+             m_out_elem.addElemType(m_index_curr_pkt, OCSD_GEN_TRC_ELEM_NO_SYNC);
+             // wait for next address
+             return OCSD_OK;
+        }
+

Another one we can spot is when a new address comes that is before the current decode address (basically the backwards check that you added).

There are probably others that can be spotted like an address appearing after a direct branch that doesn't match the branch target.

I think at that point, desynchronising should cause the disassembly script to throw away the last bit, rather than force it to be printed as in this patch. As I mentioned above in the thread, it leads to printing disassembly that's implausible and misleading (where an unconditional branch wasn't taken).


Thanks,
Ganapat