Re: [PATCH 1/1] perf trace: If a syscall arg is marked as 'const', assume it is coming _from_ userspace

From: Ian Rogers
Date: Tue Sep 10 2024 - 13:43:40 EST


On Tue, Sep 10, 2024 at 10:05 AM <arnaldo.melo@xxxxxxxxx> wrote:
>
> We need to decide where to copy syscall arg contents, if at the
> syscalls:sys_entry hook, meaning is something that is coming from
> user to kernel space, or if it is a response, i.e. if it is something
> the _kernel_ is filling in and thus going to userspace.
>
> Since we have 'const' used in those syscalls, and unsure about this
> being consistent, doing:
>
> root@number:~# echo $(grep const /sys/kernel/tracing/events/syscalls/sys_enter_*/format | grep struct | cut -c47- | cut -d'/' -f1)
> clock_nanosleep clock_settime epoll_pwait2 futex io_pgetevents landlock_create_ruleset listmount mq_getsetattr mq_notify mq_timedreceive mq_timedsend preadv2 preadv prlimit64 process_madvise process_vm_readv process_vm_readv process_vm_writev process_vm_writev pwritev2 pwritev readv rt_sigaction rt_sigtimedwait semtimedop statmount timerfd_settime timer_settime vmsplice writev
> root@number:~#
>
> Seems to indicate that we can use that for the ones that have the
> 'const' to mark it as coming from user space, do it.
>
> Most notable/frequent syscall that now gets BTF pretty printed in a
> system wide 'perf trace' session is:
>
> root@number:~# perf trace
> 21.160 ( ): MediaSu~isor #/1028597 futex(uaddr: 0x7f49e1dfe964, op: WAIT_BITSET|PRIVATE_FLAG, utime: (struct __kernel_timespec){.tv_sec = (__kernel_time64_t)50290,.tv_nsec = (long long int)810362837,}, val3: MATCH_ANY) ...
> 21.166 ( 0.000 ms): RemVidChild/6995 futex(uaddr: 0x7f49fcc7fa00, op: WAKE|PRIVATE_FLAG, val: 1) = 0
> 21.169 ( 0.001 ms): RemVidChild/6995 sendmsg(fd: 25<socket:[78915]>, msg: 0x7f49e9af9da0, flags: DONTWAIT) = 280
> 21.172 ( 0.289 ms): RemVidChild/6995 futex(uaddr: 0x7f49fcc7fa58, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, val3: MATCH_ANY) = 0
> 21.463 ( 0.000 ms): RemVidChild/6995 futex(uaddr: 0x7f49fcc7fa00, op: WAKE|PRIVATE_FLAG, val: 1) = 0
> 21.467 ( 0.001 ms): RemVidChild/6995 futex(uaddr: 0x7f49e28bb964, op: WAKE|PRIVATE_FLAG, val: 1) = 1
> 21.160 ( 0.314 ms): MediaSu~isor #/1028597 ... [continued]: futex()) = 0
> 21.469 ( ): RemVidChild/6995 futex(uaddr: 0x7f49fcc7fa5c, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, val3: MATCH_ANY) ...
> 21.475 ( 0.000 ms): MediaSu~isor #/1028597 futex(uaddr: 0x7f49d0223040, op: WAKE|PRIVATE_FLAG, val: 1) = 0
> 21.478 ( 0.001 ms): MediaSu~isor #/1028597 futex(uaddr: 0x7f49e26ac964, op: WAKE|PRIVATE_FLAG, val: 1) = 1
> ^Croot@number:~#
> root@number:~# cat /sys/kernel/tracing/events/syscalls/sys_enter_futex/format
> name: sys_enter_futex
> ID: 454
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:int __syscall_nr; offset:8; size:4; signed:1;
> field:u32 * uaddr; offset:16; size:8; signed:0;
> field:int op; offset:24; size:8; signed:0;
> field:u32 val; offset:32; size:8; signed:0;
> field:const struct __kernel_timespec * utime; offset:40; size:8; signed:0;
> field:u32 * uaddr2; offset:48; size:8; signed:0;
> field:u32 val3; offset:56; size:8; signed:0;
>
> print fmt: "uaddr: 0x%08lx, op: 0x%08lx, val: 0x%08lx, utime: 0x%08lx, uaddr2: 0x%08lx, val3: 0x%08lx", ((unsigned long)(REC->uaddr)), ((unsigned long)(REC->op)), ((unsigned long)(REC->val)), ((unsigned long)(REC->utime)), ((unsigned long)(REC->uaddr2)), ((unsigned long)(REC->val3))
> root@number:~#
>
> Suggested-by: Ian Rogers <irogers@xxxxxxxxxx>
> Cc: Adrian Hunter <adrian.hunter@xxxxxxxxx>
> Cc: Alan Maguire <alan.maguire@xxxxxxxxxx>
> Cc: Howard Chu <howardchu95@xxxxxxxxx>
> Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
> Cc: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> Cc: Namhyung Kim <namhyung@xxxxxxxxxx>
> Signed-off-by: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>

Reviewed-by: Ian Rogers <irogers@xxxxxxxxxx>

Thanks,
Ian

> ---
> tools/perf/builtin-trace.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index 83eb15b72333edd9..e26baf1256d5bb56 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -2001,11 +2001,14 @@ syscall_arg_fmt__init_array(struct syscall_arg_fmt *arg, struct tep_format_field
>
> len = strlen(field->name);
>
> + // As far as heuristics (or intention) goes this seems to hold true, and makes sense!
> + if ((field->flags & TEP_FIELD_IS_POINTER) && strstarts(field->type, "const "))
> + arg->from_user = true;
> +
> if (strcmp(field->type, "const char *") == 0 &&
> ((len >= 4 && strcmp(field->name + len - 4, "name") == 0) ||
> strstr(field->name, "path") != NULL)) {
> arg->scnprintf = SCA_FILENAME;
> - arg->from_user = true;
> } else if ((field->flags & TEP_FIELD_IS_POINTER) || strstr(field->name, "addr"))
> arg->scnprintf = SCA_PTR;
> else if (strcmp(field->type, "pid_t") == 0)
> --
> 2.46.0
>