Re: [PATCH] exit: add trace_task_exit() tracepoint before current->mm is reset

From: Andrii Nakryiko
Date: Tue Apr 01 2025 - 18:04:35 EST


On Tue, Apr 1, 2025 at 2:31 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Tue, 1 Apr 2025 11:40:21 -0700
> Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> Hi Andrii,
>
> > It is useful to be able to access current->mm to, say, record a bunch of
> > VMA information right before the task exits (e.g., for stack
> > symbolization reasons when dealing with short-lived processes that exit
> > in the middle of profiling session). We currently do have
> > trace_sched_process_exit() in the exit path, but it is called a bit too
> > late, after exit_mm() resets current->mm to NULL, which makes it
> > unsuitable for inspecting and recording task's mm_struct-related data
> > when tracing process lifetimes.
>
> My fear of adding another task exit trace event is that it will get a
> bit confusing as that we now have trace_sched_process_exit() and also
> trace_task_exit() with slightly different semantics.
>
> How about adding a trace_exit_mm()? Add that to the exit_mm() code?

This is kind of the worst of both worlds, no? We still have a new
tracepoint, but this one can't tell if it's a `group_dead` situation
or not... I can pass group_dead into exit_mm(), but it will be just
for the sake of that new tracepoint.

How bad would it be to just move trace_sched_process_exit() then? (and
add group_dead there, as you mentioned)?

>
> static void exit_mm(void)
> {
> struct mm_struct *mm = current->mm;
>
> exit_mm_release(current, mm);
> trace_exit_mm(mm);
>
> ??
>
> >
> > There is a particularly suitable place, though, right after
> > taskstats_exit() is called, but before we do exit_mm(). taskstats
> > performs a similar kind of accounting that some applications do with
> > BPF, and so co-locating them seems like a good fit.
> >
> > Moving trace_sched_process_exit() a bit earlier would solve this problem
> > as well, and I'm open to that. But this might potentially change its
> > semantics a little, and so instead of risking that, I went for adding
> > a new trace_task_exit() tracepoint instead. Tracepoints have zero
> > overhead at runtime, unless actively traced, so this seems acceptable.
> >
> > Also, existing trace_sched_process_exit() tracepoint is notoriously
> > missing `group_dead` flag that is certainly useful in practice and some
> > of our production applications have to work around this. So plumb
> > `group_dead` through while at it, to have a richer and more complete
> > tracepoint.
>
> There should be no problem adding group_dead to the
> trace_sched_process_exit() trace event. Adding fields should never cause
> any user API breakage.
>
> -- Steve
>