Re: [PATCH 1/2] tracing: Add a trace for task_exit
From: Eric W. Biederman
Date: Mon May 03 2021 - 15:03:08 EST
<Peter.Enderborg@xxxxxxxx> writes:
> On 5/3/21 6:30 PM, Eric W. Biederman wrote:
>> <Peter.Enderborg@xxxxxxxx> writes:
>>
>>> On 5/3/21 3:50 PM, Mathieu Desnoyers wrote:
>>>> ----- On May 1, 2021, at 9:11 AM, rostedt rostedt@xxxxxxxxxxx wrote:
>>>>
>>>>> On Sat, 1 May 2021 09:29:41 +0000
>>>>> <Peter.Enderborg@xxxxxxxx> wrote:
>>>>>
>>>>>> On 4/30/21 7:48 PM, Eric W. Biederman wrote:
>>>>>>> Peter Enderborg <peter.enderborg@xxxxxxxx> writes:
>>>>>>>
>>>>>>>> This is the peer functions to task_rename and task_newtask.
>>>>>>>> With this we get hole "life-cycle" of task and can easily
>>>>>>>> see short livied task and their exit status.
>>>>>>> This patch is incorrect. The location you are dealing with is not part
>>>>>>> of task exit. The location you have instrumented is part of reaping a
>>>>>>> task which can come arbitrarily long after the task exits.
>>>>>> That is what it aiming. When using this as tool for userspace you
>>>>>> would like to know when the task is done. When it no longer
>>>>>> holds any thing that might have any impact. If you think the
>>>>>> exit imply something more specific I can change the name.
>>>>>>
>>>>>> I thought exit was a good name, it is in in exit.c.
>>>>>>
>>>>>> Will the name task_done, task_finished or task_reaped work for you?
>>>>> I think "task_reaped" is probably the best name, and the most
>>>>> descriptive of what happened.
>>>> What would it provide that is not already available through the "sched_process_free"
>>>> tracepoint in delayed_put_task_struct ?
>>> For task_exit (or task_reaped)
>>>
>>> field:pid_t pid; offset:8; size:4; signed:1;
>>> field:short oom_score_adj; offset:12; size:2; signed:1;
>>> field:int exit_signal; offset:16; size:4; signed:1;
>>> field:int exit_code; offset:20; size:4; signed:1;
>>> field:int exit_state; offset:24; size:4; signed:1;
>>> field:__data_loc char[] comm; offset:28; size:4; signed:1;
>>>
>>> sched_process_free
>>> field:char comm[16]; offset:8; size:16; signed:1;
>>> field:pid_t pid; offset:24; size:4; signed:1;
>>> field:int prio; offset:28; size:4; signed:1;
>>>
>>> So information about oom_score_adj, and it's exit parameters.
>>
>> For the record returning oom_score_adj that late is not appropriate for
>> any kernel/user API. It is perfectly valid for the kernel to optimize
>> out anything that wait(2) does not return.
>>
>> If you want oom_score_adj you probably need to sample it in
>> sched_process_exit.
> That I don't understand why? oom_score_adj is part of the signal,
> why is that not intact when we run __exit_signal ?
Yes oom_score_adj lives in struct signal. The naming of __exit_signal
is simply historical at this point, not descriptive.
The function of oom_score_adj is to tell the oom kill how aggressive to
be when oom_killing functions. That stops being relevant as soon as
PF_EXITING gets set.
An optimization I have toyed with that would be completely relevant
is to have a very minimal struct zombie that would contain just the
information that wait(2) needs. Everything else about the process
can be freed when the process actually stops running.
It would make no sense to include oom_score_adj in such a struct zombie.
As such it makes it very bad choice to place oom_score_adj userspace API
that triggers when a task is reaped.
>> I periodically move things from the point a process is reaped to the
>> point where a task stops running, for both correctness and for simpler
>> maintenance. When threads were added a bunch of cleanup was added
>> to the wrong place. I certainly would not hesitate to mess with
>> oom_score_adj if changing something would make the code simpler.
>>
>> With both sched_process_free and sched_process_exit it looks like we
>> already have tracepoints everywhere they could be needed.
>> task exit.
>
> It might be where we it is needed, but it does not contain information that
> are needed for userspace. I don't see this as tool for sched issues,
> but ading information to existing ones is of course a option.
>
> However current traces is template based, and I assume it wont be
> popular to add new fields to the template, and exit reasons is not
> right for the other template use cases.
>
> I still see a "new" task moving it to do_exit make trace name more
> correct? Or is trace_task_do_exit better?
I really can't say, as I don't know much of anything about the tracing
infrastructure. I would assume in most cases with a tracepoint in place
other kinds of tracing (like bpf programs) could come into play and read
out pieces of information that are not commonly wanted.
All I really know something about is the exit code path, as I keep
slowly trying to clean it up. I plan on ignoring any tracepoint that
makes that gets in the way.
Eric