Re: [PATCH 1/2] tracing: Add a trace for task_exit

From: Peter.Enderborg
Date: Mon May 03 2021 - 15:49:52 EST


On 5/3/21 9:02 PM, Eric W. Biederman wrote:
> <Peter.Enderborg@xxxxxxxx> writes:
>
>> On 5/3/21 6:30 PM, Eric W. Biederman wrote:
>>> <Peter.Enderborg@xxxxxxxx> writes:
>>>
>>>> On 5/3/21 3:50 PM, Mathieu Desnoyers wrote:
>>>>> ----- On May 1, 2021, at 9:11 AM, rostedt rostedt@xxxxxxxxxxx wrote:
>>>>>
>>>>>> On Sat, 1 May 2021 09:29:41 +0000
>>>>>> <Peter.Enderborg@xxxxxxxx> wrote:
>>>>>>
>>>>>>> On 4/30/21 7:48 PM, Eric W. Biederman wrote:
>>>>>>>> Peter Enderborg <peter.enderborg@xxxxxxxx> writes:
>>>>>>>>
>>>>>>>>> This is the peer functions to task_rename and task_newtask.
>>>>>>>>> With this we get hole "life-cycle" of task and can easily
>>>>>>>>> see short livied task and their exit status.
>>>>>>>> This patch is incorrect. The location you are dealing with is not part
>>>>>>>> of task exit. The location you have instrumented is part of reaping a
>>>>>>>> task which can come arbitrarily long after the task exits.
>>>>>>> That is what it aiming. When using this as tool for userspace you
>>>>>>> would like to know when the task is done. When it no longer
>>>>>>> holds any thing that might have any impact. If you think the
>>>>>>> exit imply something more specific I can change the name.
>>>>>>>
>>>>>>> I thought exit was a good name, it is in in exit.c.
>>>>>>>
>>>>>>> Will the name task_done, task_finished or task_reaped work for you?
>>>>>> I think "task_reaped" is probably the best name, and the most
>>>>>> descriptive of what happened.
>>>>> What would it provide that is not already available through the "sched_process_free"
>>>>> tracepoint in delayed_put_task_struct ?
>>>> For task_exit (or task_reaped)
>>>>
>>>>         field:pid_t pid;        offset:8;       size:4; signed:1;
>>>>         field:short oom_score_adj;      offset:12;      size:2; signed:1;
>>>>         field:int exit_signal;  offset:16;      size:4; signed:1;
>>>>         field:int exit_code;    offset:20;      size:4; signed:1;
>>>>         field:int exit_state;   offset:24;      size:4; signed:1;
>>>>         field:__data_loc char[] comm;   offset:28;      size:4; signed:1;
>>>>
>>>> sched_process_free
>>>>         field:char comm[16];    offset:8;       size:16;        signed:1;
>>>>         field:pid_t pid;        offset:24;      size:4; signed:1;
>>>>         field:int prio; offset:28;      size:4; signed:1;
>>>>
>>>> So information about oom_score_adj, and it's exit parameters.
>>> For the record returning oom_score_adj that late is not appropriate for
>>> any kernel/user API. It is perfectly valid for the kernel to optimize
>>> out anything that wait(2) does not return.
>>>
>>> If you want oom_score_adj you probably need to sample it in
>>> sched_process_exit.
>> That I don't understand why?  oom_score_adj is part of the signal,
>> why is that not intact when we run __exit_signal ?
> Yes oom_score_adj lives in struct signal. The naming of __exit_signal
> is simply historical at this point, not descriptive.
>
> The function of oom_score_adj is to tell the oom kill how aggressive to
> be when oom_killing functions. That stops being relevant as soon as
> PF_EXITING gets set.

Still it relevant for the case it was oom that killed a task or not. As I see it
it must be valid until the pid is removed in proc_flush_pid.

If it was part of a oom or not is different matter.  I have a oom trace
too, but I have not sent that yet.

> An optimization I have toyed with that would be completely relevant
> is to have a very minimal struct zombie that would contain just the
> information that wait(2) needs. Everything else about the process
> can be freed when the process actually stops running.
>
> It would make no sense to include oom_score_adj in such a struct zombie.

But still you can do a call to the task_reaped in the fast path too?


> As such it makes it very bad choice to place oom_score_adj userspace API
> that triggers when a task is reaped.

if it is relevant for it's creation and renaming surely it must be
more relevant for it's killing since it is part of it's oom? However
this patch is not about oom, the oom_score_adj are used
as a prioritization in android when task are recycled. Having at as trace
parameter make it easy to monitor the activity among services.

Just select task 0-250 and you get the applications and services
that is currently used.


>
>>> I periodically move things from the point a process is reaped to the
>>> point where a task stops running, for both correctness and for simpler
>>> maintenance. When threads were added a bunch of cleanup was added
>>> to the wrong place. I certainly would not hesitate to mess with
>>> oom_score_adj if changing something would make the code simpler.
>>>
>>> With both sched_process_free and sched_process_exit it looks like we
>>> already have tracepoints everywhere they could be needed.
>>> task exit.
>> It might be where we it is needed, but it does not contain information that
>> are needed for userspace. I don't see this as tool for sched issues,
>> but ading information to existing ones is of course a option.
>>
>> However current traces is template based, and I assume it wont be
>> popular to add new fields to the template, and exit reasons is not
>> right for the other template use cases.
>>
>> I still see a "new" task moving it to do_exit make trace name more
>> correct?  Or is trace_task_do_exit better?
> I really can't say, as I don't know much of anything about the tracing
> infrastructure. I would assume in most cases with a tracepoint in place
> other kinds of tracing (like bpf programs) could come into play and read
> out pieces of information that are not commonly wanted.
>
> All I really know something about is the exit code path, as I keep
> slowly trying to clean it up. I plan on ignoring any tracepoint that
> makes that gets in the way.
>
> Eric