Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state

From: Eric W. Biederman
Date: Sat Jun 25 2022 - 12:35:42 EST


Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes:

> On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
>> Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes:
>>
>> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
>> >> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> >>
>> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
>> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
>> >> state exists only in task->__state and nowhere else.
>> >>
>> >> There's two spots of bother with this:
>> >>
>> >> - PREEMPT_RT has task->saved_state which complicates matters,
>> >> meaning task_is_{traced,stopped}() needs to check an additional
>> >> variable.
>> >>
>> >> - An alternative freezer implementation that itself relies on a
>> >> special TASK state would loose TASK_TRACED/TASK_STOPPED and will
>> >> result in misbehaviour.
>> >>
>> >> As such, add additional state to task->jobctl to track this state
>> >> outside of task->__state.
>> >>
>> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
>> >>
>> >> --EWB
>> >> * didn't add a unnecessary newline in signal.h
>> >> * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
>> >> instead of in signal_wake_up_state. This prevents the clearing
>> >> of TASK_STOPPED and TASK_TRACED from getting lost.
>> >> * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
>> >
>> > Hi Eric, Peter,
>> >
>> > On s390 this patch triggers warning at kernel/ptrace.c:272 when
>> > kill_child testcase from strace tool is repeatedly used (the source
>> > is attached for reference):
>> >
>> > while :; do
>> > strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
>> > done
>> >
>> > It normally takes few minutes to cause the warning in -rc3, but FWIW
>> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of
>> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
>> >
>> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
>> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
>> > observe a child in __TASK_TRACED state. Could you please comment here?
>> >
>>
>> For clarity the warning is that the child is not in __TASK_TRACED state.
>>
>> The code is waiting for the code to stop in the scheduler in the
>> __TASK_TRACED state so that it can safely read and change the
>> processes state. Some of that state is not even saved until the
>> process is scheduled out so we have to wait until the process
>> is stopped in the scheduler.
>
> So I assume (checked actually) the return 0 below from kernel/sched/core.c:
> wait_task_inactive() is where it bails out:
>
> 3303 while (task_running(rq, p)) {
> 3304 if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> 3305 return 0;
> 3306 cpu_relax();
> 3307 }
>
> Yet, the child task is always found in __TASK_TRACED state (as seen
> in crash dumps):
>
>> 101447 11342 13 ce3a8100 RU 0.0 10040 4412 strace
> 101450 101447 0 bb04b200 TR 0.0 2272 1136 kill_child
> 108261 101447 2 d0b10100 TR 0.0 2272 532 kill_child
> crash> task bb04b200 __state
> PID: 101450 TASK: bb04b200 CPU: 0 COMMAND: "kill_child"
> __state = 8,
>
> crash> task d0b10100 __state
> PID: 108261 TASK: d0b10100 CPU: 2 COMMAND: "kill_child"
> __state = 8,
>

I haven't gotten as far as reproducing this but I have started giving
this issue some thought.

This entire thing smells like a memory barrier is missing somewhere.
However by definition the lock implementations in linux provide all the
needed memory barriers, and in the ptrace_stop and ptrace_check_attach
path I don't see cases where these values are sampled outside of a lock
except in wait_task_inactive. Does doing that perhaps require a
barrier?

The two things I can think of that could shed light on what is going on
is enabling lockdep, to enable the debug check in signal_wake_up_state
and verifying bits of state that should be constant while the task
is frozen for ptrace are indeed constant when task is frozen for ptrace.
Something like my patch below.

If you could test that when you have a chance that would help narrow
down what is going on.

Thank you,
Eric

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 156a99283b11..6467a2b1c3bc 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -268,9 +268,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
}
read_unlock(&tasklist_lock);

- if (!ret && !ignore_state &&
- WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)))
+ if (!ret && !ignore_state) {
+ WARN_ON_ONCE(!(child->jobctl & JOBCTL_PTRACE_FROZEN));
+ WARN_ON_ONCE(!(child->joctctl & JOBCTL_TRACED));
+ WARN_ON_ONCE(READ_ONCE(child->__state) != __TASK_TRACED);
+ WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED));
ret = -ESRCH;
+ }

return ret;
}