Re: [PATCH 5/6] coredump: Don't perform any cleanups before dumping core

From: Kees Cook
Date: Fri Sep 24 2021 - 14:51:34 EST


On Thu, Sep 23, 2021 at 07:11:39PM -0500, Eric W. Biederman wrote:
>
> Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
> before PTRACE_EVENT_EXIT, and before any cleanup work for a task
> happens. This ensures that an accurate copy of the process can be
> captured in the coredump as no cleanup for the process happens before
> the coredump completes. This also ensures that PTRACE_EVENT_EXIT
> will not be visited by any thread until the coredump is complete.
>
> Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
> coredump_task_exit can be recognized and ignored in zap_process.
>
> Now that all of the coredumping happens before exit_mm remove code to
> test for a coredump in progress from mm_release.
>
> Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
> The other tests in may_ptrace_stop all concern avoiding stopping
> during a coredump. These tests are no longer necessary as it is now
> guaranteed that fatal_signal_pending will be set if the code enters
> ptrace_stop during a coredump. The code in ptrace_stop is guaranteed
> not to stop if fatal_signal_pending returns true.
>
> Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
> ptrace_stop without fatal_signal_pending being true, as signals are
> dequeued in get_signal before calling do_exit. This is no longer
> an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
> until after the coredump completes.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> ---
> fs/coredump.c | 8 ++++----
> include/linux/sched.h | 1 +
> kernel/exit.c | 19 ++++++++++++-------
> kernel/fork.c | 3 +--
> kernel/signal.c | 27 +--------------------------
> mm/oom_kill.c | 2 +-
> 6 files changed, 20 insertions(+), 40 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 5e0e08a7fb9b..d576287fb88b 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -359,7 +359,7 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
>
> for_each_thread(start, t) {
> task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> - if (t != current && t->mm) {
> + if (t != current && !(t->flags & PF_POSTCOREDUMP)) {

PF_POSTCOREDUMP means "my exit path is done with anything associated
with coredumping, including the case where no core was dumped at all",
yes? i.e. it's a task-lifetime mark, not a "did this task actually dump
core" mark?
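
If so, that matches my reading of the new ordering in do_exit() (a rough
sketch of what this patch does, not the actual code):

    do_exit(code)
        coredump_task_exit(tsk);
            mmap_read_lock(mm);
            tsk->flags |= PF_POSTCOREDUMP;  /* no future dump includes me */
            core_state = mm->core_state;
            mmap_read_unlock(mm);
            if (core_state)
                /* wait here until the dumper's coredump_finish() */
        ptrace_event(PTRACE_EVENT_EXIT, code);
        ...
        exit_mm();

i.e. the flag means "too late to include this task in a coredump",
regardless of whether a coredump actually happened.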

> sigaddset(&t->pending.signal, SIGKILL);
> signal_wake_up(t, 1);
> nr++;
> @@ -404,8 +404,8 @@ static int zap_threads(struct task_struct *tsk, struct mm_struct *mm,
> *
> * do_exit:
> * The caller holds mm->mmap_lock. This means that the task which
> - * uses this mm can't pass coredump_exit_mm(), so it can't exit or
> - * clear its ->mm.
> + * uses this mm can't pass coredump_task_exit(), so it can't exit
> + * or clear its ->mm.
> *
> * de_thread:
> * It does list_replace_rcu(&leader->tasks, &current->tasks),
> @@ -500,7 +500,7 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
> next = curr->next;
> task = curr->task;
> /*
> - * see coredump_exit_mm(), curr->task must not see
> + * see coredump_task_exit(), curr->task must not see
> * ->task == NULL before we read ->next.
> */
> smp_mb();
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..f3741f23935e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1664,6 +1664,7 @@ extern struct pid *cad_pid;
> #define PF_VCPU 0x00000001 /* I'm a virtual CPU */
> #define PF_IDLE 0x00000002 /* I am an IDLE thread */
> #define PF_EXITING 0x00000004 /* Getting shut down */
> +#define PF_POSTCOREDUMP 0x00000008 /* Coredumps should ignore this task */

This might need some bikeshedding. It happens that the coredump code
(zap_process(), actually) will ignore it, but I think what's being
described is "this process has reached a point of no return on the exit
path, where coredumping is either done or never happened".
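
Maybe spell that out in the comment, e.g. (just a strawman, feel free to
bikeshed further):

    #define PF_POSTCOREDUMP    0x00000008    /* Past the point of no return for coredumps */

so the flag describes the task's state rather than what zap_process()
happens to do with it.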

> #define PF_IO_WORKER 0x00000010 /* Task is an IO worker */
> #define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */
> #define PF_FORKNOEXEC 0x00000040 /* Forked but didn't exec */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index cb1619d8fd64..774e6b5061b8 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -339,23 +339,29 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
> }
> }
>
> -static void coredump_exit_mm(struct mm_struct *mm)
> +static void coredump_task_exit(struct task_struct *tsk)
> {
> struct core_state *core_state;
> + struct mm_struct *mm;
> +
> + mm = tsk->mm;
> + if (!mm)
> + return;
>
> /*
> * Serialize with any possible pending coredump.
> * We must hold mmap_lock around checking core_state
> - * and clearing tsk->mm. The core-inducing thread
> + * and setting PF_POSTCOREDUMP. The core-inducing thread
> * will increment ->nr_threads for each thread in the
> - * group with ->mm != NULL.
> + * group without PF_POSTCOREDUMP set.
> */
> + mmap_read_lock(mm);
> + tsk->flags |= PF_POSTCOREDUMP;

What are the races possible here?

- 2 threads exiting, but neither has core_state set: no change, looks ok.
- 1 thread exiting while another thread is dumping core: core_state
  exists, so the exiting thread enters the loop to wait for a wakeup?
  If the dumping thread enters zap_process(), gets as far as
  sigaddset()/signal_wake_up(), and only then does the exiting thread
  set PF_POSTCOREDUMP and go to sleep, will it wait forever? (I can't
  tell if sighand locking prevents this ordering problem... see the
  rough interleaving sketched below.)
- 2 threads dumping -- similar race to the above? I suspect I'm missing
  something, as I'd expect this case to already be handled.
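
For the second case, the interleaving I'm worried about is roughly this
(a sketch pieced together from the hunks above, not the actual code, and
assuming no other serialization I've missed):

    dumper, in zap_process():
        for_each_thread(start, t)
            /* t doesn't have PF_POSTCOREDUMP yet */
            sigaddset(&t->pending.signal, SIGKILL);
            signal_wake_up(t, 1);
            nr++;

    exiting thread, in coredump_task_exit(), after the above:
        mmap_read_lock(mm);
        tsk->flags |= PF_POSTCOREDUMP;
        core_state = mm->core_state;    /* non-NULL */
        mmap_read_unlock(mm);
        /* queue self on core_state->dumper (if PF_SIGNALED) ... */
        freezable_schedule();           /* ... and sleep: what wakes this up? */

i.e. the thread was counted by the dumper before it marked itself
PF_POSTCOREDUMP, and I'd like to convince myself coredump_finish() is
still guaranteed to wake it.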

-Kees

> core_state = mm->core_state;
> + mmap_read_unlock(mm);
> if (core_state) {
> struct core_thread self;
>
> - mmap_read_unlock(mm);
> -
> self.task = current;
> if (self.task->flags & PF_SIGNALED)
> self.next = xchg(&core_state->dumper.next, &self);
> @@ -375,7 +381,6 @@ static void coredump_exit_mm(struct mm_struct *mm)
> freezable_schedule();
> }
> __set_current_state(TASK_RUNNING);
> - mmap_read_lock(mm);
> }
> }
>
> @@ -480,7 +485,6 @@ static void exit_mm(void)
> return;
> sync_mm_rss(mm);
> mmap_read_lock(mm);
> - coredump_exit_mm(mm);
> mmgrab(mm);
> BUG_ON(mm != current->active_mm);
> /* more a memory barrier than a real lock */
> @@ -768,6 +772,7 @@ void __noreturn do_exit(long code)
> profile_task_exit(tsk);
> kcov_task_exit(tsk);
>
> + coredump_task_exit(tsk);
> ptrace_event(PTRACE_EVENT_EXIT, code);
>
> validate_creds_for_do_exit(tsk);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 38681ad44c76..9bd9f2da9e41 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1392,8 +1392,7 @@ static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
> * purposes.
> */
> if (tsk->clear_child_tid) {
> - if (!(tsk->signal->flags & SIGNAL_GROUP_COREDUMP) &&
> - atomic_read(&mm->mm_users) > 1) {
> + if (atomic_read(&mm->mm_users) > 1) {
> /*
> * We don't check the error code - if userspace has
> * not set up a proper pointer then tough luck.
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c9759ff2cb43..b0db80acc6ef 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2158,31 +2158,6 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
> spin_unlock_irqrestore(&sighand->siglock, flags);
> }
>
> -static inline bool may_ptrace_stop(void)
> -{
> - if (!likely(current->ptrace))
> - return false;
> - /*
> - * Are we in the middle of do_coredump?
> - * If so and our tracer is also part of the coredump stopping
> - * is a deadlock situation, and pointless because our tracer
> - * is dead so don't allow us to stop.
> - * If SIGKILL was already sent before the caller unlocked
> - * ->siglock we must see ->core_state != NULL. Otherwise it
> - * is safe to enter schedule().
> - *
> - * This is almost outdated, a task with the pending SIGKILL can't
> - * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
> - * after SIGKILL was already dequeued.
> - */
> - if (unlikely(current->mm->core_state) &&
> - unlikely(current->mm == current->parent->mm))
> - return false;
> -
> - return true;
> -}
> -
> -
> /*
> * This must be called with current->sighand->siglock held.
> *
> @@ -2263,7 +2238,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
>
> spin_unlock_irq(&current->sighand->siglock);
> read_lock(&tasklist_lock);
> - if (may_ptrace_stop()) {
> + if (likely(current->ptrace)) {
> /*
> * Notify parents of the stop.
> *
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 295c8bdfd6c8..7877c755ab37 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -788,7 +788,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
>
> /*
> * A coredumping process may sleep for an extended period in
> - * coredump_exit_mm(), so the oom killer cannot assume that
> + * coredump_task_exit(), so the oom killer cannot assume that
> * the process will promptly exit and release memory.
> */
> if (sig->flags & SIGNAL_GROUP_COREDUMP)
> --
> 2.20.1
>

--
Kees Cook