[PATCH 04/11] signal: Move stopping for the coredump from do_exit into get_signal

From: Eric W. Biederman

Date: Fri Jun 26 2026 - 12:55:55 EST

Stopping to participate in a coredump from a kernel oops makes no
sense and is actively dangerous because the kernel is known to be
broken. Stopping for a coredump when a kernel thread exits is silly
because userspace coredumps are not generated from kernel threads.
Not stopping for a coredump in exit(2), exit_group(2), and related
userspace exits that call do_exit or do_group_exit directly is already
the current behavior of the code, as the PF_SIGNALED test in
coredump_task_exit attests.

Since only tasks that pass through get_signal and set PF_SIGNALED can
join coredumps, move stopping for coredumps into get_signal, where the
PF_SIGNALED test is unnecessary. This avoids even the potential of
stopping for coredumps in the silly or dangerous places.
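
As a rough illustration of the accounting involved (not the kernel code
itself), the userspace sketch below models the two ways a thread can
answer a pending coredump: joining from get_signal, or politely
skipping from do_exit. All model_* names are hypothetical, C11 atomics
stand in for xchg() and atomic_dec_and_test(), and the sleep until the
dumper releases the thread is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct core_thread and struct core_state. */
struct model_thread {
	int id;
	struct model_thread *next;
};

struct model_core_state {
	atomic_int nr_threads;               /* threads the dumper still waits for */
	_Atomic(struct model_thread *) head; /* stands in for dumper.next */
	bool startup_done;                   /* stands in for complete(&startup) */
};

static void model_init(struct model_core_state *cs, int nr_threads)
{
	assert(cs && nr_threads > 0);
	atomic_init(&cs->nr_threads, nr_threads);
	atomic_init(&cs->head, (struct model_thread *)NULL);
	cs->startup_done = false;
}

/* Shared by both paths: the last thread to check in wakes the dumper. */
static void model_account(struct model_core_state *cs)
{
	/* atomic_fetch_sub() returns the old value, so 1 means we reached 0. */
	if (atomic_fetch_sub(&cs->nr_threads, 1) == 1)
		cs->startup_done = true;
}

/* get_signal path: link onto the dumper's list, then check in. */
static void model_join(struct model_core_state *cs, struct model_thread *self)
{
	assert(self);
	self->next = atomic_exchange(&cs->head, self);
	model_account(cs);
	/* The real code would now sleep until the dumper releases it. */
}

/* do_exit path: check in without linking, so the dumper is not left waiting. */
static void model_skip(struct model_core_state *cs)
{
	model_account(cs);
}

/* Count the threads that actually joined the dump. */
static int model_count_joined(struct model_core_state *cs)
{
	int n = 0;

	for (struct model_thread *t = atomic_load(&cs->head); t; t = t->next)
		n++;
	return n;
}
```

In this model a thread that skips still decrements the counter, which
is the property that keeps the dumper from waiting forever on a thread
that will never stop for it.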

This can be seen to be safe by examining the few places that call do_exit:

- get_signal calling do_group_exit
Called by get_signal to terminate the userspace process. As stopping
for the coredump now happens in get_signal, the code will continue to
participate in the coredump.

- exit_group(2) calling do_group_exit

If a thread calls exit_group(2) while another thread in the same process
is performing a coredump, there is a race. The thread that wins the
race will take the lock and set SIGNAL_GROUP_EXIT. If it is the
thread that called do_group_exit then zap_threads will return -EAGAIN
and no coredump will be generated. If it is the thread that is
coredumping that wins the race, the task that called do_group_exit
will exit gracefully with an error code before the coredump begins.

Having a single thread exit just before the coredump starts is not
ideal as the semantics make no sense. (Did the group exit happen
before the coredump or did the coredump happen before the group
exit?).

Eventually I intend for group exits to flow through get_signal and
this silliness will no longer be possible. Until then the current
behavior when this race occurs is maintained.

- io_uring
Called after get_signal returns to terminate the I/O worker thread
(essentially a userspace thread that only runs kernel code) so that
additional cleanup code can be run before do_exit. As get_signal is
called prior to do_exit, the code will continue to participate in the
coredump.

- make_task_dead
Called on an unhandled kernel or hardware failure. As the failure
is unhandled any extra work has the potential to make the failure worse
so being part of a coredump is not appropriate.

- kthread_exit
Called to terminate a kernel thread. Coredumps do not exist for
kernel threads.

- call_usermodehelper_exec_async
Called to terminate a kernel thread if kernel_execve fails. As it is a
kernel thread, coredumps do not exist.

- reboot, seccomp
These calls of do_exit() are semantically direct calls of
exit(2) today. As do_exit() does not synchronize with siglock there
is no logical race between a coredump killing the thread and these
threads exiting. These threads logically exit before the coredump
happens. This is also the current behavior so there is nothing to
be concerned about with respect to userspace semantics or
regressions.
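
The exit_group(2) race described in the list above can be sketched as a
small single-threaded model. This is only an illustration under stated
assumptions: the model_* names and MODEL_SIGNAL_GROUP_EXIT are
hypothetical stand-ins, the order of the calls stands in for which
thread takes siglock first, a return of 0 from model_zap_threads()
simply means "the coredump may proceed", and the loser adopting the
winner's exit code is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_SIGNAL_GROUP_EXIT 0x1u /* hypothetical stand-in flag */

/* Hypothetical stand-in for the parts of signal_struct that matter here. */
struct model_signal {
	unsigned int flags;  /* protected by siglock in the kernel */
	int group_exit_code;
	int core_state_set;  /* nonzero once a coredump has been claimed */
};

/* Coredumping thread: claim the group exit, or back off with -EAGAIN. */
static int model_zap_threads(struct model_signal *sig, int exit_code)
{
	assert(sig);
	if (sig->flags & MODEL_SIGNAL_GROUP_EXIT)
		return -EAGAIN; /* a group exit won: no coredump */
	sig->flags |= MODEL_SIGNAL_GROUP_EXIT;
	sig->group_exit_code = exit_code;
	sig->core_state_set = 1;
	return 0;
}

/* exit_group(2) caller: returns the code the caller ends up exiting with. */
static int model_do_group_exit(struct model_signal *sig, int exit_code)
{
	assert(sig);
	if (sig->flags & MODEL_SIGNAL_GROUP_EXIT)
		return sig->group_exit_code; /* lost the race */
	sig->flags |= MODEL_SIGNAL_GROUP_EXIT;
	sig->group_exit_code = exit_code;
	return exit_code;
}
```

Whichever function runs first sets the flag, and the other one observes
it and backs off, which mirrors the two outcomes described for the race.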

Moving the coredump stop for userspace threads that did not dequeue
the coredumping signal from do_exit into get_signal is in general
safe, because the coredump in the single threaded case completely
happens in get_signal. The code movement ensures that a
multi-threaded coredump will not have any issues caused by the
additional threads stopping after some amount of cleanup has been done.

The coredump code is robust to all kinds of userspace changes
happening in parallel, as multiple processes can share a mm. This
makes it safe to perform the coredump before the io_uring cleanup
happens, as io_uring can't do anything another process sharing the mm
would not be doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
---
fs/coredump.c | 25 ++++++++++++++++++++++++-
include/linux/coredump.h | 2 ++
kernel/exit.c | 35 +++++++----------------------------
kernel/signal.c | 5 +++++
mm/oom_kill.c | 2 +-
5 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index bb6fdb1f458e..96801792a80e 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -521,6 +521,29 @@ static int zap_threads(struct task_struct *tsk,
return nr;
}

+void coredump_join(struct core_state *core_state)
+{
+ /* Stop and join the in-progress coredump */
+ struct core_thread self;
+
+ self.task = current;
+ self.next = xchg(&core_state->dumper.next, &self);
+ /*
+ * Implies mb(), the result of xchg() must be visible
+ * to core_state->dumper.
+ */
+ if (atomic_dec_and_test(&core_state->nr_threads))
+ complete(&core_state->startup);
+
+ for (;;) {
+ set_current_state(TASK_IDLE|TASK_FREEZABLE);
+ if (!self.task) /* see coredump_finish() */
+ break;
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+}
+
static int coredump_wait(int exit_code, struct core_state *core_state)
{
struct task_struct *tsk = current;
@@ -567,7 +590,7 @@ static void coredump_finish(bool core_dumped)
next = curr->next;
task = curr->task;
/*
- * see coredump_task_exit(), curr->task must not see
+ * see coredump_join(), curr->task must not see
* ->task == NULL before we read ->next.
*/
smp_mb();
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 68861da4cf7c..c183c95f9063 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -43,6 +43,7 @@ extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
extern int dump_align(struct coredump_params *cprm, int align);
int dump_user_range(struct coredump_params *cprm, unsigned long start,
unsigned long len);
+extern void coredump_join(struct core_state *core_state);
extern void vfs_coredump(const kernel_siginfo_t *siginfo);

/*
@@ -63,6 +64,7 @@ extern void vfs_coredump(const kernel_siginfo_t *siginfo);
#define coredump_report_failure(fmt, ...) __COREDUMP_PRINTK(KERN_WARNING, fmt, ##__VA_ARGS__)

#else
+static inline void coredump_join(struct core_state *core_state) {}
static inline void vfs_coredump(const kernel_siginfo_t *siginfo) {}

#define coredump_report(...)
diff --git a/kernel/exit.c b/kernel/exit.c
index 4bfecf2a510d..20dfa8b2101f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -421,32 +421,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
}
}

-static void coredump_task_exit(struct task_struct *tsk,
- struct core_state *core_state)
-{
- struct core_thread self;
-
- self.task = tsk;
- if (self.task->flags & PF_SIGNALED)
- self.next = xchg(&core_state->dumper.next, &self);
- else
- self.task = NULL;
- /*
- * Implies mb(), the result of xchg() must be visible
- * to core_state->dumper.
- */
- if (atomic_dec_and_test(&core_state->nr_threads))
- complete(&core_state->startup);
-
- for (;;) {
- set_current_state(TASK_IDLE|TASK_FREEZABLE);
- if (!self.task) /* see coredump_finish() */
- break;
- schedule();
- }
- __set_current_state(TASK_RUNNING);
-}
-
#ifdef CONFIG_MEMCG
/* drops tasklist_lock if succeeds */
static bool __try_to_set_owner(struct task_struct *tsk, struct mm_struct *mm)
@@ -889,8 +863,13 @@ static void synchronize_group_exit(struct task_struct *tsk, long code)
core_state = signal->core_state;
spin_unlock_irq(&sighand->siglock);

- if (unlikely(core_state))
- coredump_task_exit(tsk, core_state);
+ /*
+ * Decrement ->nr_threads and possibly complete
+ * core_state->startup to politely skip participating in any
+ * pending coredumps.
+ */
+ if (unlikely(core_state) && atomic_dec_and_test(&core_state->nr_threads))
+ complete(&core_state->startup);
}

void __noreturn do_exit(long code)
diff --git a/kernel/signal.c b/kernel/signal.c
index d111b779cbdb..c211b520982f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2864,6 +2864,7 @@ bool get_signal(struct ksignal *ksig)

for (;;) {
bool group_exit_needed = false;
+ struct core_state *core_state;
struct k_sigaction *ka;
enum pid_type type;
int exit_code = 0;
@@ -3022,6 +3023,7 @@ bool get_signal(struct ksignal *ksig)
}
}
fatal:
+ core_state = signal->core_state;
spin_unlock_irq(&sighand->siglock);
if (unlikely(cgroup_task_frozen(current)))
cgroup_leave_frozen(true);
@@ -3041,6 +3043,9 @@ bool get_signal(struct ksignal *ksig)
* that value and ignore the one we pass it.
*/
vfs_coredump(&ksig->info);
+ } else if (core_state) {
+ /* Wait for the coredump to happen */
+ coredump_join(core_state);
}

/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5f372f6e26fa..ff9d59963561 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -840,7 +840,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)

/*
* A coredumping process may sleep for an extended period in
- * coredump_task_exit(), so the oom killer cannot assume that
+ * get_signal(), so the oom killer cannot assume that
* the process will promptly exit and release memory.
*/
if (sig->core_state)
--
2.41.0