Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping

From: Christian Brauner

Date: Thu Jun 25 2026 - 03:28:38 EST

> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
>
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can
> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.
>
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.
>
> Signed-off-by: Xin Zhao <jackzxcui1989@xxxxxxx>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d450861e..bc6d3859f874 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
> /proc/<pid>/coredump_filter.
> See also Documentation/filesystems/proc.rst.
>
> + coredump_pre_exit=
> + [KNL] Change the default value for
> + /proc/<pid>/coredump_pre_exit.
> + See also Documentation/filesystems/proc.rst.

Nah, we're not doing a separate file for this. That makes no sense
whatsoever. I've already explained this in the first mail. There are
effectively three modes:

(1) dump to a file
(2) spawn a super-privileged usermode helper process and connect the
    coredumping process to said helper via a pipe
(3) the coredumping process connects to an AF_UNIX socket

Parameterize (1) and (2) via command line arguments. I strongly
suspect you're using some AI tooling so it should be able to figure out
how this was done in the past.

(3) can be extended by just introducing a new flag value for struct
coredump_req. That is also illustrated by previous work.
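
To make the three modes concrete, here is a minimal userspace sketch of
how they might be told apart from the core_pattern string. It assumes
that a leading '|' selects the pipe helper and a leading '@' selects the
AF_UNIX socket, which is not spelled out in this mail. The sketch_* names
are hypothetical and this is not the kernel's parser.

```c
/* Hypothetical names for illustration only; not the kernel's enum. */
enum sketch_core_type {
	SKETCH_CORE_FILE,	/* (1) dump to a file */
	SKETCH_CORE_PIPE,	/* (2) usermode helper fed via a pipe */
	SKETCH_CORE_SOCK,	/* (3) connect to an AF_UNIX socket */
};

/*
 * Classify a core_pattern-style string by its first character.
 * Assumption: '|' means pipe helper, '@' means socket, anything
 * else is treated as a file name template.
 */
static enum sketch_core_type sketch_classify(const char *pattern)
{
	switch (pattern[0]) {
	case '|':
		return SKETCH_CORE_PIPE;
	case '@':
		return SKETCH_CORE_SOCK;
	default:
		return SKETCH_CORE_FILE;
	}
}
```

Under these assumptions, any per-mode option rides along with the
pattern or the socket request instead of living in a new procfs file.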

We're not spreading procfs files. It's terrible API design, especially
for security-sensitive changes.

> +static void coredump_pre_exit(void)
> +{
> + struct task_struct *tsk = current;
> + unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> + if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> + return;
> +
> + /*
> + * Set O_TMPCLOS of file f_flags if file needs to be closed.
> + */
> + if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> + !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> + exit_mmap_mapped_shared(tsk->mm);
> +
> + /*
> + * Check O_TMPCLOS of file f_flags to close file and clear it.
> + */
> + exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
> static int coredump_wait(int exit_code, struct core_state *core_state)
> {
> struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
> return;
> }
>
> + coredump_pre_exit();
> +
> switch (cn->core_type) {
> case COREDUMP_FILE:
> if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b162d0..a58ffffcc31d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
> #include <linux/file_ref.h>
> #include <net/sock.h>
> #include <linux/init_task.h>
> +#include <linux/filelock.h>
>
> #include "internal.h"
>
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
> }
> }
>
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> + struct files_struct *files = tsk->files;
> + struct fdtable *fdt;
> + struct file *file;
> + unsigned int i, j = 0;
> +
> + if (!files)
> + return;
> +
> + fdt = rcu_dereference_raw(files->fdt);
> + for (;;) {
> + unsigned long set;
> +
> + i = j * BITS_PER_LONG;
> + if (i >= fdt->max_fds)
> + break;
> + set = fdt->open_fds[j++];
> + while (set) {
> + if (!(set & 1))
> + goto next_fd;
> + file = fdt->fd[i];
> + if (!file)
> + goto next_fd;
> + if (file->f_flags & O_TMPCLOS) {
> + file->f_flags &= ~O_TMPCLOS;
> + goto close_fd;
> + }
> + if (!checkflock)
> + goto next_fd;
> + if (!vfs_inode_has_locks(file_inode(file)))
> + goto next_fd;
> +
> +close_fd:
> + fdt->fd[i] = NULL;
> + filp_close(file, files);
> + cond_resched();
> +
> +next_fd:
> + i++;
> + set >>= 1;
> + }
> + }
> +}
> +
> struct files_struct init_files = {
> .count = ATOMIC_INIT(1),
> .fdt = &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c894..99b5f219f7fa 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
> .write = proc_coredump_filter_write,
> .llseek = generic_file_llseek,
> };
> +
> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct task_struct *task = get_proc_task(file_inode(file));
> + struct mm_struct *mm;
> + char buffer[PROC_NUMBUF];
> + size_t len;
> + int ret;
> +
> + if (!task)
> + return -ESRCH;
> +
> + ret = 0;
> + mm = get_task_mm(task);
> + if (mm) {
> + unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> + len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> + ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> + MMF_DUMP_PRE_EXIT_SHIFT));
> + mmput(mm);
> + ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> + }
> +
> + put_task_struct(task);
> +
> + return ret;
> +}
> +
> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> + const char __user *buf,
> + size_t count,
> + loff_t *ppos)
> +{
> + struct task_struct *task;
> + struct mm_struct *mm;
> + unsigned int val;
> + int ret;
> + int i;
> + unsigned long mask;
> +
> + ret = kstrtouint_from_user(buf, count, 0, &val);
> + if (ret < 0)
> + return ret;
> +
> + ret = -ESRCH;
> + task = get_proc_task(file_inode(file));
> + if (!task)
> + goto out_no_task;
> +
> + mm = get_task_mm(task);
> + if (!mm)
> + goto out_no_mm;
> + ret = 0;
> +
> + for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
> + if (val & mask)
> + mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> + else
> + mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> + }
> +
> + mmput(mm);
> + out_no_mm:
> + put_task_struct(task);
> + out_no_task:
> + if (ret < 0)
> + return ret;
> + return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> + .read = proc_coredump_pre_exit_read,
> + .write = proc_coredump_pre_exit_write,
> + .llseek = generic_file_llseek,
> +};
> #endif
>
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> #endif
> #ifdef CONFIG_ELF_CORE
> REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> + REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
> #endif
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> ONE("io", S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9dbd..dfd4717c7e3e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
> extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
> extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);
> bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, bool write);
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6a30..0555aaf50001 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
> (BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
> BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK 11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT (MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS 2
> +#define MMF_DUMP_PRE_EXIT_MASK \
> + (((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
> +
> #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
> # define MMF_DUMP_MASK_DEFAULT_ELF BIT(MMF_DUMP_ELF_HEADERS)
> #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cffc9..b4becbf6c0eb 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
> extern __noreturn void do_group_exit(int);
>
> extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
> extern void exit_itimers(struct task_struct *);
>
> extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..360604d653b4 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
> #define O_NDELAY O_NONBLOCK
> #endif
>
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS 0x80000000 /* tag need close, temporarily used */
> +#endif

Sorry, not going to happen. This doesn't justify the addition of a
new uapi value at all.

I'm also including various Sashiko comments:

sashiko.dev <sashiko@xxxxxxxxxxx>:

[Severity: Medium]
Is it safe to expose an internal, temporary kernel flag in a UAPI header?
Userspace applications could intentionally or accidentally pass O_TMPCLOS to
open(), which might permanently pollute the userspace ABI and trigger
unexpected behavior during a coredump.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@xxxxxxx

> +
> #define F_DUPFD 0 /* dup */
> #define F_GETFD 1 /* get close_on_exec */
> #define F_SETFD 2 /* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448234..84f1ee7f32cf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>
> __setup("coredump_filter=", coredump_filter_setup);
>
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> + default_dump_pre_exit =
> + (simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> + MMF_DUMP_PRE_EXIT_MASK;
> + return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);

This makes no sense. I think you really need to sit down and think about
a design for this that doesn't introduce state machinery for boot, mm,
and the VFS in one shot to solve a fringe problem...

sashiko.dev <sashiko@xxxxxxxxxxx>:

[Severity: High]
Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
iteration invalidate the outer iterator? The loop traverses the maple tree
using the iterator vmi. However, do_munmap() creates its own internal
VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
iterator is not updated to reflect these structural changes, its cached
state becomes stale, which can lead to a use-after-free when vma_next()
is subsequently called.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@xxxxxxx

sashiko.dev <sashiko@xxxxxxxxxxx>:

[Severity: High]
Is it safe to iterate the file descriptor table without holding
rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
kills other threads, concurrent threads can still trigger expand_files(),
which replaces the fdt and frees the old one after an RCU grace period.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@xxxxxxx

sashiko.dev <sashiko@xxxxxxxxxxx>:

[Severity: Medium]
Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
of file->f_flags risks losing concurrent fcntl() updates since it doesn't
hold file->f_lock.

Also, if a file has duplicated file descriptors (e.g., via dup()), will
clearing O_TMPCLOS here prematurely skip the closure of the remaining
descriptors? When encountering the duplicated descriptor later, the flag
will already be cleared, leaving the shared file actively referenced.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@xxxxxxx
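
The dup() concern in the last comment can be reproduced with a small
userspace model. This is a sketch only: struct sketch_file, SKETCH_TMPCLOS
and sketch_close_marked() are hypothetical stand-ins for struct file,
O_TMPCLOS and the marked-file branch of the quoted exit_files_pre_exit()
loop, and the flock branch is left out.

```c
#include <stddef.h>

#define SKETCH_TMPCLOS 0x80000000u	/* stand-in for O_TMPCLOS */

/* Stand-in for struct file: flags are shared by every slot that
 * points at the same object, just like dup()ed descriptors. */
struct sketch_file {
	unsigned int f_flags;
	int refs;		/* number of table slots referencing it */
};

/*
 * Model of the marked-file branch: a slot is closed only if its file
 * still carries the mark, and the mark is cleared on first sight.
 * Returns the number of slots closed.
 */
static int sketch_close_marked(struct sketch_file **fd, size_t n)
{
	int closed = 0;

	for (size_t i = 0; i < n; i++) {
		struct sketch_file *f = fd[i];

		if (!f || !(f->f_flags & SKETCH_TMPCLOS))
			continue;
		f->f_flags &= ~SKETCH_TMPCLOS;	/* mark gone for later slots */
		fd[i] = NULL;
		f->refs--;
		closed++;
	}
	return closed;
}
```

With two slots pointing at one marked file, only the first slot is
closed and the file keeps one reference. Since flock locks belong to the
open file description, a surviving descriptor of that kind would also
keep such a lock held.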

--
Christian Brauner <brauner@xxxxxxxxxx>