Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping

From: Xin Zhao

Date: Thu Jun 25 2026 - 04:52:16 EST

On Thu, 25 Jun 2026 09:28:08 +0200 Christian Brauner <brauner@xxxxxxxxxx> wrote:

> > + coredump_pre_exit=
> > + [KNL] Change the default value for
> > + /proc/<pid>/coredump_pre_exit.
> > + See also Documentation/filesystems/proc.rst.
>
> Nah, we're not doing a separate file for this. That makes no sense
> whatsoever. I've already explained this in the first mail. There are
> effectively three modes:
>
> (1) dump to a file
> (2) spawn super-privileged usermode helper process connect coredumping
> process and said helper via pipe
> (3) coredumping process connects to AF_UNIX socket
>
> Parameterize (1) and (2) via a command line arguments. I strongly
> suspect you're using some AI tooling so it should be able to figure out
> how this was done in the past.
>
> (3) can be extended by just introducing a new flag value for struct
> coredump_req. That is also illustrated by previous work.
>
> We're not spreading procfs files. It's terrible api design especially
> for security sensitive changes.

The coredump socket approach is easier to implement because it allows for
interaction between the server and client, enabling the customization of
protocols. However, for the coredump file method, I can only think of
defining "r" and "R" through core_pattern to release flock and file-backed
shared data in advance. I'm unsure if this is feasible, as it changes the
original definition of core_pattern.

Regarding the coredump pipe, there is also a lack of a mechanism for the
pipe program to notify the coredump process, so it might still require
adding "r" and "R" at the end of core_pattern to indicate this, allowing
the coredump process to handle the early release on its own. I'm not sure
if my understanding is correct.

Even if the coredump pipe program obtains the file pointer from the process
that generated the coredump, it cannot reduce the reference count of the
file (which I understand is a very bad attempt). Since it cannot decrease
the reference count of the file, the early release must still be performed
by the task that generated the coredump. Given this situation, it seems
that we indeed need to use core_pattern for marking. I've thought for a
long time about more suitable solutions, but I haven't come up with any.

> > +#ifndef O_TMPCLOS
> > +#define O_TMPCLOS 0x80000000 /* tag need close, temporarily used */
> > +#endif
>
> Sorry, not going to happen. This doesn't not justify the addition of a
> new uapi value at all.

OK, if I use it at last, I will not put it in user header file.

> > +
> > +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
>
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

I'll get rid of the attempt to add a new boot-up argument for this feature.

> [Severity: High]
> Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
> iteration invalidate the outer iterator? The loop traverses the maple tree
> using the iterator vmi. However, do_munmap() creates its own internal
> VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
> iterator is not updated to reflect these structural changes, its cached
> state becomes stale, which can lead to a use-after-free when vma_next()
> is subsequently called.
>
> via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@xxxxxxx

When executing this traversal logic, we have already acquired a lock, and
the process has been frozen. The traversal logic goes from start to finish.
Are you sure that this approach could still have issues?

> [Severity: High]
> Is it safe to iterate the file descriptor table without holding
> rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
> kills other threads, concurrent threads can still trigger expand_files(),
> which replaces the fdt and frees the old one after an RCU grace period.

Since the process has already been frozen, shouldn't we not need to consider
such concurrency issues?

> [Severity: Medium]
> Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> hold file->f_lock.
>
> Also, if a file has duplicated file descriptors (e.g., via dup()), will
> clearing O_TMPCLOS here prematurely skip the closure of the remaining
> descriptors? When encountering the duplicated descriptor later, the flag
> will already be cleared, leaving the shared file actively referenced.

> [Severity: Medium]
> Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> hold file->f_lock.
>
> Also, if a file has duplicated file descriptors (e.g., via dup()), will
> clearing O_TMPCLOS here prematurely skip the closure of the remaining
> descriptors? When encountering the duplicated descriptor later, the flag
> will already be cleared, leaving the shared file actively referenced.

Currently, this flag will only be used by the logic we added, so I believe
there won't be any issues.

Thanks
Xin Zhao