Re: [PATCH v3] coredump: exit_files() in coredump_wait() if MMF_DUMP_MAPPED_SHARED is not set

From: Mateusz Guzik

Date: Fri Jun 19 2026 - 13:33:44 EST


On Fri, Jun 19, 2026 at 03:54:26PM +0200, Christian Brauner wrote:
> > A coredump typically takes some time to complete. If we happen to hold a
> > write lock with flock just before triggering the coredump, that write lock
> > will not be released during the entire coredump process. As a result,
> > other processes attempting to acquire the same write lock may experience
> > significant delays.
> >
> > To address this, call exit_files() in the end of coredump_wait(), if
> > MMF_DUMP_MAPPED_SHARED is not set. Note that early unlocking a flock on a
> > file allows other processes to lock and modify the mapped data protected
> > by the flock.
> >
> > Signed-off-by: Xin Zhao <jackzxcui1989@xxxxxxx>
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index bb6fdb1f458e..70698d06ec9d 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -548,6 +548,13 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
> > }
> > }
> >
> > + /*
> > + * Early unlocking a flock on a file allows other processes
> > + * to lock and modify the mapped data protected by the flock.
> > + */
> > + if (!mm_flags_test(MMF_DUMP_MAPPED_SHARED, tsk->mm))
> > + exit_files(tsk);
>
> This doesn't work - at least not unconditionally. Tools like
> systemd-coredump or apport go through the fds. Specifically
> systemd-coredump does:
>
> 1) /proc/[pid]/fd/ — opendir() then, per entry, readlinkat() to get the symlink target.
> 2) /proc/[pid]/fdinfo/ — for each fd it reads the fdinfo text lines
>
> The blob is attached to the journal record as the COREDUMP_OPEN_FDS=
> field. So the open-fd list is recorded as metadata, retrievable later
> (e.g. coredumpctl info shows it).
>
> Also, iirc some clever implementations use pidfd_getfd() to preserve the
> files from a coredumping process.
>
> So you break all that - and only in some of the cases which is really
> opaque to userspace. That's not acceptable. If you only care about the
> case where you dump to a file then either special-case it to the legacy
> file coredump format or if it's generally useful make it an optional
> argument that can be passed to the coredump pipe and a new flag
> extension to the coredump socket that makes the coredumping process shed
> its file descriptors.
>

The claim is that the coredump takes "some time", without specifying what
kind of window it is (seconds? minutes?) or where that time is spent.

For example, big mmapped areas are known to be slow to dump even if they
are sparsely populated.
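
To get a mapping of that shape for profiling, something along these lines
could serve as a starting point. It is only a sketch: the helper name, the
struct and the sizes are made up for illustration, and it does not trigger
a coredump itself. It maps a large anonymous region, touches one page per
stride, and uses mincore() to report how few pages are actually resident
compared to the size of the range.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical result type, not taken from any existing tool. */
struct sparse_stats {
	size_t total_pages;	/* pages spanned by the mapping */
	size_t touched_pages;	/* pages written to by this helper */
	size_t resident_pages;	/* pages mincore() reports as resident */
};

/*
 * Map map_bytes of anonymous memory, write one byte every stride_bytes,
 * then count resident pages. Returns 0 on success, -1 on failure.
 * stride_bytes must be at least one page so each write hits a new page.
 */
int sparse_mapping_stats(size_t map_bytes, size_t stride_bytes,
			 struct sparse_stats *out)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *map, *vec;
	size_t total, off, i;

	if (page <= 0 || !out || map_bytes == 0 ||
	    stride_bytes < (size_t)page)
		return -1;

	map = mmap(NULL, map_bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	total = (map_bytes + (size_t)page - 1) / (size_t)page;
	vec = malloc(total);
	if (!vec) {
		munmap(map, map_bytes);
		return -1;
	}

	/* Populate only a handful of pages; the rest stay untouched. */
	out->touched_pages = 0;
	for (off = 0; off < map_bytes; off += stride_bytes) {
		map[off] = 1;
		out->touched_pages++;
	}

	if (mincore(map, map_bytes, vec) != 0) {
		free(vec);
		munmap(map, map_bytes);
		return -1;
	}

	out->total_pages = total;
	out->resident_pages = 0;
	for (i = 0; i < total; i++)
		out->resident_pages += vec[i] & 1;

	free(vec);
	munmap(map, map_bytes);
	return 0;
}
```

A real test program would keep the mapping alive and call abort() instead
of unmapping, then measure how long the dump takes. Note that with
transparent huge pages the resident count can exceed the touched count.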

So I would suggest profiling what exactly happens in this case. It is
virtually guaranteed that the time can be shortened, but it is plausible
that even with fixes it will still be too long.

Which brings me to the next point: the issue is not strictly that any
particular file is still referenced by the target process, but that the
lock is held on it. Perhaps it would be perfectly fine to walk the fd
table and release the locks here instead of much later?
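
As a userspace analogy only (this is not the kernel change, and the helper
and struct names are made up for illustration): dropping a flock does not
require dropping the file reference. In the sketch below a second open()
of the same file is refused the lock while the first descriptor holds it,
and gets it as soon as the first descriptor does LOCK_UN, even though that
descriptor stays open throughout.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical result type for the demo. */
struct flock_demo_result {
	int blocked_while_held;	  /* contender refused while lock was held */
	int acquired_after_unlock; /* contender got the lock after LOCK_UN */
	int holder_still_open;	  /* holder fd remained valid after LOCK_UN */
};

/*
 * path must name a file on a filesystem that supports flock().
 * Returns 0 on success, -1 if setup failed.
 */
int flock_release_demo(const char *path, struct flock_demo_result *res)
{
	int holder, contender;

	if (!path || !res)
		return -1;

	holder = open(path, O_RDWR | O_CREAT, 0600);
	if (holder < 0)
		return -1;

	/* A separate open() yields a separate open file description. */
	contender = open(path, O_RDWR);
	if (contender < 0) {
		close(holder);
		return -1;
	}

	if (flock(holder, LOCK_EX) != 0) {
		close(contender);
		close(holder);
		return -1;
	}

	/* The lock is held elsewhere: a non-blocking attempt must fail. */
	res->blocked_while_held =
		flock(contender, LOCK_EX | LOCK_NB) == -1 &&
		errno == EWOULDBLOCK;

	/* Release only the lock; the file itself stays referenced. */
	flock(holder, LOCK_UN);
	res->holder_still_open = fcntl(holder, F_GETFD) != -1;

	res->acquired_after_unlock =
		flock(contender, LOCK_EX | LOCK_NB) == 0;

	close(contender);
	close(holder);
	return 0;
}
```

With something equivalent done in the kernel, tools inspecting
/proc/[pid]/fd or using pidfd_getfd() would still see the files, while
other processes waiting on the lock would no longer be stuck behind the
dump.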