[PATCH v5] coredump: Add bit 9 of coredump_filter for pre-exit files before dumping
From: Xin Zhao
Date: Tue Jun 30 2026 - 04:04:12 EST
A coredump typically takes seconds or even longer to complete. If we
happen to hold a write lock with flock just before triggering the
coredump, that write lock will not be released during the entire coredump
process. As a result, other processes attempting to acquire the same write
lock may experience significant delays. Another typical scenario is that
some custom management modules for shared memory also need to release the
reference counts of the related buffers as soon as possible, rather than
waiting until the coredump is complete.
Add a new bit (bit 9) to coredump_filter that controls whether the fd list
is kept for the dump. It is set by default because tools like
systemd-coredump go through the fds. Other coredump pipe programs, such as
minicoredump, do not use the fds by default. If you are sure that your
coredump backend does not use the fds, you can clear bit 9, which allows
file resources without VMA references to be released earlier.
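As a rough illustration of how a process might opt out, the sketch below
clears bit 9 in its own coredump_filter. It assumes a kernel with this patch
applied; the macro COREDUMP_FILTER_FD_LIST_BIT and both helper names are
hypothetical and do not come from any kernel or libc header.

```c
#include <stdio.h>

/* Hypothetical name: bit 9 of coredump_filter as proposed by this patch. */
#define COREDUMP_FILTER_FD_LIST_BIT 9

/* Return the mask with the fd list bit cleared; other bits are unchanged. */
static unsigned long clear_fd_list_bit(unsigned long mask)
{
	return mask & ~(1UL << COREDUMP_FILTER_FD_LIST_BIT);
}

/*
 * Read the calling process's filter, clear bit 9 and write it back.
 * Returns 0 on success, -1 on any I/O or parse error.
 */
static int disable_fd_list_dump(void)
{
	const char *path = "/proc/self/coredump_filter";
	unsigned long mask;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	/* The kernel prints the mask as bare hex digits, e.g. "00000033". */
	if (fscanf(f, "%lx", &mask) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "0x%lx", clear_fd_list_bit(mask));
	/* fclose() flushes, so a rejected write is reported here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Since the filter is inherited across fork, calling something like
disable_fd_list_dump() early in a launcher process would cover its children
as well.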
In fput(), check the PF_DUMPCORE task flag and do NOT defer the file release
to task work in that case; otherwise the final put will NOT run until the
coredump finishes.
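The flag test can be sketched in userspace as follows. The values mirror
PF_DUMPCORE and PF_KTHREAD from include/linux/sched.h but should be treated
as illustrative; the DEMO_ names and may_defer_fput() are hypothetical. Both
flags are grouped in parentheses because & binds tighter than | in C.

```c
#include <stdbool.h>

/* Illustrative values; the DEMO_ names are hypothetical stand-ins. */
#define DEMO_PF_DUMPCORE 0x00000200u
#define DEMO_PF_KTHREAD  0x00200000u

/*
 * Return true when the final put may be deferred to task work, i.e. the
 * caller is not in interrupt context and the task is neither a kernel
 * thread nor in the middle of dumping core.
 */
static bool may_defer_fput(unsigned int flags, bool in_irq)
{
	return !in_irq && !(flags & (DEMO_PF_KTHREAD | DEMO_PF_DUMPCORE));
}
```

Without the inner parentheses the expression would OR in a non-zero constant
and the deferred path would never be taken.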
Test Case One - flock
The test program sends SIGABRT to the program that owns the flock and
outputs the wait time (in ms) needed to acquire the flock.
The test program mallocs 500MB of heap and memsets it.
With bit 9 of coredump_filter set (the default), waitms is 11280.
With bit 9 of coredump_filter cleared, waitms is 0.
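The measurement in this test can be approximated with a helper like the one
below, which blocks on flock() and reports the elapsed time in milliseconds.
This is a sketch, not the test program used for the numbers above; the
function names are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/file.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC samples. */
static long elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (long)(b->tv_sec - a->tv_sec) * 1000L +
	       (b->tv_nsec - a->tv_nsec) / 1000000L;
}

/*
 * Block until an exclusive flock on @path is acquired, then release it.
 * Returns the wait in milliseconds, or -1 on error.
 */
static long flock_wait_ms(const char *path)
{
	struct timespec start, end;
	long ms;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &start);
	if (flock(fd, LOCK_EX) < 0) {
		close(fd);
		return -1;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	ms = elapsed_ms(&start, &end);
	flock(fd, LOCK_UN);
	close(fd);
	return ms;
}
```

Calling flock_wait_ms() from a second process right after the lock owner
receives SIGABRT would give a value comparable to waitms.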
Test Case Two - ion buffer
The test programs are an ion buffer publisher and an ion buffer subscriber.
The publisher outputs the ion buffer hold_time if the subscriber does NOT
send an ack to the publisher and does NOT release the buffer. The subscriber
triggers a coredump by itself after some time.
With bit 9 of the subscriber's coredump_filter set (the default), max hold_time is 19591ms.
With bit 9 of the subscriber's coredump_filter cleared, max hold_time is 320ms.
Signed-off-by: Xin Zhao <jackzxcui1989@xxxxxxx>
---
Change in v5:
- Do not add another bootarg for the feature,
as suggested by Christian Brauner and Lorenzo Stoakes.
Add bit 9 of coredump_filter to indicate whether the fd list is dumped.
Set bit 9 to 1 by default.
- Al Viro, Christian Brauner and Lorenzo Stoakes pointed out many problems
with the umap-related code added in v4; delete all of it, as it is
unnecessary. Reference counts for shared memory generally do not need to
be dropped through the release operation of files that have VMA
references. Traversing all threads in the process and calling
exit_files() is sufficient.
- Expand the comments and commit log,
as suggested by Pedro Falcato and Lorenzo Stoakes.
Change in v4:
- Christian pointed out that the coredump process traverses file
descriptors (fds), so certain fds should not be closed by default.
Rework the whole feature: add /proc/<pid>/coredump_pre_exit so that users
can select which resources to pre-exit; the default is NOT to pre-exit
anything.
- Mateusz suggested that walking the fd table and releasing the file locks
is reasonable. No longer release all the fds. Based on user config, at most
the flock fds and the fds corresponding to file-backed shared memory are
released.
- Link to v4: https://lore.kernel.org/all/20260624145552.70143-1-jackzxcui1989@xxxxxxx/
Change in v3:
- Add a comment and commit log text explaining the MMF_DUMP_MAPPED_SHARED
mm_flags_test() check; note that memory mapped files keep their own
separate references to the files. The case to work around is that
unlocking a flock on a file early allows other processes to lock and modify
the mapped data protected by the flock,
as suggested by Pedro Falcato.
- Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@xxxxxxx/
Change in v2:
- Drop the new fcntl API; the issue is not worth inflicting the cost on
everyone,
as suggested by Al Viro.
- Call exit_files() in coredump_wait(),
as suggested by Eric W. Biederman.
Add an MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter out cases
that need to dump file-backed shared memory.
- Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@xxxxxxx/
v1:
- Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@xxxxxxx/
---
Documentation/filesystems/proc.rst | 14 ++++++++++++--
fs/coredump.c | 21 +++++++++++++++++++++
fs/file_table.c | 7 ++++++-
include/linux/mm_types.h | 6 ++++--
4 files changed, 43 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index db6167bef..d590a1dda 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1939,6 +1939,7 @@ The following 9 memory types are supported:
- (bit 6) hugetlb shared memory
- (bit 7) DAX private memory
- (bit 8) DAX shared memory
+ - (bit 9) fd list
Note that MMIO pages such as frame buffer are never dumped and vDSO pages
are always dumped regardless of the bitmask status.
@@ -1946,13 +1947,22 @@ The following 9 memory types are supported:
Note that bits 0-4 don't affect hugetlb or DAX memory. hugetlb memory is
only affected by bit 5-6, and DAX is only affected by bits 7-8.
+ Note that bit 9 is set by default because tools like systemd-coredump go
+ through the fds. If bit 9 is cleared, files that are not referenced by
+ any VMA are released before dumping core, so file release logic such as
+ dropping a flock or releasing references to shared buffers runs much
+ earlier.
+
+The default value of coredump_filter is 0x233; this means all anonymous memory
+segments, ELF header pages, hugetlb private memory and fd list are dumped.
+
The default value of coredump_filter is 0x33; this means all anonymous memory
segments, ELF header pages and hugetlb private memory are dumped.
If you don't want to dump all shared memory segments attached to pid 1234,
-write 0x31 to the process's proc file::
+write 0x231 to the process's proc file::
- $ echo 0x31 > /proc/1234/coredump_filter
+ $ echo 0x231 > /proc/1234/coredump_filter
When a new process is created, the process inherits the bitmask status from its
parent. It is useful to set up coredump_filter before the program runs.
diff --git a/fs/coredump.c b/fs/coredump.c
index bb6fdb1f4..ed4d30916 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -521,6 +521,25 @@ static int zap_threads(struct task_struct *tsk,
return nr;
}
+/*
+ * If the fd list is not dumped, files that are not referenced by any VMA
+ * can be released before dumping core, so file release logic such as
+ * dropping a flock or releasing references to shared buffers runs much
+ * earlier. Note that do_coredump() often takes several seconds or even
+ * longer to execute.
+ */
+static void coredump_pre_exit(void)
+{
+ struct task_struct *tsk = current, *t;
+
+ if (mm_flags_test(MMF_DUMP_FD_LIST, tsk->mm))
+ return;
+
+ for_each_thread(tsk, t) {
+ exit_files(t);
+ }
+}
+
static int coredump_wait(int exit_code, struct core_state *core_state)
{
struct task_struct *tsk = current;
@@ -1124,6 +1143,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
if (cn->mask & COREDUMP_REJECT)
return;
+ coredump_pre_exit();
+
/* get us an unshared descriptor table; almost always a no-op */
/* The cell spufs coredump code reads the file descriptor tables */
if (unshare_files())
diff --git a/fs/file_table.c b/fs/file_table.c
index 16e52e7fc..399db62f6 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -566,7 +566,12 @@ static void __fput_deferred(struct file *file)
return;
}
- if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+ /*
+ * coredump_pre_exit() may release files before dumping core.
+ * task_work cannot be used in that case because the files need
+ * to be released earlier.
+ */
+ if (likely(!in_interrupt() && !(task->flags & (PF_KTHREAD | PF_DUMPCORE)))) {
init_task_work(&file->f_task_work, ____fput);
if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
return;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c7db35be6..e865edb04 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1954,14 +1954,16 @@ enum {
#define MMF_DUMP_HUGETLB_SHARED 8
#define MMF_DUMP_DAX_PRIVATE 9
#define MMF_DUMP_DAX_SHARED 10
+#define MMF_DUMP_FD_LIST 11
#define MMF_DUMP_FILTER_SHIFT MMF_DUMPABLE_BITS
-#define MMF_DUMP_FILTER_BITS 9
+#define MMF_DUMP_FILTER_BITS 10
#define MMF_DUMP_FILTER_MASK \
((BIT(MMF_DUMP_FILTER_BITS) - 1) << MMF_DUMP_FILTER_SHIFT)
#define MMF_DUMP_FILTER_DEFAULT \
(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
- BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
+ BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF | \
+ BIT(MMF_DUMP_FD_LIST))
#ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
# define MMF_DUMP_MASK_DEFAULT_ELF BIT(MMF_DUMP_ELF_HEADERS)
--
2.34.1