[PATCH] xfs: prevent close() from hanging on frozen filesystems

From: Aditya Srivastava

Date: Wed Jun 10 2026 - 09:19:04 EST

From: Aditya Prakash Srivastava <aditya.ansh182@xxxxxxxxx>

When a file with active speculative post-EOF preallocations is closed,
xfs_file_release() synchronously triggers xfs_free_eofblocks() to clean
them up. This requires allocating a write transaction with xfs_trans_alloc(),
which blocks for as long as the filesystem is frozen or in the process of
freezing, because it waits for freeze protection on the superblock.

As a result, a close() system call on a read-write file descriptor can
hang indefinitely in percpu_rwsem_wait() until the filesystem is thawed,
even if the file is closed by a non-writer process or after all writing
activity has already ceased.

This issue has been seen across multiple downstream environments and has a
long history of causing severe system disruption. For example:
- Downstream Red Hat Bugzilla 1474726 (dating back to 2017) details
complete system hangs during system backups when rsync and fsfreeze
are used. Even seemingly harmless read-only commands like
'cat /var/log/messages' would hang on close() in __sb_start_write
via xfs_free_eofblocks, requiring a hard reboot.
- Downstream LeApp integration test scenarios (e.g. systemd-rsync migration
checks) consistently hit this hang when trying to freeze the system.

Historically, XFS maintainers dismissed this behavior as NOTABUG, claiming
that close() is not a read-only operation and is expected to block since it
allocates write transactions. However, this behavior is highly disruptive.
User-space applications view close() as a resource reclamation system call,
not a write operation, and do not expect it to block. Hanging on close()
frequently triggers container healthcheck failures, systemd service
timeouts, and cluster failover cascades.

Additionally, no other major Linux filesystem (such as ext4 or btrfs)
synchronously allocates write transactions during close() system calls,
so this hang is unexpected behavior that is unique to XFS.

We can safely skip this post-EOF cleanup optimization during a filesystem
freeze because:

1. Speculative preallocation is purely a performance heuristic to prevent
fragmentation, not a requirement for file correctness or metadata
consistency. The frozen snapshot remains completely consistent and safe,
regardless of whether these post-EOF blocks are freed before or
after thaw.

2. No space is permanently leaked. Any skipped speculative preallocations
are safely preserved and will be scanned and reclaimed automatically by
the background block garbage collection (blockgc) workers once the
filesystem is thawed.

3. Precedent exists in xfs_file_release() for skipping this truncation:
it uses xfs_ilock_nowait() and silently skips the cleanup if the lock
cannot be acquired, relying on background or future cleanup to avoid
deadlocks with the mmap_lock. Skipping under fsfreeze is consistent
with this existing design.
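
The "skip rather than block" pattern described in the points above can be
sketched in userspace as follows. This is only an illustrative model, not
kernel code: struct model_inode, model_release_cleanup() and the fs_frozen
flag are hypothetical names, and a pthread mutex stands in for the inode
iolock.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-file state: a lock plus a flag for pending cleanup. */
struct model_inode {
	pthread_mutex_t iolock;
	bool has_post_eof_blocks;
};

/*
 * Best-effort cleanup on release: never block. Returns true if the cleanup
 * ran, false if it was skipped and left for a later pass.
 */
static bool model_release_cleanup(struct model_inode *ip, bool fs_frozen)
{
	/* Skip when frozen: freeing blocks would need a write transaction. */
	if (fs_frozen)
		return false;

	/* Skip when the lock is contended instead of waiting for it. */
	if (pthread_mutex_trylock(&ip->iolock) != 0)
		return false;

	ip->has_post_eof_blocks = false;	/* stand-in for freeing the blocks */
	pthread_mutex_unlock(&ip->iolock);
	return true;
}
```

In both skip cases the flag stays set, which models the preallocation being
left in place for a later cleanup pass rather than being lost.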

Note that background blockgc and inodegc workers are already explicitly
stopped during freeze (via xfs_blockgc_stop() and xfs_inodegc_stop()),
leaving the synchronous xfs_file_release() path as the sole remaining
unblocked path that could attempt write transactions on a frozen
filesystem.

Fix this hang by checking if the filesystem is writable at the
SB_FREEZE_WRITE level in xfs_file_release() and returning early if it
is frozen or freezing.
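
The check depends on the ordering of the VFS freeze levels. As a minimal
userspace sketch (not the kernel implementation), assume the level values
from include/linux/fs.h and that the writability test fails once the
superblock's freeze state has reached the requested level. The model_ names
below are hypothetical, and the real xfs_fs_writable() also rejects shut-down
and read-only mounts, which this model leaves out.

```c
#include <stdbool.h>

/* Freeze levels, mirroring the order used by the VFS in include/linux/fs.h. */
enum model_freeze_level {
	MODEL_SB_UNFROZEN = 0,		/* filesystem is fully writable */
	MODEL_SB_FREEZE_WRITE = 1,	/* new writes are blocked */
	MODEL_SB_FREEZE_PAGEFAULT = 2,	/* write page faults are blocked */
	MODEL_SB_FREEZE_FS = 3,		/* internal filesystem writers are blocked */
	MODEL_SB_FREEZE_COMPLETE = 4,	/* freeze has finished */
};

/* Hypothetical stand-in for the superblock freeze state. */
struct model_sb {
	enum model_freeze_level frozen;
};

/*
 * Model of the check: the filesystem counts as writable at `level` only
 * while the freeze has not yet reached that level.
 */
static bool model_fs_writable(const struct model_sb *sb,
			      enum model_freeze_level level)
{
	return sb->frozen < level;
}

/* Model of the early return added to the release path. */
static bool model_release_skips_cleanup(const struct model_sb *sb)
{
	return !model_fs_writable(sb, MODEL_SB_FREEZE_WRITE);
}
```

Because MODEL_SB_FREEZE_WRITE is the lowest non-zero level, testing against it
covers both a freeze in progress and a completed freeze in this model.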

A simple C reproducer demonstrating the hang (compile with -pthread):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/vfs.h>
#include <linux/fs.h>
#include <libgen.h>

volatile int close_started = 0;
volatile int close_completed = 0;

void *close_thread(void *arg) {
	int fd = *(int *)arg;
	close_started = 1;
	close(fd);	/* hangs here on an unpatched kernel while frozen */
	close_completed = 1;
	return NULL;
}

int main(int argc, char *argv[]) {
	struct statfs sfs;
	if (argc < 2) return 1;

	/* Check the parent directory: the test file may not exist yet. */
	char *dir = dirname(strdup(argv[1]));
	if (statfs(dir, &sfs) != 0 || sfs.f_type != 0x58465342) return 1; /* "XFSB" */

	int freeze_fd = open(dir, O_RDONLY);
	int write_fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	char buf[65536] = {0};
	/* Write 20 MiB so the file is likely to carry post-EOF preallocation. */
	for (int i = 0; i < 320; i++) write(write_fd, buf, sizeof(buf));

	ioctl(freeze_fd, FIFREEZE, 0);
	pthread_t thread;
	pthread_create(&thread, NULL, close_thread, &write_fd);
	while (!close_started) usleep(1000);
	usleep(1000000); // Wait 1s
	if (!close_completed) printf("SUCCESS: close() hung!\n");
	ioctl(freeze_fd, FITHAW, 0);
	pthread_join(thread, NULL);
	unlink(argv[1]);
	return 0;
}
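
When testing a patched kernel, it may be more informative to measure how long
close() takes than to rely on the fixed one-second wait. A minimal sketch of
such a helper follows; timed_close_ms() is a hypothetical name and is not part
of the reproducer or the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <unistd.h>

/* Close fd and return how long close() took in milliseconds, or -1.0 on error. */
static double timed_close_ms(int fd)
{
	struct timespec start, end;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1.0;
	if (close(fd) != 0)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1.0;

	return (end.tv_sec - start.tv_sec) * 1000.0 +
	       (end.tv_nsec - start.tv_nsec) / 1e6;
}
```

It would be called from the closing thread in place of the bare close(), since
on an unpatched kernel the call does not return until another thread thaws the
filesystem.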

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205833
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1474726
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@xxxxxxxxx>
---
fs/xfs/xfs_file.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 845a97c9b063..401403e066c9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1798,6 +1798,15 @@ xfs_file_release(
xfs_is_zoned_inode(ip))
return 0;

+ /*
+ * If the filesystem is frozen or freezing, don't trigger transactions
+ * that would block close() indefinitely. Background block garbage
+ * collection will clean up these speculative preallocations once
+ * the filesystem thaws.
+ */
+ if (!xfs_fs_writable(mp, SB_FREEZE_WRITE))
+ return 0;
+
/*
* If we can't get the iolock just skip truncating the blocks past EOF
* because we could deadlock with the mmap_lock otherwise. We'll get
--
2.47.3