[RFC 0/1] fs/jffs2: Deadlock in write buffer recovery path

From: Wang Zhaolong
Date: Tue Mar 18 2025 - 22:47:42 EST


## Introduction

I discovered a potential deadlock in the JFFS2 file system when an error
occurs during the write buffer flush. The issue is an AA-deadlock where
the same lock (c->wbuf_sem) is acquired twice in a nested manner by the
same thread. The attached patch attempts to fix this issue, but I'm
concerned about a potential recursive stack overflow problem that could be
introduced by the fix.

## Problem Analysis

The deadlock occurs in the following execution path:

```
jffs2_write_end
jffs2_write_inode_range
jffs2_write_dnode
jffs2_flash_writev
down_write(&c->wbuf_sem) # first hold lock
__jffs2_flush_wbuf
mtd_write() # return error
jffs2_wbuf_recover
jffs2_reserve_space_gc
jffs2_do_reserve_space
jffs2_flush_wbuf_pad
down_write(&c->wbuf_sem) # AA deadlock
```

When an error occurs during mtd_write() in __jffs2_flush_wbuf, the error
handling code calls jffs2_wbuf_recover to attempt recovery. This function
eventually calls jffs2_reserve_space_gc and jffs2_do_reserve_space. If the
write buffer is dirty, jffs2_do_reserve_space will call jffs2_flush_wbuf_pad,
which tries to acquire the wbuf_sem again, resulting in a deadlock since
the thread already holds this lock.

## Problem Reproduce

### kernel config:

```
CONFIG_MTD=m
CONFIG_MTD_PARTITIONED_MASTER=y
CONFIG_MTD_NAND_CORE=m
CONFIG_MTD_NAND_ECC_SW_HAMMING=m
CONFIG_MTD_RAW_NAND=m
CONFIG_MTD_NAND_NANDSIM=m

CONFIG_JFFS2_FS=m
CONFIG_JFFS2_FS_DEBUG=0
CONFIG_JFFS2_FS_WRITEBUFFER=y
CONFIG_JFFS2_SUMMARY=y
CONFIG_JFFS2_FS_XATTR=y
CONFIG_JFFS2_FS_POSIX_ACL=y

CONFIG_FAULT_INJECTION=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_FAIL_FUNCTION=y
CONFIG_FAIL_MAKE_REQUEST=y
```

### Reproduce step

```shell
ID="0xec,0xa1,0x00,0x15" # 128M 128KB 2KB
modprobe nandsim id_bytes=$ID parts=200
flash_erase /dev/mtd1 0 0
modprobe jffs2
mount -t jffs2 mtd1 /mnt
dd if=/dev/urandom of=/mnt/testfile bs=1M count=1


# Injecting function failure
mountpoint -q /sys/kernel/debug || mount -t debugfs nodev /sys/kernel/debug
FAILTYPE=fail_function
FAILFUNC=mtd_write
echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject
printf %#x -5 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval
# There is a 90% probability that triggering fails.
echo 90 > /sys/kernel/debug/$FAILTYPE/probability
echo 1 > /sys/kernel/debug/$FAILTYPE/verbose
echo -1 > /sys/kernel/debug/$FAILTYPE/times
echo 0 > /sys/kernel/debug/$FAILTYPE/space
echo 1 > /sys/kernel/debug/$FAILTYPE/verbose

# Overwrite the file until the process stops responding.
while true; do dd if=/dev/urandom of=/mnt/testfile bs=1M count=1; done

## The solution I can think of

The approach in the attached patch is to add a new parameter (wbuf_sem_held)
to jffs2_reserve_space_gc and jffs2_do_reserve_space functions, indicating
whether the calling context already holds the wbuf_sem lock. When jffs2_wbuf_recover
calls jffs2_reserve_space_gc, it passes wbuf_sem_held=1 to indicate the
lock is already held.

Then in jffs2_do_reserve_space, when the buffer is dirty and wbuf_sem_held is set,
we call __jffs2_flush_wbuf directly instead of jffs2_flush_wbuf_pad to avoid trying
to acquire the lock again.

## Alternative Solutions Considered

1. Lock-free recovery path: Completely redesigning the recovery path to not
require locks. This would be a major change with high risk.

2. Using trylock: Replace down_write() with down_write_trylock() in
jffs2_flush_wbuf_pad, and handle the case where it returns 0. This approach
is problematic because we can't tell if the lock is held by our thread or
another thread.

3. Release and reacquire lock: Release wbuf_sem before calling jffs2_wbuf_recover
and reacquire it after. This could lead to consistency problems because another
thread might modify the write buffer state in between.

4. Introduce an additional global state:

## Potential Issue: Recursive Stack Overflow

While the patch fixes the deadlock, it introduces a potential recursive stack
overflow problem. If mtd_write() consistently fails, we could have this recursive
path:

```
__jffs2_flush_wbuf
mtd_write() # return error
jffs2_wbuf_recover
jffs2_reserve_space_gc
jffs2_do_reserve_space
__jffs2_flush_wbuf # with wbuf_sem_held=1
mtd_write() # return error again
jffs2_wbuf_recover
... # and so on
```

Each failure and recovery attempt adds another level to the call stack. If the
errors persist, this could lead to a stack overflow. Unfortunately, this is the
original trigger for the problem, because there is some unknown hardware exception
that returns -EIO for any write request,

## Request for Guidance

I'm seeking feedback on how to address both the deadlock.

1. Is my analysis of the deadlock correct?
2. Is the current patch approach sensible, or would a different solution be better?
3. How should we prevent the recursive stack overflow problem?
- One possibility is to clear c->wbuf_len in error paths in jffs2_wbuf_recover
- Another option is to detect recursion in jffs2_do_reserve_space and avoid
calling __jffs2_flush_wbuf when already in an error recovery path

I'd appreciate any feedback or suggestions on how to best address these issues while
maintaining the reliability of the JFFS2 file system.

Wang Zhaolong (1):
fs/jffs2: Avoid a possible deadlock in jffs2_wbuf_recover

fs/jffs2/gc.c | 12 ++++++------
fs/jffs2/nodelist.h | 12 +++++++++++-
fs/jffs2/nodemgmt.c | 16 ++++++++++------
fs/jffs2/os-linux.h | 1 +
fs/jffs2/wbuf.c | 13 ++-----------
fs/jffs2/write.c | 4 ++--
fs/jffs2/xattr.c | 4 ++--
7 files changed, 34 insertions(+), 28 deletions(-)

--
2.34.3