Re: [PATCH v5] f2fs: protect published gc_thread during teardown

From: Cen Zhang

Date: Sat Jun 27 2026 - 11:47:52 EST

Hi,

Just a very gentle ping on this v5 patch.

I wanted to check whether this approach looks acceptable, or if there is
anything else you would like me to change.

I'm happy to respin it if needed.

Thanks a lot for your time,

Best regards,
Zhang Cen

Zhang Cen <rollkingzzc@xxxxxxxxx> 于2026年6月16日周二 14:49写道：
>
> f2fs_stop_gc_thread() stops the background GC task, wakes GC_MERGE
> foreground waiters, frees sbi->gc_thread, and then clears the published
> pointer. A foreground f2fs_balance_fs() caller can already have copied
> that pointer and queued itself on gc_th->fggc_wq, so freeing gc_th at
> stop time can leave finish_wait() operating on a freed waitqueue.
>
> The buggy scenario involves two paths, with each column showing the
> order within that path:
>
> foreground f2fs_balance_fs() caller: shutdown path:
> 1. observes no checkpoint error 1. sets CP_ERROR_FLAG
> 2. snapshots sbi->gc_thread 2. enters f2fs_stop_gc_thread()
> 3. queues on gc_th->fggc_wq 3. stops gc_th->f2fs_gc_task
> 4. wakes gc_wait_queue_head 4. wakes gc_th->fggc_wq
> 5. sleeps for foreground GC 5. frees gc_th
> 6. finish_wait() touches fggc_wq
>
> GC_MERGE does not keep independent work_struct items that shutdown can
> cancel. Its pending foreground GC requests are waitqueue waiters. Drain
> them by withdrawing the GC task pointer, stopping the task, waking
> gc_th->fggc_wq, and leaving each waiter to remove its own wait entry
> with finish_wait().
>
> Keep the allocated GC-thread state until the superblock is destroyed and
> use gc_th->f2fs_gc_task as the running-state marker. The stop path now
> withdraws the task pointer under gc_task_lock, stops the task, and wakes
> foreground waiters, but leaves the waitqueue storage valid. The start
> path reuses a stopped gc_thread object instead of reinitializing its
> waitqueues, and remount restart decisions check the task pointer rather
> than only the object pointer.
>
> f2fs_balance_fs() also snapshots sbi->gc_thread once and rechecks both
> f2fs_cp_error() and f2fs_gc_task after prepare_to_wait(). If shutdown is
> visible or the GC task has already been withdrawn, the caller removes its
> wait entry without waking the GC thread or sleeping for new foreground GC
> work. Thus shutdown drains already queued waiters and stops accepting
> new foreground GC work once shutdown is visible to the caller.
>
> Task pointer users are protected separately from the retained container
> lifetime. A sysfs critical_task_priority store now snapshots the GC task
> under gc_task_lock and holds a task_struct reference while calling
> set_user_nice(). Boolean running-state checks that do not dereference the
> task_struct continue to use READ_ONCE().
>
> One observed report was:
>
> BUG: KASAN: slab-use-after-free in finish_wait+0x276/0x290
> Write of size 8 at addr ffff8881150819b8 by task dd/802
> The buggy address belongs to the object at ffff888115081900 which
> belongs to the cache kmalloc-256 of size 256
> The buggy address is located 184 bytes inside of freed 256-byte region
> Call trace:
> finish_wait()
> f2fs_balance_fs()
> f2fs_write_single_data_page()
> f2fs_write_cache_pages()
> __f2fs_write_data_pages()
> do_writepages()
> filemap_fdatawrite_wbc()
> __filemap_fdatawrite_range()
> file_write_and_wait_range()
> f2fs_do_sync_file()
> f2fs_sync_file()
> do_fsync()
> Freed by task stack:
> kfree()
> f2fs_stop_gc_thread()
> f2fs_do_shutdown()
> f2fs_shutdown()
> fs_bdev_mark_dead()
>
> Fixes: 5911d2d1d1a3 ("f2fs: introduce gc_merge mount option")
> Assisted-by: Codex:gpt-5.5
> Signed-off-by: Zhang Cen <rollkingzzc@xxxxxxxxx>
> ---
> v5:
> Explain that GC_MERGE foreground requests are waitqueue waiters, not
> independent work items; shutdown drains them by withdrawing the GC task,
> waking fggc_wq, and letting waiters dequeue themselves.
> Recheck checkpoint error and the GC task pointer after prepare_to_wait() so
> shutdown-visible f2fs_balance_fs() callers do not wake or sleep for new
> foreground GC work.
>
> v4:
> Replace the v3 SRCU/refcounted lifetime model with a smaller fix that
> keeps the existing heap-allocated gc_thread object alive until
> superblock teardown.
> Use f2fs_gc_task as the running-state marker and withdraw it with
> xchg() before waking GC_MERGE waiters.
> Reuse a stopped gc_thread object across remount restarts so the
> waitqueues are not reinitialized while old waiters can still finish.
> Recheck f2fs_gc_task after prepare_to_wait() so a waiter that races
> with teardown does not sleep after the worker has been withdrawn.
>
> v3:
> Fix checkpatch style issues in the broader lifetime variant.
>
> v2:
> Sashiko.dev pointed out that GC_MERGE foreground waiters and
> GC-thread users needed lifetime-safe access after teardown.
>
> fs/f2fs/gc.c | 52 ++++++++++++++++++++++++++++++-----------------
> fs/f2fs/gc.h | 38 ++++++++++++++++++++++++++++++++++
> fs/f2fs/segment.c | 19 ++++++++++-------
> fs/f2fs/super.c | 7 +++++--
> fs/f2fs/sysfs.c | 18 +++++++++++-----
> 5 files changed, 101 insertions(+), 33 deletions(-)
>
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index ba93010924c0..190d60b3cfc5 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -193,12 +193,24 @@ static int gc_thread_func(void *data)
>
> int f2fs_start_gc_thread(struct f2fs_sb_info *sbi)
> {
> - struct f2fs_gc_kthread *gc_th;
> + struct f2fs_gc_kthread *gc_th = READ_ONCE(sbi->gc_thread);
> + struct task_struct *task;
> + bool allocated = false;
> dev_t dev = sbi->sb->s_bdev->bd_dev;
>
> - gc_th = f2fs_kmalloc(sbi, sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
> - if (!gc_th)
> - return -ENOMEM;
> + if (gc_th && READ_ONCE(gc_th->f2fs_gc_task))
> + return 0;
> +
> + if (!gc_th) {
> + gc_th = f2fs_kmalloc(sbi, sizeof(*gc_th), GFP_KERNEL);
> + if (!gc_th)
> + return -ENOMEM;
> + gc_th->f2fs_gc_task = NULL;
> + spin_lock_init(&gc_th->gc_task_lock);
> + init_waitqueue_head(&gc_th->gc_wait_queue_head);
> + init_waitqueue_head(&gc_th->fggc_wq);
> + allocated = true;
> + }
>
> gc_th->urgent_sleep_time = DEF_GC_THREAD_URGENT_SLEEP_TIME;
> gc_th->valid_thresh_ratio = DEF_GC_THREAD_VALID_THRESH_RATIO;
> @@ -221,34 +233,36 @@ int f2fs_start_gc_thread(struct f2fs_sb_info *sbi)
>
> gc_th->gc_wake = false;
>
> - sbi->gc_thread = gc_th;
> - init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
> - init_waitqueue_head(&sbi->gc_thread->fggc_wq);
> - sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> - "f2fs_gc-%u:%u", MAJOR(dev), MINOR(dev));
> - if (IS_ERR(gc_th->f2fs_gc_task)) {
> - int err = PTR_ERR(gc_th->f2fs_gc_task);
> + task = kthread_create(gc_thread_func, sbi, "f2fs_gc-%u:%u",
> + MAJOR(dev), MINOR(dev));
> + if (IS_ERR(task)) {
> + int err = PTR_ERR(task);
>
> - kfree(gc_th);
> - sbi->gc_thread = NULL;
> + if (allocated)
> + kfree(gc_th);
> return err;
> }
>
> - set_user_nice(gc_th->f2fs_gc_task,
> - PRIO_TO_NICE(sbi->critical_task_priority));
> + set_user_nice(task, PRIO_TO_NICE(sbi->critical_task_priority));
> + if (allocated)
> + WRITE_ONCE(sbi->gc_thread, gc_th);
> + f2fs_set_gc_task(gc_th, task);
> + wake_up_process(task);
> return 0;
> }
>
> void f2fs_stop_gc_thread(struct f2fs_sb_info *sbi)
> {
> - struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
> + struct f2fs_gc_kthread *gc_th = READ_ONCE(sbi->gc_thread);
> + struct task_struct *task;
>
> if (!gc_th)
> return;
> - kthread_stop(gc_th->f2fs_gc_task);
> + task = f2fs_detach_gc_task(gc_th);
> + if (!task)
> + return;
> + kthread_stop(task);
> wake_up_all(&gc_th->fggc_wq);
> - kfree(gc_th);
> - sbi->gc_thread = NULL;
> }
>
> static int select_gc_type(struct f2fs_sb_info *sbi, int gc_type)
> diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
> index 6c4d4567571e..6df406e98939 100644
> --- a/fs/f2fs/gc.h
> +++ b/fs/f2fs/gc.h
> @@ -5,6 +5,9 @@
> * Copyright (c) 2012 Samsung Electronics Co., Ltd.
> * http://www.samsung.com/
> */
> +#include <linux/sched/task.h>
> +#include <linux/spinlock.h>
> +
> #define GC_THREAD_MIN_WB_PAGES 1 /*
> * a threshold to determine
> * whether IO subsystem is idle
> @@ -47,6 +50,7 @@
>
> struct f2fs_gc_kthread {
> struct task_struct *f2fs_gc_task;
> + spinlock_t gc_task_lock; /* protects f2fs_gc_task */
> wait_queue_head_t gc_wait_queue_head;
>
> /* for gc sleep time */
> @@ -72,6 +76,40 @@ struct f2fs_gc_kthread {
> unsigned int boost_gc_greedy;
> };
>
> +static inline struct task_struct *f2fs_get_gc_task(struct f2fs_gc_kthread *gc_th)
> +{
> + struct task_struct *task;
> +
> + spin_lock(&gc_th->gc_task_lock);
> + task = READ_ONCE(gc_th->f2fs_gc_task);
> + if (task)
> + get_task_struct(task);
> + spin_unlock(&gc_th->gc_task_lock);
> +
> + return task;
> +}
> +
> +static inline void f2fs_set_gc_task(struct f2fs_gc_kthread *gc_th,
> + struct task_struct *task)
> +{
> + spin_lock(&gc_th->gc_task_lock);
> + WRITE_ONCE(gc_th->f2fs_gc_task, task);
> + spin_unlock(&gc_th->gc_task_lock);
> +}
> +
> +static inline struct task_struct *
> +f2fs_detach_gc_task(struct f2fs_gc_kthread *gc_th)
> +{
> + struct task_struct *task;
> +
> + spin_lock(&gc_th->gc_task_lock);
> + task = READ_ONCE(gc_th->f2fs_gc_task);
> + WRITE_ONCE(gc_th->f2fs_gc_task, NULL);
> + spin_unlock(&gc_th->gc_task_lock);
> +
> + return task;
> +}
> +
> struct gc_inode_list {
> struct list_head ilist;
> struct radix_tree_root iroot;
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index 788f8b050249..5f61627009a5 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -424,6 +424,8 @@ int f2fs_commit_atomic_write(struct inode *inode)
> */
> void f2fs_balance_fs(struct f2fs_sb_info *sbi, bool need)
> {
> + struct f2fs_gc_kthread *gc_th;
> +
> if (f2fs_cp_error(sbi))
> return;
>
> @@ -444,15 +446,18 @@ void f2fs_balance_fs(struct f2fs_sb_info *sbi, bool need)
> if (has_enough_free_secs(sbi, 0, 0))
> return;
>
> - if (test_opt(sbi, GC_MERGE) && sbi->gc_thread &&
> - sbi->gc_thread->f2fs_gc_task) {
> + gc_th = READ_ONCE(sbi->gc_thread);
> + if (test_opt(sbi, GC_MERGE) && gc_th &&
> + READ_ONCE(gc_th->f2fs_gc_task)) {
> DEFINE_WAIT(wait);
>
> - prepare_to_wait(&sbi->gc_thread->fggc_wq, &wait,
> - TASK_UNINTERRUPTIBLE);
> - wake_up(&sbi->gc_thread->gc_wait_queue_head);
> - io_schedule();
> - finish_wait(&sbi->gc_thread->fggc_wq, &wait);
> + prepare_to_wait(&gc_th->fggc_wq, &wait,
> + TASK_UNINTERRUPTIBLE);
> + if (!f2fs_cp_error(sbi) && READ_ONCE(gc_th->f2fs_gc_task)) {
> + wake_up(&gc_th->gc_wait_queue_head);
> + io_schedule();
> + }
> + finish_wait(&gc_th->fggc_wq, &wait);
> } else {
> struct f2fs_gc_control gc_control = {
> .victim_segno = NULL_SEGNO,
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index ccf806b676f5..d6863da05a7c 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -2925,11 +2925,12 @@ static int __f2fs_remount(struct fs_context *fc, struct super_block *sb)
> if ((flags & SB_RDONLY) ||
> (F2FS_OPTION(sbi).bggc_mode == BGGC_MODE_OFF &&
> !test_opt(sbi, GC_MERGE))) {
> - if (sbi->gc_thread) {
> + if (sbi->gc_thread && READ_ONCE(sbi->gc_thread->f2fs_gc_task)) {
> f2fs_stop_gc_thread(sbi);
> need_restart_gc = true;
> }
> - } else if (!sbi->gc_thread) {
> + } else if (!sbi->gc_thread ||
> + !READ_ONCE(sbi->gc_thread->f2fs_gc_task)) {
> err = f2fs_start_gc_thread(sbi);
> if (err)
> goto restore_opts;
> @@ -5451,6 +5452,7 @@ static int f2fs_fill_super(struct super_block *sb, struct fs_context *fc)
> free_sb_buf:
> kfree(raw_super);
> free_sbi:
> + kfree(sbi->gc_thread);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> lockdep_unregister_key(&sbi->cp_global_sem_key);
> #endif
> @@ -5535,6 +5537,7 @@ static void kill_f2fs_super(struct super_block *sb)
> /* Release block devices last, after fscrypt_destroy_keyring(). */
> if (sbi) {
> destroy_device_list(sbi);
> + kfree(sbi->gc_thread);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> lockdep_unregister_key(&sbi->cp_global_sem_key);
> #endif
> diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
> index 352e96ad5c3a..07543df2b5b1 100644
> --- a/fs/f2fs/sysfs.c
> +++ b/fs/f2fs/sysfs.c
> @@ -12,6 +12,7 @@
> #include <linux/seq_file.h>
> #include <linux/unicode.h>
> #include <linux/ioprio.h>
> +#include <linux/sched/task.h>
> #include <linux/sysfs.h>
>
> #include "f2fs.h"
> @@ -981,17 +982,24 @@ static ssize_t __sbi_store(struct f2fs_attr *a,
> }
>
> if (!strcmp(a->attr.name, "critical_task_priority")) {
> + struct f2fs_gc_kthread *gc_th;
> + struct task_struct *gc_task;
> + int nice;
> +
> if (t < NICE_TO_PRIO(MIN_NICE) || t > NICE_TO_PRIO(MAX_NICE))
> return -EINVAL;
> if (!capable(CAP_SYS_NICE))
> return -EPERM;
> sbi->critical_task_priority = t;
> + nice = PRIO_TO_NICE(sbi->critical_task_priority);
> if (sbi->cprc_info.f2fs_issue_ckpt)
> - set_user_nice(sbi->cprc_info.f2fs_issue_ckpt,
> - PRIO_TO_NICE(sbi->critical_task_priority));
> - if (sbi->gc_thread && sbi->gc_thread->f2fs_gc_task)
> - set_user_nice(sbi->gc_thread->f2fs_gc_task,
> - PRIO_TO_NICE(sbi->critical_task_priority));
> + set_user_nice(sbi->cprc_info.f2fs_issue_ckpt, nice);
> + gc_th = READ_ONCE(sbi->gc_thread);
> + gc_task = gc_th ? f2fs_get_gc_task(gc_th) : NULL;
> + if (gc_task) {
> + set_user_nice(gc_task, nice);
> + put_task_struct(gc_task);
> + }
> return count;
> }
>
> --
> 2.43.0
>