[PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs

From: Christian Brauner

Date: Tue Mar 03 2026 - 08:57:58 EST


So this is a bit of a crazy series. I've played around with it for
some time and I kinda need to move on to other stuff, so I'm sending out
where I've left it, as it's overall in a shape where the approach and
idea can be grasped. There are some kthread cleanups at the beginning as
well that are mostly unrelated but fell out of this work, as this
whole approach of dumping ever more special helper functions is not very
sustainable. But anyway...

... When the kernel is started it roughly goes like this:

init_task
==> create pid 1 (systemd etc.)
==> pid 2 (kthreadd)

After this point all kthreads and PID 1 share the same filesystem state.
That obviously already came up when we discussed pivot_root() as this
allows pivot_root() to rewrite the fs_struct of all kthreads.

I kinda hate this rewriting and the implicit sharing, which is abused
left and right - but who knows maybe others really like it - so I worked
a bit to get rid of it in a more fundamental way. Is it crazy? Yes. Is it
likely broken? Yes. Does it at least boot? Yes.

Instead of sharing fs_struct between kernel threads and pid 1 we give
pid 1 a separate userspace_init_fs struct. All kthreads continue sharing
init_fs as before and userspace_init_fs is isolated from the kthreads'
filesystem state. IOW, userspace init cannot affect the kthreads'
filesystem state anymore and kthreads cannot affect userspace's
filesystem state anymore - without explicit opt-in.

This series makes performing mountains of filesystem work such as path
lookup and file opening and so on from kthreads hard - painfully so. I
think this is a benefit because it takes the idea of just offloading
_security sensitive_ operations in init's filesystem state, such as
running random binaries or opening and creating files, to kthreads
behind the shed... And imho it should.

The only remaining kernel tasks that actually share init's filesystem
state are usermodehelpers - as they execute random binaries in the root
filesystem. Another concept we should really show the back of the shed.

This gives a lot stronger guarantees than what we have now. This also
makes path lookup from kthreads fail by default. IOW, it won't be
possible anymore to just look up random stuff in init's filesystem state
without explicitly opting in to that.

The places that need to perform lookup in init's filesystem state may
use LOOKUP_IN_INIT which will grab userspace_init_fs and use that for
root or pwd. Note that we can't just walk up to the topmost mount,
otherwise someone in userspace can do mount -t tmpfs tmpfs / and mess
with a kthread's lookup state. We also sometimes might need the working
directory.
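
As a rough userspace model of the rule above (not the kernel
implementation): every task points at some filesystem state, kthreads
point at an empty one, and a lookup only sees userspace init's state
when the caller opts in. All names here (`toy_fs`, `toy_lookup()`,
`TOY_LOOKUP_IN_INIT`) are made up for illustration and merely mirror
the idea behind LOOKUP_IN_INIT.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opt-in flag for this model only. */
#define TOY_LOOKUP_IN_INIT 0x1u

/* Toy stand-in for an fs_struct: a root that knows a few names. */
struct toy_fs {
	const char *const *names;	/* NULL-terminated; empty for "nullfs" */
};

const char *const toy_nullfs_names[] = { NULL };
const char *const toy_rootfs_names[] = { "/etc", "/dev", NULL };

/* Shared by all kthreads in the model: an empty root. */
const struct toy_fs toy_init_fs = { toy_nullfs_names };
/* Owned by userspace init (pid 1) in the model: the populated root. */
const struct toy_fs toy_userspace_init_fs = { toy_rootfs_names };

/* Return 0 if path exists in the selected root, -ENOENT otherwise. */
int toy_lookup(const struct toy_fs *task_fs, const char *path,
	       unsigned int flags)
{
	const struct toy_fs *fs = task_fs;
	size_t i;

	/* Explicit opt-in: resolve against userspace init's state. */
	if (flags & TOY_LOOKUP_IN_INIT)
		fs = &toy_userspace_init_fs;

	for (i = 0; fs->names[i]; i++)
		if (strcmp(fs->names[i], path) == 0)
			return 0;
	return -ENOENT;
}
```

In this model `toy_lookup(&toy_init_fs, "/etc", 0)` fails with -ENOENT
while the same call with `TOY_LOOKUP_IN_INIT` succeeds, which is the
"fail by default, opt in explicitly" behaviour described above.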

We now also warn and notice when pid 1 simply stops sharing filesystem
state with us, i.e., abandons its userspace_init_fs.

On older kernels if PID 1 unshared its filesystem state with us the
kernel simply used the stale fs_struct state implicitly pinning
anything that PID 1 had last used. Even if PID 1 might've moved on to
some completely different fs_struct state and might've even unmounted
the old root.

This has hilarious consequences: Think continuing to dump coredump
state into an implicitly pinned directory somewhere. Calling random
binaries in the old rootfs via usermodehelpers.

Be aggressive about this: We simply reject operating on stale
fs_struct state by reverting userspace_init_fs to nullfs. Every kworker
that does lookups after this point will fail. Every usermodehelper call
will fail. This is a lot stronger but I wouldn't know what it means for
pid 1 to simply stop sharing its fs state with the kernel. Clearly it
wanted to separate so cut all ties.
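
A toy sketch of that policy, with made-up names and no relation to the
actual patch code: once pid 1 abandons the shared state, the model
points the "userspace init" state at an empty nullfs instead of keeping
the stale root pinned, so later opt-in lookups fail.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy filesystem state: either has entries or is an empty nullfs. */
struct toy_fs {
	bool has_entries;
};

struct toy_fs toy_nullfs = { false };
struct toy_fs toy_rootfs = { true };

/* What opt-in lookups resolve against in this model. */
struct toy_fs *toy_userspace_init_fs = &toy_rootfs;

/* pid 1 stopped sharing: cut all ties instead of pinning the old root. */
void toy_init_abandons_fs(void)
{
	toy_userspace_init_fs = &toy_nullfs;
}

/* Stand-in for any lookup done with the opt-in flag. */
int toy_lookup_in_init(void)
{
	return toy_userspace_init_fs->has_entries ? 0 : -ENOENT;
}
```

Before `toy_init_abandons_fs()` the lookup succeeds; afterwards every
call returns -ENOENT, matching "every kworker that does lookups after
this point will fail".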

I've gone through the kernel and looked at hopefully everything that
does path lookup from kthreads (workqueues, ...).

The only really unfortunate place is initramfs unpacking because it runs
mostly from a workqueue but if there's "too much work" pending it will
fall back to synchronous in-task execution. Ideally it would just always
go async instead of this weird fallback.

TL;DR:

root@localhost:~# stat --file-system /proc/1/root
File: "/proc/1/root"
ID: e3cb00dd533cd3d7 Namelen: 255 Type: ext2/ext3

root@localhost:~# stat --file-system /proc/2/root
File: "/proc/2/root"
ID: 200000000 Namelen: 255 Type: nullfs

=========================================================================
Here's my review. It's long and ugly, I might have missed stuff:
=========================================================================

==== 1. devtmpfs -- kdevtmpfs kthread ====
Dedicated kthread sharing init_fs (nullfs).

```
kernel_init_freeable() # PID 1
-> do_basic_setup()
-> driver_init()
-> devtmpfs_init()
-> kthread_run(devtmpfsd, &err, "kdevtmpfs")
-> devtmpfsd() # kdevtmpfs kthread context
-> devtmpfs_setup() # runs IN the kthread
-> devtmpfs_work_loop() # runtime loop IN the kthread
```

`devtmpfs_setup()` runs inside the kdevtmpfs kthread, NOT PID 1. However, it is
safe because:

- `ksys_unshare(CLONE_NEWNS)` implies `CLONE_FS` giving the kthread a
**private** copy of init_fs.
- `init_mount("devtmpfs", "/", ...)` mounts devtmpfs over the nullfs root
- `init_chdir("/.."); init_chroot(".")` chroots into the devtmpfs mount

All runtime paths (`handle_create`, `handle_remove`, `create_path`,
`delete_path`) operate within this private chroot via
`devtmpfs_work_loop()`.

**No conversion needed**

==== 2. ksmbd -- `ksmbd-io` workqueue ====

Let's ignore for a second that this basically does all I/O from
kthread context and the security implications of this...

Heaviest subsystem user. Every SMB file operation goes through workqueue
path lookups. Per-connection kthreads (`ksmbd_conn_handler_loop`) read
requests and dispatch to the `ksmbd-io` workqueue via
`handle_ksmbd_work()`.

**Converted to LOOKUP_IN_INIT**

==== 3. nfsd -- kthreads + laundromat workqueue ====

nfsd service threads are kthreads spawned via `kthread_create_on_node` in
`svc_new_thread()`. The `nfsd()` threadfn is passed through
`svc_create_pooled()` -> `serv->sv_threadfn`.

**Service kthreads (`nfsd()` threadfn):**

The nfsd kthreads call `unshare_fs_struct()` on startup for umask control
(`current->fs->umask = 0`), not for path lookups. NFS request handling
dispatches through `svc_recv()` -> NFS procedure handlers which use
**filehandle-based resolution** (`fh_verify()` etc.) relative to export
mount points. They never resolve paths from `current->fs->root`.
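
To illustrate why a private fs_struct copy is enough for umask control,
here is a small userspace model. The types and `toy_unshare_fs()` are
hypothetical and only mimic the effect described for
`unshare_fs_struct()`; they are not the kernel's data structures.

```c
#include <stdlib.h>

/* Toy fs state: only the field that matters for this example. */
struct toy_fs {
	int umask;
	int users;	/* how many tasks point at this state */
};

struct toy_task {
	struct toy_fs *fs;
};

/* Give 'task' a private copy of its fs state. Returns 0 or -1. */
int toy_unshare_fs(struct toy_task *task)
{
	struct toy_fs *copy;

	if (task->fs->users == 1)
		return 0;	/* already private, nothing to do */

	copy = malloc(sizeof(*copy));
	if (!copy)
		return -1;

	*copy = *task->fs;	/* start from the shared values */
	copy->users = 1;
	task->fs->users--;	/* drop our reference on the shared state */
	task->fs = copy;
	return 0;
}
```

After `toy_unshare_fs(&worker)`, setting `worker.fs->umask = 0` leaves
the umask seen by every other task untouched. The caller owns the copy
and frees it when the task goes away.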

**No conversion needed**

==== 4. kernel_init (PID 1 before execve) ====

All `init_*()` wrappers in `fs/init.c` do `kern_path()` or
`filename_create()`/`filename_parentat()`. The lookup API table is
listed once here; the callchains below show every path that reaches them
from PID 1.

**Callchain 1: kernel_init() direct**

```
kernel_init()
-> do_sysctl_args()
-> process_sysctl_arg()
-> file_open_root_mnt() # uses kern_mount'd procfs, not fs_struct
```

**Callchain 2: kernel_init_freeable() direct**

```
kernel_init() # PID 1
-> kernel_init_freeable()
-> console_on_rootfs()
-> filp_open("/dev/console", ...)
-> init_eaccess(ramdisk_execute_command)
-> kern_path()
```

**Callchain 3: prepare_namespace() -> mount_root()**

```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> mount_root()
-> mount_root_generic()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_nodev_root()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_nfs_root()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_cifs_root()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_block_root()
-> create_dev()
-> init_unlink()
-> init_mknod()
-> mount_root_generic()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
```

**Callchain 4: prepare_namespace() -> initrd_load()**

```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> initrd_load()
-> create_dev()
-> init_unlink()
-> init_mknod()
-> rd_load_image()
-> filp_open() (x2)
-> init_unlink()
```

**Callchain 5: prepare_namespace() -> devtmpfs_mount()**

```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> devtmpfs_mount()
-> init_mount("devtmpfs", "dev", ...)
```

Note: this is `devtmpfs_mount()` called from PID 1 context (mounts
devtmpfs at /dev after the real root is mounted). Distinct from
`devtmpfs_setup()` which runs in the kdevtmpfs kthread (section 1).

**Callchain 6: prepare_namespace() -> pivot + umount**

```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> init_pivot_root(".", ".") # kern_path() x2
-> init_umount(".", MNT_DETACH) # kern_path()
```

**Callchain 7: prepare_namespace() -> md_run_setup()**

```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> md_run_setup()
-> md_setup_drive()
-> init_stat()
```

**Callchain 8: do_basic_setup() -> do_initcalls() (rootfs_initcall)**

```
kernel_init() # PID 1
-> kernel_init_freeable()
-> do_basic_setup()
-> do_initcalls()
-> rootfs_initcall(default_rootfs)
-> default_rootfs()
-> init_mkdir("/dev", 0755)
-> init_mknod("/dev/console", ...)
-> init_mkdir("/root", 0700)
```

Only used when `CONFIG_BLK_DEV_INITRD` is not set (no initramfs).

PID 1 uses pid1_fs which points to the initramfs (set by
`init_chroot_to_overmount()` at the start of `kernel_init()`). The
correct context is available so nothing to worry about.

**No conversion needed**

==== 5. Initramfs/initrd unpacking -- async kworker ====

`do_populate_rootfs()` runs as `async_schedule_domain()` callback
(kworker). When `initramfs_async=0` the work is still queued the same
way; `populate_rootfs()` merely waits for it to finish (see the end of
this section).

**Async workqueue creation:**

```
async_init()
-> alloc_workqueue("async", WQ_UNBOUND, 0) # line 359
```

**Async scheduling chain (how do_populate_rootfs ends up in kworker):**

```
kernel_init() # PID 1
-> init_fs() # init/main.c -- switches PID 1 to pid1_fs (rootfs)
-> kernel_init_freeable()
-> do_basic_setup()
-> do_initcalls()
-> rootfs_initcall(populate_rootfs) # init/initramfs.c:791
-> populate_rootfs() # init/initramfs.c:782
-> async_schedule_domain(do_populate_rootfs, NULL, &initramfs_domain) # line 784
-> async_schedule_node_domain() # include/linux/async.h:69
-> __async_schedule_node_domain() # kernel/async.c:150
-> INIT_WORK(&entry->work, async_run_entry_fn) # line 162
-> entry->func = do_populate_rootfs # line 163
-> queue_work_node(node, async_wq, &entry->work) # line 180
-> kworker picks up work item # async_wq = "async" WQ_UNBOUND workqueue
-> async_run_entry_fn() # kernel/async.c:122
-> entry->func(entry->data, entry->cookie) # line 139
-> do_populate_rootfs(NULL, cookie) # RUNS IN KWORKER CONTEXT
```

Note: `async_schedule_node_domain()` has an OOM fallback that runs
`func(data, newcookie)` synchronously in the caller's context (PID 1)
if `kzalloc` fails or `entry_count > MAX_WORK` (kernel/async.c:215-221).
In that case the function runs safely in PID 1. The async kworker case
is the one that needs conversion.
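
The decision described in the note can be modelled in a few lines.
This is a simplified sketch with invented names and an arbitrary small
limit, not a copy of kernel/async.c; it only shows that the same
callback may end up in two different task contexts.

```c
#include <stdbool.h>

/* Arbitrary small limit for the model; not the kernel's value. */
#define TOY_MAX_WORK 4

/* Where the scheduled function ends up running. */
enum toy_ctx {
	TOY_CTX_CALLER,		/* synchronously, in the scheduling task */
	TOY_CTX_KWORKER,	/* later, in a workqueue worker */
};

struct toy_async {
	int entry_count;	/* pending entries */
};

/*
 * Model of the scheduling decision: fall back to running in the
 * caller when the entry allocation failed or too much work is pending.
 */
enum toy_ctx toy_async_schedule(struct toy_async *q, bool alloc_ok)
{
	if (!alloc_ok || q->entry_count > TOY_MAX_WORK)
		return TOY_CTX_CALLER;

	q->entry_count++;
	return TOY_CTX_KWORKER;
}
```

Because the callback cannot know in advance which branch is taken, its
lookups have to work from both contexts, which is why the kworker case
needs the explicit opt-in.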

Work items execute in kworker kthreads (children of kthreadd, share init_fs).
The kworker's `current->fs` is `init_fs` which now points to **nullfs**.

**Callchain 1: do_name() regular file creation (S_ISREG)**

```
do_populate_rootfs() # kworker context (async_wq)
-> unpack_to_rootfs(__initramfs_start, __initramfs_size) # init/initramfs.c:721
-> write_buffer() # init/initramfs.c:465
-> actions[GotName] = do_name() # init/initramfs.c:361
-> clean_path(collected, mode) # init/initramfs.c:378
-> init_stat(path, &st, AT_SYMLINK_NOFOLLOW) # init/initramfs.c:337
-> kern_path() # fs/init.c:150
-> filename_lookup(AT_FDCWD, ...) # fs/namei.c:2836
-> path_lookupat() # fs/namei.c:2813
-> path_init() # fs/namei.c:2673
-> nd_jump_root() # absolute paths
-> set_root() # uses current->fs = init_fs (NULLFS)
-> init_rmdir(path) # init/initramfs.c:340 (if S_ISDIR)
-> filename_rmdir(AT_FDCWD, name) # fs/init.c:194
-> filename_parentat() -> path_parentat() -> path_init() -> current->fs (NULLFS)
-> init_unlink(path) # init/initramfs.c:342 (if not S_ISDIR)
-> filename_unlinkat(AT_FDCWD, name) # fs/init.c:182
-> filename_parentat() -> path_parentat() -> path_init() -> current->fs (NULLFS)
-> maybe_link() # init/initramfs.c:380
-> find_link() # init/initramfs.c:90 (hardlink hash lookup)
[if hardlink found:]
-> clean_path(collected, 0) # same as above (init_stat/init_rmdir/init_unlink)
-> init_link(old, collected) # init/initramfs.c:352
-> filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0) # fs/init.c:169
-> filename_lookup(olddfd, old, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> filename_create(newdfd, new, ...) # -> filename_parentat() -> path_init() -> NULLFS
[if not hardlink:]
-> filp_open(collected, O_WRONLY|O_CREAT|O_LARGEFILE, mode) # init/initramfs.c:385
-> file_open_name() # fs/open.c:1338
-> do_file_open(AT_FDCWD, name, &op) # fs/open.c:1322
-> path_openat() # fs/namei.c:4821
-> path_init() # -> nd_jump_root() -> set_root() -> NULLFS
-> vfs_fchown(wfile, uid, gid) # init/initramfs.c:391 (on already-open file, SAFE)
-> vfs_fchmod(wfile, mode) # init/initramfs.c:392 (on already-open file, SAFE)
-> vfs_truncate(&wfile->f_path, body_len) # init/initramfs.c:394 (on already-open path, SAFE)
```

**Callchain 2: do_name() directory creation (S_ISDIR)**

```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer() -> do_name()
-> clean_path(collected, mode) # init/initramfs.c:378 (same as callchain 1)
-> init_mkdir(collected, mode) # init/initramfs.c:398
-> filename_mkdirat(AT_FDCWD, name, mode) # fs/init.c:188
-> filename_create(AT_FDCWD, name, ...) # fs/namei.c:4903
-> filename_parentat(AT_FDCWD, name, ...) # fs/namei.c:2900
-> __filename_parentat() # fs/namei.c:2875
-> path_parentat() # fs/namei.c:2858
-> path_init() # -> nd_jump_root() -> set_root() -> NULLFS
-> init_chown(collected, uid, gid, 0) # init/initramfs.c:399
-> kern_path(filename, LOOKUP_FOLLOW, &path) # fs/init.c:106
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> init_chmod(collected, mode) # init/initramfs.c:400
-> kern_path(filename, LOOKUP_FOLLOW, &path) # fs/init.c:123
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> dir_add(collected, name_len, mtime) # init/initramfs.c:401 (saves for later dir_utime)
```

**Callchain 3: do_name() device/pipe/socket creation (S_ISBLK/S_ISCHR/S_ISFIFO/S_ISSOCK)**

```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer() -> do_name()
-> clean_path(collected, mode) # init/initramfs.c:378 (same as callchain 1)
-> maybe_link() # init/initramfs.c:404
[if not hardlink:]
-> init_mknod(collected, mode, rdev) # init/initramfs.c:405
-> filename_mknodat(AT_FDCWD, name, mode, dev) # fs/init.c:162
-> filename_create(AT_FDCWD, name, ...) # fs/namei.c:4903
-> filename_parentat() # -> path_parentat() -> path_init() -> NULLFS
-> init_chown(collected, uid, gid, 0) # init/initramfs.c:406
-> kern_path() # -> filename_lookup() -> path_init() -> NULLFS
-> init_chmod(collected, mode) # init/initramfs.c:407
-> kern_path() # -> filename_lookup() -> path_init() -> NULLFS
-> do_utime(collected, mtime) # init/initramfs.c:408
-> init_utimes(filename, t) # init/initramfs.c:136
-> kern_path(filename, 0, &path) # fs/init.c:202
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
```

**Callchain 4: do_symlink() symlink creation (S_ISLNK)**

```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer()
-> actions[GotSymlink] = do_symlink() # init/initramfs.c:436
-> clean_path(collected, 0) # init/initramfs.c:445
-> init_stat() -> kern_path() # -> path_init() -> NULLFS
-> init_rmdir() or init_unlink() # -> filename_parentat() -> path_init() -> NULLFS
-> init_symlink(collected + N_ALIGN(name_len), collected) # init/initramfs.c:446
-> filename_symlinkat(old, AT_FDCWD, new) # fs/init.c:176
-> filename_create(AT_FDCWD, new, ...) # fs/namei.c:4903
-> filename_parentat() # -> path_parentat() -> path_init() -> NULLFS
-> init_chown(collected, uid, gid, AT_SYMLINK_NOFOLLOW) # init/initramfs.c:447
-> kern_path(filename, 0, &path) # fs/init.c:106 (lookup_flags = 0, no LOOKUP_FOLLOW)
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> do_utime(collected, mtime) # init/initramfs.c:448
-> init_utimes(filename, t) # init/initramfs.c:136
-> kern_path(filename, 0, &path) # fs/init.c:202
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
```

**Callchain 5: dir_utime() directory timestamp fixup (CONFIG_INITRAMFS_PRESERVE_MTIME)**

```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
[at end of unpack_to_rootfs, after all cpio entries processed:]
-> dir_utime() # init/initramfs.c:567
-> list_for_each_entry_safe(de, ...) # init/initramfs.c:168
-> do_utime(de->name, de->mtime) # init/initramfs.c:170
-> init_utimes(filename, t) # init/initramfs.c:136
-> kern_path(filename, 0, &path) # fs/init.c:202
-> filename_lookup(AT_FDCWD, ...) # fs/namei.c:2836
-> path_lookupat() # -> path_init() -> NULLFS
```

**Callchain 6: populate_initrd_image() non-cpio initrd (CONFIG_BLK_DEV_RAM)**

```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs((char *)initrd_start, ...) # init/initramfs.c:733 (returns error for non-cpio)
[err != NULL && CONFIG_BLK_DEV_RAM:]
-> populate_initrd_image(err) # init/initramfs.c:736
-> filp_open("/initrd.image", O_WRONLY|O_CREAT|O_LARGEFILE, 0700) # init/initramfs.c:705
-> file_open_name(name, flags, mode) # fs/open.c:1338
-> do_file_open(AT_FDCWD, name, &op) # fs/open.c:1322
-> path_openat(&nd, op, flags) # fs/namei.c:4821
-> path_init(&nd, flags) # fs/namei.c:2673
-> nd_jump_root() # absolute path "/"
-> set_root() # uses current->fs = init_fs (NULLFS)
-> xwrite(file, ...) # init/initramfs.c:709 (write to already-open file, SAFE)
-> fput(file) # init/initramfs.c:714
```

**Callchain 7: do_name() hardlink via maybe_link()**

```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer() -> do_name()
-> clean_path(collected, mode) # init/initramfs.c:378 (same as callchain 1)
[S_ISREG(mode):]
-> maybe_link() # init/initramfs.c:380
-> find_link(major, minor, ino, mode, collected) # init/initramfs.c:90
[returns non-NULL old name for nlink >= 2 and matching hash entry:]
-> clean_path(collected, 0) # init/initramfs.c:351
-> init_stat() -> kern_path() # -> path_init() -> NULLFS
-> init_rmdir() or init_unlink() # -> path_init() -> NULLFS
-> init_link(old, collected) # init/initramfs.c:352
-> filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0) # fs/init.c:169
-> filename_lookup(AT_FDCWD, old, 0, &old_path, NULL) # fs/namei.c:5816
-> path_lookupat() -> path_init() # -> NULLFS
-> filename_create(AT_FDCWD, new, &new_path, 0) # fs/namei.c:5822
-> filename_parentat() -> path_parentat() -> path_init() # -> NULLFS
```

When `initramfs_async=1` (the default), `do_populate_rootfs()` runs
in an async kworker. The kworker's `current->fs` is `init_fs` which
now points to **nullfs**. All path lookups resolve "/" against the
nullfs root.

The rootfs (initramfs) is overmounted on top of nullfs's root dentry.
However, `path_init()` does **not** follow overmounts when establishing
the starting point — it sets `nd->path` to the raw `current->fs->root`
(nullfs root dentry on nullfs vfsmount). Mount following only occurs
during component-by-component traversal in `link_path_walk()` via
`step_into()` -> `handle_mounts()`. Since the starting dentry is the
nullfs root (below the overmount), component lookups call nullfs's
`->lookup` which returns -ENOENT (nullfs has no directory entries).

**Result: all `init_*()` and `filp_open()` calls will fail with -ENOENT
in async kworker context.**
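
The effect can be reproduced in a tiny userspace model. Everything
below is hypothetical (`toy_dir`, `toy_follow_mounts()`,
`toy_lookup_component()`); it only mirrors the reasoning that a walk
starting at the raw nullfs root never sees what is mounted on top of it.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy directory: a list of child names plus an optional overmount. */
struct toy_dir {
	const char *const *children;	/* NULL-terminated */
	const struct toy_dir *overmount;/* what is mounted on top, if any */
};

const char *const toy_no_children[] = { NULL };
const char *const toy_rootfs_children[] = { "init", "dev", NULL };

/* rootfs is mounted on top of the (empty) nullfs root. */
const struct toy_dir toy_rootfs = { toy_rootfs_children, NULL };
const struct toy_dir toy_nullfs = { toy_no_children, &toy_rootfs };

/* Cross overmounts until the topmost mounted directory is reached. */
const struct toy_dir *toy_follow_mounts(const struct toy_dir *d)
{
	while (d->overmount)
		d = d->overmount;
	return d;
}

/* Look up one component directly in 'start', without mount crossing. */
int toy_lookup_component(const struct toy_dir *start, const char *name)
{
	size_t i;

	for (i = 0; start->children[i]; i++)
		if (strcmp(start->children[i], name) == 0)
			return 0;
	return -ENOENT;
}
```

Starting at `&toy_nullfs` the lookup of "init" returns -ENOENT, while
starting at `toy_follow_mounts(&toy_nullfs)` finds it. The model's raw
starting point corresponds to the unconverted kworker case.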

When `initramfs_async=0`, `populate_rootfs()` calls
`wait_for_initramfs()` which calls `async_synchronize_cookie_domain()`
to wait for the async work to complete. But the work was already queued
to the async_wq workqueue — `wait_for_initramfs` does not change which
context runs the work. The work still runs in a kworker.

However, there is the OOM fallback: if `kzalloc` fails in
`async_schedule_node_domain()`, the function runs synchronously in PID 1
context (safe).

**Converted to LOOKUP_IN_INIT**

==== 6. Firmware loader -- system workqueue ====

Reached via `request_firmware_nowait()` -> workqueue ->
`request_firmware_work_func()`, and also via synchronous
`request_firmware()` from any kthread caller (428+ callers across
drivers).

Already uses `kernel_read_file_from_path_initns()` which calls
`init_root()`.

**Converted**

==== 7. IMA/EVM integrity -- kernel_init kthread ====

kernel_init() # init/main.c
-> kernel_init_freeable()
-> integrity_load_keys() # hook, called when rootfs is ready
+- ima_load_x509()
| -> integrity_load_x509()
| -> kernel_read_file_from_path() # NOT _initns
+- evm_load_x509() # if !CONFIG_IMA_LOAD_X509
-> integrity_load_x509()
-> kernel_read_file_from_path() # NOT _initns

This is called from PID 1 before init is exec'd where we are chrooted into
the initramfs. The correct context will be available so nothing to worry about.

**No conversion needed**

==== 8. Btrfs -- `btrfs-devrepl` kthread ====

**Kthread creation:**

```
open_ctree() / btrfs_remount_rw() # mount/remount context
-> btrfs_start_pre_rw_mount() # fs/btrfs/disk-io.c:3038
-> btrfs_resume_dev_replace_async() # fs/btrfs/dev-replace.c:1188
-> kthread_run(btrfs_dev_replace_kthread, ..., "btrfs-devrepl") # line 1237
-> kthread_create() -> kthreadd -> kernel_thread(CLONE_FS|CLONE_FILES|SIGCHLD)
```

Dedicated kthread sharing init_fs (nullfs).

**Callchain 1 (kthread -- NEEDS CONVERSION):**

```
btrfs_dev_replace_kthread() # fs/btrfs/dev-replace.c:1239 [kthread context]
-> btrfs_scrub_dev()
-> btrfs_dev_replace_finishing() # fs/btrfs/dev-replace.c:856
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```

**Callchain 2 (kthread, error path -- NEEDS CONVERSION):**

```
btrfs_dev_replace_kthread() # fs/btrfs/dev-replace.c:1239 [kthread context]
-> btrfs_dev_replace_finishing() # fs/btrfs/dev-replace.c:856
-> btrfs_destroy_dev_replace_tgtdev() # fs/btrfs/volumes.c:2512 (error/cleanup)
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```

**Callchain 3 (ioctl DEV_REPLACE_CMD_START -- SAFE: user context):**

```
btrfs_ioctl() # user syscall context
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_by_ioctl() # fs/btrfs/dev-replace.c:730
-> btrfs_dev_replace_start() # fs/btrfs/dev-replace.c:584
-> btrfs_dev_replace_finishing() # fs/btrfs/dev-replace.c:856
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```

**Callchain 4 (ioctl DEV_REPLACE_CMD_START, error -- SAFE: user context):**

```
btrfs_ioctl()
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_by_ioctl() # fs/btrfs/dev-replace.c:730
-> btrfs_dev_replace_start() # fs/btrfs/dev-replace.c:584
-> btrfs_destroy_dev_replace_tgtdev() # error/leave path, line 711
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```

**Callchain 5 (ioctl DEV_REPLACE_CMD_START, nested error -- SAFE: user context):**

```
btrfs_ioctl()
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_by_ioctl()
-> btrfs_dev_replace_start()
-> btrfs_dev_replace_finishing()
-> btrfs_destroy_dev_replace_tgtdev() # error within finishing
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```

**Callchain 6 (ioctl DEV_REPLACE_CMD_CANCEL -- SAFE: user context):**

```
btrfs_ioctl()
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_cancel() # fs/btrfs/dev-replace.c:1075
-> btrfs_destroy_dev_replace_tgtdev() # fs/btrfs/volumes.c:2512
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```

**Callchain 7 (ioctl BTRFS_IOC_RM_DEV -- SAFE: user context):**

```
btrfs_ioctl()
-> btrfs_ioctl_rm_dev() # fs/btrfs/ioctl.c:2582
-> btrfs_rm_device() # fs/btrfs/volumes.c:2288
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time()
-> kern_path()
```

**Callchain 8 (ioctl BTRFS_IOC_RM_DEV_V2 -- SAFE: user context):**

```
btrfs_ioctl()
-> btrfs_ioctl_rm_dev_v2() # fs/btrfs/ioctl.c:2514
-> btrfs_rm_device() # fs/btrfs/volumes.c:2288
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```

**Callchain 9 (ioctl BTRFS_IOC_ADD_DEV -- SAFE: user context):**

```
btrfs_ioctl()
-> btrfs_ioctl_add_dev() # fs/btrfs/ioctl.c:2455
-> btrfs_init_new_device() # fs/btrfs/volumes.c:2802
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```

Only callchains 1 and 2 (the `btrfs-devrepl` kthread and its error
path) need conversion. All other paths are ioctl/user syscall context.

**Converted to LOOKUP_IN_INIT**

==== 9. SCSI Target (LIO) -- target workqueues ====

Via `target_queued_submit_work` / `target_complete_ok_work` workqueues.

**Workqueue creation:**

```
module_init(target_core_init_configfs) # drivers/target/target_core_configfs.c:3852
-> init_se_kmem_caches() # drivers/target/target_core_transport.c:60
-> alloc_workqueue("target_completion", WQ_MEM_RECLAIM|WQ_PERCPU, 0) # line 128
-> alloc_workqueue("target_submission", WQ_MEM_RECLAIM|WQ_PERCPU, 0) # line 133
```

Work items execute in kworker kthreads (children of kthreadd, share init_fs).

**Converted to LOOKUP_IN_INIT**

==== 10. RNBD server -- RDMA CQ workqueue (IB_POLL_WORKQUEUE) ====

**Workqueue creation:**

The underlying workqueue is `ib_comp_wq`:

```
module_init(ib_core_init) # drivers/infiniband/core/device.c:2994
-> alloc_workqueue("ib-comp-wq",
WQ_HIGHPRI|WQ_MEM_RECLAIM|WQ_SYSFS|WQ_PERCPU, 0) # line 3007
```

At connection time, CQ completion work is bound to `ib_comp_wq`:

```
rtrs_srv_rdma_cm_handler()
-> create_con() # drivers/infiniband/ulp/rtrs/rtrs-srv.c:1704
-> rtrs_cq_qp_create(..., IB_POLL_WORKQUEUE) # line 1759
-> ib_alloc_cq() -> __ib_alloc_cq() # drivers/infiniband/core/cq.c:212
-> cq->comp_wq = ib_comp_wq # line 276
```

Work items execute in kworker kthreads (children of kthreadd, share init_fs).

**Converted to LOOKUP_IN_INIT**

==== 11. NFS client pNFS block layout -- rpciod/nfsiod workqueue (potentially) ====

**Workqueue creation:**

```
module_init(init_nfs_fs) # fs/nfs/inode.c:2809
-> nfsiod_start() # fs/nfs/inode.c:2620
-> alloc_workqueue("nfsiod", WQ_MEM_RECLAIM|WQ_UNBOUND, 0) # line 2627
```

Work items execute in kworker kthreads (children of kthreadd, share init_fs).

**Converted to LOOKUP_IN_INIT**

==== 12. NFS4 referral -- automount ====

`nfs4_submount()` is an automount callback triggered during path walk.
Always user process context.

**No conversion needed**

==== 13. Cachefiles -- fscache cookie workers ====

The fscache cookie worker path is workqueue context:

```
fscache_cookie_worker() [work_struct]
-> fscache_cookie_state_machine()
-> fscache_perform_lookup() -> cachefiles_lookup_cookie()
-> cachefiles_look_up_object() -> lookup_one_positive_unlocked()
-> fscache_perform_invalidation() -> cachefiles_invalidate_cookie()
-> cachefiles_bury_object() -> lookup_one()
```

However, `lookup_one()` and `lookup_one_positive_unlocked()` are **dentry-level
lookups** relative to a parent dentry. They do NOT use
`current->fs->root` for path resolution.
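
A userspace analogy (not the kernel API) for why parent-relative
lookups are unaffected: `openat()` with a directory fd resolves a
relative name against that directory, no matter what the task's working
directory is. The helper name and the "/tmp" location are assumptions of
this sketch; note that it changes the calling process's cwd to "/".

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 1 if a dirfd-relative lookup still works after the working
 * directory changed, 0 if it does not, -1 if the setup failed.
 */
int relative_lookup_survives_chdir(void)
{
	char tmpl[] = "/tmp/relXXXXXX";
	int dirfd, fd, ok;

	if (!mkdtemp(tmpl))
		return -1;

	dirfd = open(tmpl, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0) {
		rmdir(tmpl);
		return -1;
	}

	/* Create an entry relative to the parent directory fd. */
	fd = openat(dirfd, "entry", O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 || chdir("/") != 0) {
		if (fd >= 0) {
			close(fd);
			unlinkat(dirfd, "entry", 0);
		}
		close(dirfd);
		rmdir(tmpl);
		return -1;
	}
	close(fd);

	/* The cwd is now "/", yet the parent-relative lookup still works. */
	fd = openat(dirfd, "entry", O_RDONLY);
	ok = fd >= 0;
	if (fd >= 0)
		close(fd);

	/* Clean up the temporary entry and directory. */
	unlinkat(dirfd, "entry", 0);
	close(dirfd);
	rmdir(tmpl);
	return ok;
}
```

The lookup depends only on the parent handle that was passed in, which
is the property that makes the dentry-level helpers above independent
of the worker's fs_struct.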

The `cachefiles_add_cache()` -> `kern_path()` path is daemon ioctl context
(user process).

**No conversion needed**

==== 14. Audit subsystem -- netlink handler ====

The `kern_path()` calls are all reached via:

```
audit_receive() -> audit_receive_msg() -> audit_trim_trees() / audit_tag_tree()
```

`audit_receive()` is a netlink callback running in the **context of the
userspace process** (auditctl) that sent the netlink message. The
`prune_tree_thread` kthread (launched by `audit_launch_prune()`) calls
`prune_one()` which does NOT do path lookups. **No conversion needed.**

==== 15. AMD SEV -- `__init` path ====

Uses `init_root()` via `open_file_as_root()` -> `file_open_root()`
(drivers/crypto/ccp/sev-dev.c:265).

**Converted**

==== 16. Overlayfs -- VFS operation context ====

Triggered from `ovl_open()` / `ovl_d_real()` -- inherits caller's
context.

Uses `vfs_path_lookup(layer->mnt->mnt_root, layer->mnt, ...)` with an
explicit root/vfsmount. Does not go through fs_struct root at all.

**No conversion needed.**

==== 17. Module init (kthread context when built-in) ====

Note: `early_boot_devpath()` uses `early_lookup_bdev()` (not `kern_path`)
for the device lookup, but then calls `init_unlink()` and `init_mknod()`
which perform path lookups via `filename_unlinkat()` and
`filename_mknodat()`.

Built-in `module_init` runs in PID 1 context. Modules loaded at runtime
run in modprobe context (user process, safe).

**No conversion needed.**

==== 18. EROFS -- mount operation ====

`erofs_fc_get_tree()` is the `.get_tree` callback in `erofs_context_ops`
(`fs/erofs/super.c:884`), invoked via `vfs_get_tree()`.

The `filp_open()` call happens only on the `CONFIG_EROFS_FS_BACKED_BY_FILE`
path, when `get_tree_bdev_flags()` returns `-ENOTBLK` and the source is a
regular file.

**kthread path (boot-time root mount):**

```
kernel_init_freeable()
-> prepare_namespace()
-> mount_root()
-> mount_root_generic() / mount_nodev_root()
-> do_mount_root()
-> init_mount()
-> path_mount()
-> do_new_mount()
-> vfs_get_tree()
-> erofs_fc_get_tree()
```

This is reachable when erofs is used as the root filesystem
(`rootfstype=erofs`). At this point PID 1 has been spawned via
`user_mode_thread(kernel_init, ...)` but has not yet exec'd the
userspace init binary. It has the correct lookup context because it was
chrooted into the initramfs.

**No conversion needed.**

Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
Christian Brauner (11):
kthread: refactor __kthread_create_on_node() to take a struct argument
kthread: remove unused flags argument from kthread worker creation API
kthread: add extensible kthread_create()/kthread_run() pattern
fs: notice when init abandons fs sharing
fs: add LOOKUP_IN_INIT
fs: add file_open_init()
block: add bdev_file_open_init()
fs: allow to pass lookup flags to filename_*()
fs: add init_root()
tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT
fs: isolate all kthreads in nullfs

arch/x86/kvm/i8254.c | 2 +-
block/bdev.c | 60 ++++++--
crypto/crypto_engine.c | 2 +-
drivers/block/rnbd/rnbd-srv.c | 2 +-
drivers/char/misc_minor_kunit.c | 2 +-
drivers/cpufreq/cppc_cpufreq.c | 2 +-
drivers/crypto/ccp/sev-dev.c | 4 +-
drivers/dpll/zl3073x/core.c | 2 +-
drivers/gpu/drm/drm_vblank_work.c | 6 +-
.../gpu/drm/i915/gem/selftests/i915_gem_context.c | 4 +-
drivers/gpu/drm/i915/gt/selftest_execlists.c | 2 +-
drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 4 +-
drivers/gpu/drm/i915/gt/selftest_slpc.c | 2 +-
drivers/gpu/drm/i915/selftests/i915_request.c | 12 +-
drivers/gpu/drm/msm/disp/msm_disp_snapshot.c | 2 +-
drivers/gpu/drm/msm/msm_atomic.c | 2 +-
drivers/gpu/drm/msm/msm_gpu.c | 2 +-
drivers/gpu/drm/msm/msm_kms.c | 2 +-
.../media/platform/chips-media/wave5/wave5-vpu.c | 2 +-
drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
drivers/net/ethernet/intel/ice/ice_dpll.c | 4 +-
drivers/net/ethernet/intel/ice/ice_gnss.c | 2 +-
drivers/net/ethernet/intel/ice/ice_ptp.c | 4 +-
drivers/platform/chrome/cros_ec_spi.c | 2 +-
drivers/ptp/ptp_clock.c | 2 +-
drivers/spi/spi.c | 2 +-
drivers/target/target_core_alua.c | 2 +-
drivers/target/target_core_pr.c | 2 +-
drivers/usb/gadget/function/uvc_video.c | 2 +-
drivers/usb/typec/tcpm/tcpm.c | 2 +-
drivers/vdpa/vdpa_sim/vdpa_sim.c | 4 +-
drivers/watchdog/watchdog_dev.c | 2 +-
fs/btrfs/volumes.c | 6 +-
fs/coredump.c | 8 +-
fs/erofs/zdata.c | 2 +-
fs/fs_struct.c | 92 ++++++++++++
fs/init.c | 23 +--
fs/internal.h | 18 ++-
fs/kernel_read_file.c | 4 +-
fs/namei.c | 71 +++++----
fs/namespace.c | 4 -
fs/nfs/blocklayout/dev.c | 4 +-
fs/open.c | 25 +++
fs/smb/server/mgmt/share_config.c | 3 +-
fs/smb/server/smb2pdu.c | 2 +-
fs/smb/server/vfs.c | 6 +-
include/linux/blkdev.h | 2 +
include/linux/fs.h | 1 +
include/linux/fs_struct.h | 5 +
include/linux/init_task.h | 1 +
include/linux/kthread.h | 97 +++++++-----
include/linux/namei.h | 3 +-
include/linux/sched/task.h | 1 +
init/initramfs.c | 4 +-
init/initramfs_test.c | 4 +-
init/main.c | 10 +-
io_uring/fs.c | 10 +-
kernel/fork.c | 40 +++--
kernel/kthread.c | 167 ++++++++++++++-------
kernel/rcu/tree.c | 4 +-
kernel/sched/ext.c | 2 +-
kernel/workqueue.c | 2 +-
net/dsa/tag_ksz.c | 4 +-
net/dsa/tag_ocelot_8021q.c | 2 +-
net/dsa/tag_sja1105.c | 4 +-
net/unix/af_unix.c | 4 +-
66 files changed, 526 insertions(+), 257 deletions(-)
---
base-commit: 10047142d6ce3b8562546c61f3cf57f852b9b950
change-id: 20260303-work-kthread-nullfs-875a837f4198