[PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs

From: Christian Brauner

Date: Thu Mar 05 2026 - 18:30:35 EST


Summary:

* all kthreads are isolated in a separate SB_KERNMOUNT of nullfs.
-> no lookup of anything else, no mounting on top of it, completely
isolated.
* init has a separate fs_struct from all kthreads
* scoped_with_init_fs() allows a kthread to temporarily assume init's
fs_struct for filesystem operations.

So this is a bit of a crazy series. When the kernel is started it
roughly goes like this:

init_task
==> create pid 1 (systemd etc.)
==> pid 2 (kthreadd)

After this point all kthreads and PID 1 share the same filesystem state.
That obviously already came up when we discussed pivot_root() as this
allows pivot_root() to rewrite the fs_struct of all kthreads.

This rewriting is really weird and mostly done so kthreads can use init's
filesystem state when they would like to. But this really should be
discouraged. The rewriting should also stop completely. I worked a bit
to get rid of it in a more fundamental way. Is it crazy? Yes. Is it
likely broken? Yes. Does it at least boot? Yes.

Instead of sharing fs_struct between kernel threads and pid 1, pid 1
gets a completely separate fs_struct. All kthreads continue sharing
init_fs as before and pid 1's fs_struct is isolated from the kthreads'
filesystem state. IOW, userspace init cannot affect kthreads' filesystem
state anymore and kthreads cannot affect userspace's filesystem state
anymore - without explicit opt-in.
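
To make the difference concrete, here is a small userspace model of the
two arrangements. It is only an illustration: the model_* structs and
helpers are made up for this example and are not the kernel's fs_struct
or task_struct code.

```c
/* Userspace model only; every model_* name here is hypothetical. */
#include <assert.h>
#include <string.h>

struct model_fs {
	const char *root;	/* stand-in for fs_struct's root */
};

struct model_task {
	struct model_fs *fs;	/* stand-in for task_struct's fs pointer */
};

/* Stand-in for pivot_root(): change the root seen through this fs state. */
static void model_pivot_root(struct model_task *t, const char *new_root)
{
	assert(t && t->fs && new_root);
	t->fs->root = new_root;
}

/* Old arrangement: pid 1 and the kthread point at the same fs state. */
static const char *model_kthread_root_shared(void)
{
	struct model_fs init_fs = { .root = "/initramfs" };
	struct model_task pid1 = { .fs = &init_fs };
	struct model_task kthread = { .fs = &init_fs };

	model_pivot_root(&pid1, "/newroot");
	return kthread.fs->root;	/* the kthread's root moved too */
}

/* New arrangement: pid 1 has its own fs state, kthreads sit in nullfs. */
static const char *model_kthread_root_separate(void)
{
	struct model_fs init_fs = { .root = "nullfs" };
	struct model_fs userspace_init_fs = { .root = "/initramfs" };
	struct model_task pid1 = { .fs = &userspace_init_fs };
	struct model_task kthread = { .fs = &init_fs };

	model_pivot_root(&pid1, "/newroot");
	return kthread.fs->root;	/* unaffected by pid 1 */
}
```

In the shared case the kthread ends up in "/newroot" because it never
had state of its own; in the separate case it stays in "nullfs".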

All kthreads are anchored in a kernel internal mount of nullfs that
cannot be mounted on and that cannot be used to follow other mounts.
It's a completely private mount that insulates kthreads.

This series makes performing mountains of filesystem work such as path
lookup and file opening and so on from kthreads hard - painfully so. I
think this is a benefit because it takes the idea of just offloading
_security sensitive_ operations in init's filesystem state, such as
running random binaries or opening and creating files, to kthreads
behind the shed... And imho it should.

The only remaining kernel tasks that actually share init's filesystem
state are usermodehelpers - as they execute random binaries in the root
filesystem. Another concept we should really show the back of the shed.

This gives a lot stronger guarantees than what we have now. This also
makes path lookup from kthreads fail by default. IOW, it won't be
possible anymore to just look up random stuff in init's filesystem state
without explicitly opting in to that.

The places that need to perform lookup in init's filesystem state may
use scoped_with_init_fs() which will temporarily override the caller's
fs_struct with init's fs_struct.
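
The "temporarily override, then restore" pattern can be sketched in
userspace with the compiler's cleanup attribute (GCC/clang), which is
the same building block the kernel's scope-based helpers rely on. This
is a model under assumptions, not the implementation in this series:
every model_* name is invented for the example, and -ENOENT is only a
placeholder for whatever error a lookup in nullfs really returns.

```c
/* Userspace model only; every model_* name here is hypothetical. */
#include <assert.h>
#include <errno.h>

struct model_fs { const char *root; };
struct model_task { struct model_fs *fs; };

static struct model_fs model_nullfs_fs = { .root = "nullfs" };
static struct model_fs model_userspace_init_fs = { .root = "/" };
/* The calling "kthread" starts out anchored in nullfs. */
static struct model_task model_current = { .fs = &model_nullfs_fs };

struct model_fs_override {
	struct model_fs *saved;	/* fs state to restore at scope exit */
	int done;		/* makes the for-loop body run once */
};

static struct model_fs_override model_override_begin(void)
{
	struct model_fs_override o = { .saved = model_current.fs, .done = 0 };

	model_current.fs = &model_userspace_init_fs;
	return o;
}

/* Runs automatically when the override variable leaves scope. */
static void model_override_end(struct model_fs_override *o)
{
	model_current.fs = o->saved;
}

#define model_scoped_with_init_fs()					\
	for (struct model_fs_override model_scope_			\
		__attribute__((cleanup(model_override_end))) =		\
			model_override_begin();				\
	     !model_scope_.done; model_scope_.done = 1)

/* Lookups only succeed while init's fs state is in effect. */
static int model_lookup(const char *path)
{
	assert(path);
	return model_current.fs == &model_nullfs_fs ? -ENOENT : 0;
}

static int model_kthread_open(const char *path)
{
	int ret = -ENOENT;

	model_scoped_with_init_fs()
		ret = model_lookup(path);
	return ret;	/* the nullfs state is already restored here */
}
```

The point of tying the restore to scope exit is that early returns or
breaks inside the scoped block cannot leave the caller stuck with
init's fs state.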

We now also warn and notice when pid 1 simply stops sharing filesystem
state with us, i.e., abandons its userspace_init_fs.

On older kernels, if PID 1 unshared its filesystem state with us, the
kernel simply kept using the stale fs_struct state, implicitly pinning
anything that PID 1 had last used, even if PID 1 might've moved on to
some completely different fs_struct state and might've even unmounted
the old root.

This has hilarious consequences: Think continuing to dump coredump
state into an implicitly pinned directory somewhere, or calling random
binaries in the old rootfs via usermodehelpers.

Be aggressive about this: We simply reject operating on stale
fs_struct state by reverting userspace_init_fs to nullfs. Every kworker
that does lookups after this point will fail. Every usermodehelper call
will fail. This is a lot stronger but I wouldn't know what it means for
pid 1 to simply stop sharing its fs state with the kernel. Clearly it
wanted to separate so cut all ties.
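
The policy described above can be modelled in a few lines of userspace
C. This is a sketch of the idea only: the model_* names are
hypothetical, the warning text is made up, and -ENOENT stands in for
whatever error the real lookups return once they land in nullfs.

```c
/* Userspace model only; every model_* name here is hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>

struct model_fs { const char *root; };
struct model_task { struct model_fs *fs; };

static struct model_fs model_nullfs = { .root = "nullfs" };
static struct model_fs model_boot_fs = { .root = "/" };
/* What kernel-side users consult for "init's filesystem state". */
static struct model_fs *model_userspace_init_fs = &model_boot_fs;

/* pid 1 switches to a different fs state, e.g. by unsharing it. */
static void model_pid1_switch_fs(struct model_task *pid1,
				 struct model_fs *fresh)
{
	assert(pid1 && fresh);
	if (pid1->fs == model_userspace_init_fs && fresh != pid1->fs) {
		/* Do not keep the stale state pinned; cut all ties. */
		fprintf(stderr, "pid 1 abandoned userspace_init_fs\n");
		model_userspace_init_fs = &model_nullfs;
	}
	pid1->fs = fresh;
}

/* Stand-in for a usermodehelper or kworker path lookup. */
static int model_umh_lookup(const char *path)
{
	assert(path);
	return model_userspace_init_fs == &model_nullfs ? -ENOENT : 0;
}
```

Before the switch, model_umh_lookup() succeeds; afterwards every call
fails instead of silently operating on the abandoned root.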

I've gone through the kernel and looked at hopefully everything that
does path lookup from kthreads (workqueues, ...).

TL;DR:

==== PID 1 (systemd) ====

root@localhost:~# stat --file-system /proc/1/root
File: "/proc/1/root"
ID: e3cb00dd533cd3d7 Namelen: 255 Type: ext2/ext3

root@localhost:~# cat /proc/1/mountinfo | wc -l
30

==== PID 2 (kthreadd) ====

root@localhost:~# stat --file-system /proc/2/root
File: "/proc/2/root"
ID: 200000000 Namelen: 255 Type: nullfs

root@localhost:~# cat /proc/2/mountinfo | wc -l
0
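
The same check can be done programmatically with statfs(2), which is
what stat --file-system reports. A minimal sketch, assuming a Linux
userspace and enough privilege to resolve /proc/<pid>/root; the helper
names are made up for the example, and the numeric magic for nullfs is
not given in this mail, so compare against the value from the tree
under test.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/vfs.h>

/*
 * Store the filesystem magic of @path in *@magic.
 * Returns 0 on success, -1 if statfs() fails (errno is set).
 */
static int fs_magic_of(const char *path, unsigned long *magic)
{
	struct statfs st;

	assert(path && magic);
	if (statfs(path, &st) != 0)
		return -1;
	*magic = (unsigned long)st.f_type;
	return 0;
}

/* Print the magic for @path, or the error that prevented reading it. */
static void print_fs_magic(const char *path)
{
	unsigned long magic;

	if (fs_magic_of(path, &magic) == 0)
		printf("%s: f_type=0x%lx\n", path, magic);
	else
		perror(path);
}
```

Calling print_fs_magic("/proc/1/root") and
print_fs_magic("/proc/2/root") as root should show two different
f_type values on a kernel with this series applied.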

Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
Changes in v2:
- Remove LOOKUP_IN_INIT in favor of scoped_with_init_fs().
- Link to v1: https://patch.msgid.link/20260303-work-kthread-nullfs-v1-0-87e559b94375@xxxxxxxxxx

---
Christian Brauner (23):
fs: notice when init abandons fs sharing
fs: add scoped_with_init_fs()
rnbd: use scoped_with_init_fs() for block device open
crypto: ccp: use scoped_with_init_fs() for SEV file access
scsi: target: use scoped_with_init_fs() for ALUA metadata
scsi: target: use scoped_with_init_fs() for APTPL metadata
btrfs: use scoped_with_init_fs() for update_dev_time()
coredump: use scoped_with_init_fs() for coredump path resolution
fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns()
ksmbd: use scoped_with_init_fs() for share path resolution
ksmbd: use scoped_with_init_fs() for filesystem info path lookup
ksmbd: use scoped_with_init_fs() for VFS path operations
initramfs: use scoped_with_init_fs() for rootfs unpacking
af_unix: use scoped_with_init_fs() for coredump socket lookup
fs: add real_fs to track task's actual fs_struct
fs: make userspace_init_fs a dynamically-initialized pointer
fs: stop sharing fs_struct between init_task and pid 1
fs: add umh argument to struct kernel_clone_args
fs: add kthread_mntns()
devtmpfs: create private mount namespace
nullfs: make nullfs multi-instance
fs: start all kthreads in nullfs
fs: stop rewriting kthread fs structs

drivers/base/devtmpfs.c | 2 +-
drivers/block/rnbd/rnbd-srv.c | 4 +-
drivers/crypto/ccp/sev-dev.c | 12 ++---
drivers/target/target_core_alua.c | 6 ++-
drivers/target/target_core_pr.c | 4 +-
fs/btrfs/volumes.c | 11 ++++-
fs/coredump.c | 11 ++---
fs/fs_struct.c | 96 ++++++++++++++++++++++++++++++++++++++-
fs/kernel_read_file.c | 9 +---
fs/namespace.c | 40 ++++++++++++++--
fs/nullfs.c | 7 +--
fs/smb/server/mgmt/share_config.c | 4 +-
fs/smb/server/smb2pdu.c | 4 +-
fs/smb/server/vfs.c | 14 ++++--
include/linux/fs_struct.h | 34 ++++++++++++++
include/linux/init_task.h | 1 +
include/linux/mount.h | 1 +
include/linux/sched.h | 1 +
include/linux/sched/task.h | 1 +
init/init_task.c | 1 +
init/initramfs.c | 12 +++--
init/main.c | 10 +++-
kernel/fork.c | 41 +++++++++++------
net/unix/af_unix.c | 17 +++----
24 files changed, 266 insertions(+), 77 deletions(-)
---
base-commit: c107785c7e8dbabd1c18301a1c362544b5786282
change-id: 20260303-work-kthread-nullfs-875a837f4198