Re: [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs

From: Christian Brauner

Date: Tue Mar 10 2026 - 10:11:11 EST


On Mon, Mar 09, 2026 at 05:50:36PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > The places that need to perform lookup in init's filesystem state may
> > use scoped_with_init_fs() which will temporarily override the caller's
> > fs_struct with init's fs_struct.
>
> One small concern I have about the overall approach is that the use of
> scoped_with_init_fs() in non-kernel tasks reminds me a _little_ bit of
> the set_fs(KERNEL_DS) mechanism that was removed a few years ago:
> There is state in the task that controls whether some argument is
> interpreted as a user-supplied, untrusted value or a kernel-supplied
> value that is interpreted in some more privileged scope. I think there
> were occasionally security issues where userspace-supplied pointers
> were accidentally accessed under KERNEL_DS, allowing userspace to
> cause accesses to arbitrary kernel addresses - in particular,
> performance interrupts could occur in KERNEL_DS sections and attempt
> to access userspace stack memory, see
> <https://project-zero.issues.chromium.org/42452355>.
>
> I think switching task_struct::fs is much less problematic - path
> walks shouldn't happen in IRQ context or such, scoped_with_init_fs()
> will likely only be used when accessing paths that unprivileged
> userspace has no influence over, and VFS operations normally don't
> operate on multiple logically unrelated file paths; but it means we'll
> have to keep in mind that filesystem handlers for some operations like
> lookup/open can run with weird task_struct::fs.
>
> To be clear, I think what you're doing is fine; it's just something to
> keep in mind.

Just for some background. I think as it currently stands we have a 1:1
sharing between all kthreads and pid 1. So effectively a kthread is in a
permanent scoped_with_init_fs() block. Any driver can just do:

static const char cmd[] = "/usr/bin/systemctl poweroff";
loff_t pos = 0;
file = filp_open("/proc/sys/kernel/core_pattern", O_WRONLY, 0);
kernel_write(file, cmd, sizeof(cmd) - 1, &pos);

which is of course nonsense, but still.

But my wider point is that this implicit lookup context is probably on
very few people's minds.

Some people who are aware of this then end up with brilliant ideas such
as writing kernel modules that perform mountains of actual path lookup
work from kthread context because it's just so easy to do and lets them
avoid having to do any real conceptual work to come up with a better
solution.

Offloading fs work to kthreads is really nasty... And we've relearned
that lesson not too long ago when io_uring was still based on kthreads
with custom credential overrides. It's a broken concept.

scoped_with_init_fs() forces the users that do this to acknowledge that
they are now performing lookup work within PID 1's filesystem state. We
have few of those and this will make it harder to gain more.
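
To make the "explicit scope" point concrete, here is a userspace toy
model. It is not code from the series: struct task, override_fs(),
restore_fs() and scoped_with_fs() are made-up names for illustration, and
the real scoped_with_init_fs() may well be implemented differently. The
sketch assumes a compiler with the GCC/Clang cleanup attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins, not the kernel's definitions. */
struct fs_struct { const char *root; };
struct task { struct fs_struct *fs; };

static struct fs_struct init_fs = { .root = "/" };
static struct fs_struct nullfs = { .root = "nullfs" };

/* Remembers which fs_struct to put back when the scope ends. */
struct fs_override {
	struct task *tsk;
	struct fs_struct *saved;
};

static struct fs_override override_fs(struct task *tsk, struct fs_struct *fs)
{
	struct fs_override o = { .tsk = tsk, .saved = tsk->fs };

	tsk->fs = fs;
	return o;
}

static void restore_fs(struct fs_override *o)
{
	o->tsk->fs = o->saved;
}

/*
 * Runs the following block exactly once with tsk->fs overridden. The
 * cleanup attribute restores the old value when the for scope is left.
 */
#define scoped_with_fs(tsk, newfs)					\
	for (struct fs_override o_					\
		     __attribute__((cleanup(restore_fs))) =		\
		     override_fs((tsk), (newfs)), *once_ = &o_;		\
	     once_; once_ = NULL)

/* "Path lookup" in this model just reports which root the task sees. */
static const char *lookup_root(const struct task *tsk)
{
	return tsk->fs->root;
}

int demo(void)
{
	struct task kthread = { .fs = &nullfs };

	/* By default the kthread sees nothing useful. */
	assert(strcmp(lookup_root(&kthread), "nullfs") == 0);

	/* Lookups in init's state have to be asked for explicitly. */
	scoped_with_fs(&kthread, &init_fs) {
		assert(strcmp(lookup_root(&kthread), "/") == 0);
	}

	/* Leaving the block puts the kthread back in nullfs. */
	assert(kthread.fs == &nullfs);
	return 0;
}
```

The point of the shape is that the override is visible at the call site
and cannot outlive the block.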