Re: [RFC][PATCH] vfs: In mntput run deactivate_super on a shallow stack.
From: Eric W. Biederman
Date: Sun Apr 13 2014 - 03:02:50 EST
Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:
> On Sat, Apr 12, 2014 at 03:15:39PM -0700, Eric W. Biederman wrote:
>
>> Can you explain which scenario you are thinking about with respect to a
>> failed modprobe?
>
> Completely made up example:
>
> static struct file_system_type foofs = {
> .mount = mount_foo,
> .kill_sb = kill_foo,
> };
>
> static struct vfsmount *mnt;
>
> static __init int foo_init(void)
> {
> int err;
> err = init_some();
> if (err < 0)
> return err;
> mnt = kern_mount(&foofs);
> if (IS_ERR(mnt)) {
> uninit_some();
> return PTR_ERR(mnt);
> }
> err = init_some_more();
> if (err < 0) {
> kern_umount(mnt);
> uninit_some();
> return err;
> }
> printk(KERN_INFO "loaded foo");
> return 0;
> }
>
> Now, think what happens if init_some_more() in the above fails. With the
> current mntput() semantics, everything works. After making mntput() (from
> kern_umount()) delayed until the return to userland, we end up with attempt
> to call kill_foo() after the memory where it code sits gets freed. For that
> matter, by that point we are not even guaranteed to reach it, since it
> comes as mnt->mnt_sb->s_type->kill_sb() and s_type points to freed memory.
>
> I'm not saying that we have something that would closely resemble this
> example, but it's not hard to vary it in a lot of ways, keeping the same
> problem. Basically, you need to audit all paths leading from failure
> exits in some module_init() to mntput() and figure out if delaying the
> effect of that mntput() would be safe there (== doesn't get delayed past
> the point where we destroy something needed for that fs shutdown).
>
> It's not *that* horrible, since not too many modules out there are
> declaring any fs types, but it needs to be done. In theory, you could
> also fall prey to something like this:
> type = get_fs_type("proc");
> ns = kmalloc(...);
> /* fill *ns */
> mnt = kern_mount_data(type, p);
> ...
> if (error) {
> kern_unmount(mnt);
> kfree(p);
> put_filesystem(type);
> }
> possibly with get_fs_type() replaced with some other way to get that
> pointer to fs type (defined elsewhere). E.g. for procfs it could
> be, say, task_active_pid_ns(current)->proc_mnt->mnt_sb->s_type, etc.
>
> Again, it's not impossible to audit (there's not a lot of places where
> struct file_system_type * is ever stored, there are few instances of
> struct file_system_type, all statically allocated, etc.), but it's
> a non-trivial amount of work. And I honestly don't know if we have
> any such places right now. Moreover, unless you feel like repeating
> that kind of audit every merge window, we'll need a some way of dealing
> with such situations. Something like flush_pending_mntput(fs_type), for
> example, documented as barrier to be used in such places might do, but
> if you can think of something more fool-proof...
I performed a quick audit and I don't see that case happening in the
current code.
Further I believe we already have a pretty good protection against the
class of bug you are talking about.
deactivate_locked_super calls put_filesystem which calls module_put. So
we have a reference to the module the filesystem was loaded in so the
code and static data is not going away until the we free the super block
and decrement the module reference count.
So while I think it is possible for kill_sb, or put_super to reference
so global module data that we have cleaned up I think it is a pretty
unlikely situation.
kern_mount or kern_mount_data seems to be the very last thing modules
that reference it perform. I will take a closer look before I send the
final version of this but I think you are worrying about something that
just doesn't happen in practice and isn't likely too.
Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/