Re: mnt_want_write_file() has problem?

From: Dave Hansen
Date: Mon Aug 03 2009 - 16:37:17 EST


On Tue, 2009-08-04 at 03:48 +0900, OGAWA Hirofumi wrote:
> Dave Hansen <dave@xxxxxxxxxxxxxxxxxx> writes:
>
> > On Mon, 2009-08-03 at 06:36 +0900, OGAWA Hirofumi wrote:
> >> While I'm reading some code, I suspected that mnt_want_write_file() may
> >> have wrong assumption. I think mnt_want_write_file() is assuming it
> >> increments ->mnt_writers if (file->f_mode & FMODE_WRITE). But, if it's
> >> special_file(), it is false?
> >>
> >> Sorry, I'm still not checking all of those though. E.g. I'm thinking the
> >> below.
> >>
> >> static inline int __get_file_write_access(struct inode *inode,
> >> struct vfsmount *mnt)
> >> {
> >> [...]
> >> if (!special_file(inode->i_mode)) {
> >> /*
> >> * Balanced in __fput()
> >> */
> >> error = mnt_want_write(mnt);
> >> if (error)
> >> put_write_access(inode);
> >> }
> >> return error;
> >> }
> >
> > In practice I don't think this is an issue. We were never supposed to
> > do mnt_want_write(mnt) for any 'struct file' that was a special_file(),
> > specifically because of what you mention.
> >
> > Nick's use of mnt_want_write_file() was a 1:1 drop-in for
> > mnt_want_write(). So, if all is well in the world, there should not be
> > any call sites where mnt_want_write_file() gets called on a
> > special_file().
>
> void file_update_time(struct file *file)
> sys_fchmod()
> sys_fchown()
> sys_fsetxattr()
> sys_fremovexattr()
>
> Um..., the users of mnt_want_write_file() seems to be those. I think
> all of those filp can be special file?

OK, I see where you're going now. I think the race goes like this:

Let's say we have a process with /dev/null opened with FMODE_WRITE. It
is the only file open on the filesystem and so the /dev mount has a 0
mnt_writers count. That process goes to f_chmod() its fd to /dev/null.
The code checks and notices that (file->f_mode & FMODE_WRITE), and goes
to mnt_clone_write().

At the same time, another process tries to 'mount -o remount,ro /dev'.
That process never sees mnt_clone_write()'s mnt_writers bump and allows
the remount,ro, even though there's an elevated mnt_writers count.

Here's a completely untested/uncompiled patch. I'll see if I can find a
test case that triggers this bug with the BUG_ON() in this patch.

diff --git a/fs/namespace.c b/fs/namespace.c
index 277c28a..a4714c4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -294,9 +294,17 @@ EXPORT_SYMBOL_GPL(mnt_want_write);
*
* After finished, mnt_drop_write must be called as usual to
* drop the reference.
+ *
+ * Be very careful using this. You must *guarantee* that
+ * this vfsmount has at least one existing, persistent writer
+ * that can not possibly go away, before calling this.
*/
int mnt_clone_write(struct vfsmount *mnt)
{
+ /* This would kill the performance
+ * optimization in this function
+ BUG_ON(count_mnt_writers(mnt) > 0);
+ */
/* superblock may be r/o */
if (__mnt_is_readonly(mnt))
return -EROFS;
@@ -312,14 +320,20 @@ EXPORT_SYMBOL_GPL(mnt_clone_write);
* @file: the file who's mount on which to take a write
*
* This is like mnt_want_write, but it takes a file and can
- * do some optimisations if the file is open for write already
+ * do some optimisations if the file is open for write already.
+ * We do not do mnt_want_write() on read-only or special files,
+ * so we can not use mnt_clone_write() for them.
*/
int mnt_want_write_file(struct file *file)
{
- if (!(file->f_mode & FMODE_WRITE))
- return mnt_want_write(file->f_path.mnt);
- else
- return mnt_clone_write(file->f_path.mnt);
+ struct path *path = &file->f_path;
+ struct inode *inode = path->dentry->d_inode;
+
+ if ((file->f_mode & FMODE_WRITE) &&
+ !special_file(inode))
+ return mnt_clone_write(path->mnt);
+
+ return mnt_want_write(path->mnt);
}
EXPORT_SYMBOL_GPL(mnt_want_write_file);



-- Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/