Re: fs: NULL deref in atime_needs_update

From: Al Viro
Date: Mon Feb 29 2016 - 08:09:36 EST


On Sun, Feb 28, 2016 at 08:01:01PM +0000, Al Viro wrote:
> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:
>
> > Erm... What's to order ->d_inode and ->d_flags fetches there? David?
> > Looks like the barrier in d_is_negative() is on the wrong side of fetch.
> > Confused...
>
> OK, as per David's suggestion, let's flip them around, bringing the
> barrier in d_is_negative() between them. Dmitry, could you try this on
> top of mainline? Again, it's until the first warning.

Hmm... Reordering is definitely wrong, but what I really wonder is if
dentry_rcuwalk_invalidate() is right outside of __d_drop(). IOW, is
it right in __d_instantiate() and dentry_unlink_inode()? The code
dealing with ->d_flags in RCU mode is more interested in coherency between
->d_flags and ->d_inode and it looks like we'd need *two* increments -
even-to-odd before updating both and odd-to-even after both are in sync.
The more I look at the situation with d_is_...() wrt barriers and ->d_seq,
the less I understand it; outside of RCU mode we don't really need the
barriers for that stuff and in RCU mode ->d_flags handling had been
a serious headache all along...

I'm tempted to do as below; the amount of smp_wmb() remains the same and
so's the amount of stores (splitting that += 2 in two doesn't affect that -
we dirty the same cacheline before and after anyway). OTOH, that would
mean that ->d_seq match guarantees ->d_flags and ->d_inode being in sync.
And I suspect that we could drop _read_ barriers in d_is_...() after that;
in non-RCU mode we don't give a damn anyway and in RCU one ->d_seq check
would provide one; it doesn't really matter on x86, but smp_rmb() may be
costly. Splitting ->d_seq increments *does* matter on x86 wrt correctness;
in-between state becomes guaranteed ->d_seq mismatch and that just might
be it. Dmitry, could you add this on top of the previous patch?

David, Linus, do you see any problems with that? To me it looks saner
that way and as cheap as the current code, but I might be missing something
here...

diff --git a/fs/dcache.c b/fs/dcache.c
index 92d5140..2c08cce 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -279,7 +279,6 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
unsigned flags;

dentry->d_inode = inode;
- smp_wmb();
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
@@ -300,7 +299,6 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)

flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
WRITE_ONCE(dentry->d_flags, flags);
- smp_wmb();
dentry->d_inode = NULL;
}

@@ -370,9 +368,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
__releases(dentry->d_inode->i_lock)
{
struct inode *inode = dentry->d_inode;
+
+ raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
hlist_del_init(&dentry->d_u.d_alias);
- dentry_rcuwalk_invalidate(dentry);
+ raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
if (!inode->i_nlink)
@@ -1758,8 +1758,9 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
spin_lock(&dentry->d_lock);
if (inode)
hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+ raw_write_seqcount_begin(&dentry->d_seq);
__d_set_inode_and_type(dentry, inode, add_flags);
- dentry_rcuwalk_invalidate(dentry);
+ raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
fsnotify_d_instantiate(dentry, inode);
}