Re: fs: NULL deref in atime_needs_update

From: Dmitry Vyukov
Date: Wed Feb 24 2016 - 05:17:25 EST

On Wed, Feb 24, 2016 at 11:03 AM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> On Wed, Feb 24, 2016 at 5:46 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>> On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote:
>>> On Wed, 2016-02-17 at 00:40 +0100, MickaÃl SalaÃn wrote:
>>> > Hi,
>>> >
>>> > Actually I found the same bug (without fuzzing) and I can reproduce it
>>> > in a deterministic way (e.g. by creating a LSM that return 1 for the
>>> > security_file_open hook). At least, from v4.2.8 I can easily trigger
>>> > traces like this :
>>> Reading through this thread I wonder if this is a new problem.
>>> Is there a previous kernel it can't be reproduced on?
>>> Perhaps a bisect will shed some light on what's happening.
>> There are several things in the mix. What MickaÃl has found is that a bunch
>> of places where _positive_ number returned instead of expected 0 or -E... can
>> get propagated all way back to caller of do_last(), confusing the hell out
>> of it. That's not what Dmitry has triggered, though. WARN_ON() in the
>> end of do_last() would've triggered, and IMO this one, along with mitigation
>> (map that "error value" to -EINVAL) should go into mainline and all -stable.
>> Bogus ->open() returning a positive number had always been bad news; in the
>> best case it would be returned to userland, leading to "open(2) has failed
>> and returned a positive number". Hell knows if we ever had such instances
>> (or have them right now), but I wouldn't bet on their absense. Rare
>> failure exits returning bogus values in an ->open() instance in some driver
>> can easily stay unnoticed for a long time. Starting from at least 3.6
>> (circa the atomic_open support) it got more unpleasant than simple "confuse
>> the hell out of userland". ->open() isn't the only vector for injection of
>> such crap - ->permission() would also serve, same for several LSM turds, etc.
>> Again, that's a separate problem. What Dmitry seems to be catching is getting
>> crap values fed to should_follow_link() as inode. I see one bug in that
>> area that does need fixing (fix present in the last patch I've posted, with
>> WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should
>> be gone from the final variant, since this can trigger without driver bugs,
>> etc.) In RCU mode after we'd checked that dentry is a symlink one, we need
>> to verify that it hadn't been changed since before we'd fetched the inode.
>> It might have been e.g. a regular file, which got unlinked with symlink
>> created in its place. Then we'd go into get_link() with non-symlink inode
>> and oops on attempt to call its ->i_op->get_link(). That race is real, very
>> hard to hit (you need both the unlink(2) and symlink(2) to happen between
>> lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you
>> can't lose the timeslice there) and would've manifested differently.
>> But that leaves other two kinds of bugs getting triggered: inode of some
>> non-symlink is possible, but what we saw included NULL inode when we'd
>> reached finish_open: in do_last(). Should be flat-out impossible - we either
>> get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or
>> get NULL inode from pinned d_is_symlink() dentry after having grabbed
>> a reference and left RCU mode. Neither should be possible without either
>> something weird happening to lookup_fast() (and we would've seen oopsen in
>> link_path_walk() if that could happen, BTW) or screwed dentry refcounting
>> somewhere, combined with a race that managed to turn...
>> Oh, shit. No screwed refcounting is needed. Look:
>> BUG_ON(nd->flags & LOOKUP_RCU);
>> inode = d_backing_inode(path.dentry);
>> seq = 0; /* out of RCU mode, so the value doesn't matter */
>> if (unlikely(d_is_negative(path.dentry))) {
>> path_to_nameidata(&path, nd);
>> return -ENOENT;
>> }
>> Suppose we come here with negative path.dentry. We are holding a reference,
>> all right, and for a _postive_ dentry that would've been enough to keep
>> it positive. Not so for a negative one, though - symlink(2) on another
>> CPU doint d_instantiate() just before the d_is_negative() check and we
>> are fucked - inode is stale and we sail through all the checks, all the
>> way into should_follow_link().
>> We also have the same kind of crap in walk_component() -
>> err = lookup_slow(nd, &path);
>> if (err < 0)
>> return err;
>> inode = d_backing_inode(path.dentry);
>> seq = 0; /* we are already out of RCU mode */
>> err = -ENOENT;
>> if (d_is_negative(path.dentry))
>> goto out_path_put;
>> There it's much harder to hit, though - we need it not just d_instantiate()
>> overlapping those lines; we need the racing syscall to get from locking
>> the parent to d_instantiate() between the point where lookup_slow() has
>> unlocked the parent and d_is_negative(). And lookup_slow() couldn't have
>> gone into mountpoint crossing, so it's really hard to hit - you pretty
>> much have to get preempted just after fetching inode.
>> OK, the next delta to try, and there definitely are several commits in
>> that pile. It still does not explain the traces with GPF at 0x50, though -
>> for that we need not just a NULL getting to should_follow_link() but
>> something non-NULL with NULL at offset 40 from it (offset of ->i_sb in
>> struct inode). That something *can't* be a valid struct inode or had been
>> one in recent past - ->i_sb is assigned in new_inode(), value is always
>> non-NULL and never modified all the way until RCU-delayed freeing of struct
>> inode. So that has to be something entirely different... Anyway, the
>> patch so far follows:
> Restarted testing with this patch (dropped previous patches because
> they conflict).
> These "unlikely" scenarios can be more likely inside of VMs where
> effective preemption can happen at random points. Also NMIs probably
> can increase probability of such races.

For now I can only say that I am hitting this one (3 times in 20 mins):

if (read_seqcount_retry(&link->dentry->d_seq, seq)) {
WARN_ON(1); // just as way to report hitting
// that path; it's OK to get
// here, in the final variant
// WARN_ON will disappear.