Re: [RFC] lookup_one_len_unlocked() lousy calling conventions

From: Al Viro
Date: Sun Nov 03 2019 - 13:21:03 EST


On Sun, Nov 03, 2019 at 04:35:24PM +0000, Al Viro wrote:

> lookup_one_len_unlocked() calling conventions are wrong for its callers.
> Namely, 11 out of 12 callers really want ERR_PTR(-ENOENT) on negatives.
> Most of them take care to check, some rely upon that being impossible in
> their case. Interactions with dentry turning positive right after
> lookup_one_len_unlocked() has returned it are of varying bugginess...
>
> The only exception is ecryptfs, where we do lookup in the underlying fs
> on ecryptfs_lookup() and want to retain a negative dentry if we get one.

Looking at that code... the thing that deals with the result of lookup in
underlying fs is ecryptfs_lookup_interpose(), and there we have this:
struct inode *inode, *lower_inode = d_inode(lower_dentry);
...
dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
...
if (d_really_is_negative(lower_dentry)) {
/* We want to add because we couldn't find in lower */
d_add(dentry, NULL);
return NULL;
}
inode = __ecryptfs_get_inode(lower_inode, dentry->d_sb);

If lower dentry used to be negative, but went positive while we'd
been doing allocation, we'll get d_really_is_negative() (i.e.
!lower_dentry->d_inode) false, but lower_inode (fetched earlier)
still NULL. __ecryptfs_get_inode() starts with
if (lower_inode->i_sb != ecryptfs_superblock_to_lower(sb))
return ERR_PTR(-EXDEV);
which won't be happy in that situation... That has nothing to do
with barriers, ->d_flags, etc. - the window is rather wide here.
GFP_KERNEL allocation can block just fine.

IOW, the only caller of lookup_one_len_unlocked() that does not
reject negative dentries doesn't manage to handle them correctly ;-/

And then in the same ecryptfs_lookup_interpose() we have e.g.
fsstack_copy_attr_atime(d_inode(dentry->d_parent),
d_inode(lower_dentry->d_parent));
Now, dentry->d_parent is stable; dentry is guaranteed to be new
and not yet visible to anybody else, besides it's negative and
the parent is held shared, so it couldn't have been moved around
even if it had been seen by somebody else.

However, lower_dentry->d_parent is a different story. We are not holding
the lock on its parent anymore; it could've been moved around by somebody
mounting the underlying layer elsewhere and accessing it directly.
Moreover, there's nothing to guarantee that the pointer we fetch from
lower_dentry->d_parent won't be pointing to freed memory by the
time we get around to looking at its inode - lose the timeslice to
preemption just after fetching ->d_parent, have another process move
the damn thing around and there's nothing to keep the ex-parent
around by the time you regain CPU.

The problem goes all way back to addd65ad8d19 "eCryptfs: Filename Encryption:
filldir, lookup, and readlink" from 2009. That turned
lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
into
lower_dir_dentry = lower_dentry->d_parent;
and it had hit the fan...

Sure, "somebody mounted the underlying fs elsewhere and is actively
trying to screw us over" is not how ecryptfs is supposed to be used
and it can demonstrate unexpected behavior - odd errors, etc.
But that behaviour should not include oopsen and access to freed
kernel memory...