Re: NULL pointer dereference when access /proc/net

From: haosdent
Date: Mon May 03 2021 - 11:32:06 EST


Hi, Alexander, thanks a lot for your detailed explanations. I take a
look at the code again and the thread and I realize there are some
incorrect representations in my previous word.

>> struct inode *d_inode = 0x0 <======= d_inode is NULL and cause Oops!

Actually it is Oops at `inode->i_flags` directly instead of `d_inode`,
so the code would not further to change the access time.

```
bool __atime_needs_update(const struct path *path, struct inode *inode,
bool rcu)
{
struct vfsmount *mnt = path->mnt;
struct timespec now;

if (inode->i_flags & S_NOATIME) <======= Oops at here according to
`[19521409.372016] IP: __atime_needs_update+0x5/0x190`.
return false;
```

But it looks impossible if we take a look at "walk_component > lookup_fast".

Let me introduce why it goes through "walk_component > lookup_fast"
instead of "walk_component > lookup_slow" first,

in "walk_component > step_into > pick_link", the code would set
`nameidata->stack->seq` to `seq` which is comes from the passed in
parameters.
If the code goes through "walk_component > lookup_slow", `seq` would
be 0 and then `nameidata->stack->seq` would be 0.
If the code goes through "walk_component > lookup_slow", `seq` would
be `dentry->d_seq->sequence`.
According to the contents in attachment files "nameidata.txt" and
"dentry.txt", `dentry->d_seq->sequence` is 4, and
`nameidata->stack->seq` is 4 as well.
So looks like the code goes through "walk_component > lookup_fast" and
"walk_component > step_into > pick_link".

The `inode` parameter in `__atime_needs_update` comes from
`nameidata->link_inode`. But in attachment file "nameidata.txt", we
could found `nameidata->link_inode` is NULL already.
Because the code goes through "walk_component > lookup_fast" and
"walk_component > step_into > pick_link", the `inode` assign to
`nameidata->link_inode` must comes from `lookup_fast`.

So it looks like something wrong in `lookup_fast`. Let me continue to
explain why this looks impossible.

In `walk_component`, `lookup_fast` have to return 1 (> 0), otherwise
it would fallback to `lookup_slow`.

```
err = lookup_fast(nd, &path, &inode, &seq);
if (unlikely(err <= 0)) {
if (err < 0)
return err;
path.dentry = lookup_slow(&nd->last, nd->path.dentry,
nd->flags);
}

return step_into
```

Because for our case, it looks like the code goes through
"walk_component > lookup_fast" and "walk_component > step_into >
pick_link",
This infers `lookup_fast` return 1 in this Oops.

Because `lookup_fast` return 1, it looks like the code goes through
the following path.

```
if (nd->flags & LOOKUP_RCU) {
...

*inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);

...
...
if (negative)
return -ENOENT;
path->mnt = mnt;
path->dentry = dentry;
if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
return 1;
```

As we see, if `*inode` is NULL, it should return `-ENOENT` because `if
(negative)` would be true, which is conflict with "`lookup_fast`
return 1".

And in `__d_clear_type_and_inode`, it always sets the dentry to
negative first and then sets d_inode to NULL.

```
static inline void __d_clear_type_and_inode(struct dentry *dentry)
{`
unsigned flags = READ_ONCE(dentry->d_flags);

flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); // Set dentry to
negative first.
WRITE_ONCE(dentry->d_flags, flags);
// memory barrier
dentry->d_inode = NULL;
// Then set d_inode to NULL.
}
```

So looks like `inode` in `lookup_fast` should not be NULL if it could
skip `if (negative)` check even in the RCU case. Unless

```
# in lookup_fast method
*inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);
```

is reorder to

```
# in lookup_fast method
negative = d_is_negative(dentry);
*inode = d_backing_inode(dentry);
```

when CPU executing the code. But is this possible in RCU?

I diff my local ubuntu's code with upstream tag v4.15.18, it looks
like no different in `fs/namei.c`, `fs/dcache.c`, `fs/proc`.
So possible the problem may happen to upstream tag v4.15.18 as well,
sadly my script still could not reproduce the issue on the server so
far,
would like to see if any insights from you then I could continue to
check what's wrong in this Oops, thank you in advance!

On Tue, Apr 27, 2021 at 1:30 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Apr 27, 2021 at 01:16:44AM +0800, haosdent wrote:
> > > really should not assume ->d_inode stable
> >
> > Hi, Alexander, sorry to disturb you again. Today I try to check what
> > `dentry->d_inode` and `nd->link_inode` looks like when `dentry` is
> > already been killed in `__dentry_kill`.
> >
> > ```
> > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > ```
> >
> > It looks like `dentry->d_inode` could be NULL while `nd->link_inode`
> > is always has value.
> > But this make me confuse, by right `nd->link_inode` is get from
> > `dentry->d_inode`, right?
>
> It's sampled from there, yes. And in RCU mode there's nothing to
> prevent a previously positive dentry from getting negative and/or
> killed. ->link_inode (used to - it's gone these days) go with
> ->seq, which had been sampled from dentry->d_seq before fetching
> ->d_inode and then verified to have ->d_seq remain unchanged.
> That gives you "dentry used to have this inode at the time it
> had this d_seq", and that's what gets used to validate the sucker
> when we switch to non-RCU mode (look at legitimize_links()).
>
> IOW, we know that
> * at some point during the pathwalk that sucker had this inode
> * the inode won't get freed until we drop out of RCU mode
> * if we need to go to non-RCU (and thus grab dentry references)
> while we still need that inode, we will verify that nothing has happened
> to that link (same ->d_seq, so it still refers to the same inode) and
> grab dentry reference, making sure it won't go away or become negative
> under us. Or we'll fail (in case something _has_ happened to dentry)
> and repeat the entire thing in non-RCU mode.



--
Best Regards,
Haosdent Huang