Re: NULL pointer dereference when access /proc/net
From: haosdent
Date: Thu May 06 2021 - 06:21:55 EST
Oh, Al, I saw you mentioned the reoder issue at
https://groups.google.com/g/syzkaller/c/0SW33jMcrXQ/m/lHJUsWHVBwAJ
and with a follow-up patch
https://gist.githubusercontent.com/dvyukov/67fe363d5ce2e2b06c71/raw/4d1b6c23f8dff7e0f8e2e3cab7e50208fddb0570/gistfile1.txt
However, it looks it would not work in some previous version
https://github.com/torvalds/linux/blob/v4.16/fs/dcache.c#L496
Because we set `d_hahs.prev` to `NULL` at
```
void __d_drop(struct dentry *dentry)
{
___d_drop(dentry);
dentry->d_hash.pprev = NULL;
}
```
then in `dentry_unlink_inode`, `raw_write_seqcount_begin` would be
skipped due to `if (hashed)` condition is false.
```
static void dentry_unlink_inode(struct dentry * dentry)
__releases(dentry->d_lock)
__releases(dentry->d_inode->i_lock)
{
struct inode *inode = dentry->d_inode;
bool hashed = !d_unhashed(dentry);
if (hashed)
raw_write_seqcount_begin(&dentry->d_seq); <--- Looks would
skip because hashed is false.
__d_clear_type_and_inode(dentry);
hlist_del_init(&dentry->d_u.d_alias);
if (hashed)
raw_write_seqcount_end(&dentry->d_seq); <--- Looks would
skip because hashed is false.
...
```
Should we backport
https://github.com/torvalds/linux/commit/4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab
and https://github.com/torvalds/linux/commit/0632a9ac7bc0a32f8251a53b3925775f0a7c4da6
to previous versions?
On Mon, May 3, 2021 at 11:31 PM haosdent <haosdent@xxxxxxxxx> wrote:
>
> Hi, Alexander, thanks a lot for your detailed explanations. I take a
> look at the code again and the thread and I realize there are some
> incorrect representations in my previous word.
>
> >> struct inode *d_inode = 0x0 <======= d_inode is NULL and cause Oops!
>
> Actually it is Oops at `inode->i_flags` directly instead of `d_inode`,
> so the code would not further to change the access time.
>
> ```
> bool __atime_needs_update(const struct path *path, struct inode *inode,
> bool rcu)
> {
> struct vfsmount *mnt = path->mnt;
> struct timespec now;
>
> if (inode->i_flags & S_NOATIME) <======= Oops at here according to
> `[19521409.372016] IP: __atime_needs_update+0x5/0x190`.
> return false;
> ```
>
> But it looks impossible if we take a look at "walk_component > lookup_fast".
>
> Let me introduce why it goes through "walk_component > lookup_fast"
> instead of "walk_component > lookup_slow" first,
>
> in "walk_component > step_into > pick_link", the code would set
> `nameidata->stack->seq` to `seq` which is comes from the passed in
> parameters.
> If the code goes through "walk_component > lookup_slow", `seq` would
> be 0 and then `nameidata->stack->seq` would be 0.
> If the code goes through "walk_component > lookup_slow", `seq` would
> be `dentry->d_seq->sequence`.
> According to the contents in attachment files "nameidata.txt" and
> "dentry.txt", `dentry->d_seq->sequence` is 4, and
> `nameidata->stack->seq` is 4 as well.
> So looks like the code goes through "walk_component > lookup_fast" and
> "walk_component > step_into > pick_link".
>
> The `inode` parameter in `__atime_needs_update` comes from
> `nameidata->link_inode`. But in attachment file "nameidata.txt", we
> could found `nameidata->link_inode` is NULL already.
> Because the code goes through "walk_component > lookup_fast" and
> "walk_component > step_into > pick_link", the `inode` assign to
> `nameidata->link_inode` must comes from `lookup_fast`.
>
> So it looks like something wrong in `lookup_fast`. Let me continue to
> explain why this looks impossible.
>
> In `walk_component`, `lookup_fast` have to return 1 (> 0), otherwise
> it would fallback to `lookup_slow`.
>
> ```
> err = lookup_fast(nd, &path, &inode, &seq);
> if (unlikely(err <= 0)) {
> if (err < 0)
> return err;
> path.dentry = lookup_slow(&nd->last, nd->path.dentry,
> nd->flags);
> }
>
> return step_into
> ```
>
> Because for our case, it looks like the code goes through
> "walk_component > lookup_fast" and "walk_component > step_into >
> pick_link",
> This infers `lookup_fast` return 1 in this Oops.
>
> Because `lookup_fast` return 1, it looks like the code goes through
> the following path.
>
> ```
> if (nd->flags & LOOKUP_RCU) {
> ...
>
> *inode = d_backing_inode(dentry);
> negative = d_is_negative(dentry);
>
> ...
> ...
> if (negative)
> return -ENOENT;
> path->mnt = mnt;
> path->dentry = dentry;
> if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
> return 1;
> ```
>
> As we see, if `*inode` is NULL, it should return `-ENOENT` because `if
> (negative)` would be true, which is conflict with "`lookup_fast`
> return 1".
>
> And in `__d_clear_type_and_inode`, it always sets the dentry to
> negative first and then sets d_inode to NULL.
>
> ```
> static inline void __d_clear_type_and_inode(struct dentry *dentry)
> {`
> unsigned flags = READ_ONCE(dentry->d_flags);
>
> flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); // Set dentry to
> negative first.
> WRITE_ONCE(dentry->d_flags, flags);
> // memory barrier
> dentry->d_inode = NULL;
> // Then set d_inode to NULL.
> }
> ```
>
> So looks like `inode` in `lookup_fast` should not be NULL if it could
> skip `if (negative)` check even in the RCU case. Unless
>
> ```
> # in lookup_fast method
> *inode = d_backing_inode(dentry);
> negative = d_is_negative(dentry);
> ```
>
> is reorder to
>
> ```
> # in lookup_fast method
> negative = d_is_negative(dentry);
> *inode = d_backing_inode(dentry);
> ```
>
> when CPU executing the code. But is this possible in RCU?
>
> I diff my local ubuntu's code with upstream tag v4.15.18, it looks
> like no different in `fs/namei.c`, `fs/dcache.c`, `fs/proc`.
> So possible the problem may happen to upstream tag v4.15.18 as well,
> sadly my script still could not reproduce the issue on the server so
> far,
> would like to see if any insights from you then I could continue to
> check what's wrong in this Oops, thank you in advance!
>
> On Tue, Apr 27, 2021 at 1:30 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Apr 27, 2021 at 01:16:44AM +0800, haosdent wrote:
> > > > really should not assume ->d_inode stable
> > >
> > > Hi, Alexander, sorry to disturb you again. Today I try to check what
> > > `dentry->d_inode` and `nd->link_inode` looks like when `dentry` is
> > > already been killed in `__dentry_kill`.
> > >
> > > ```
> > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128,
> > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966
> > > ```
> > >
> > > It looks like `dentry->d_inode` could be NULL while `nd->link_inode`
> > > is always has value.
> > > But this make me confuse, by right `nd->link_inode` is get from
> > > `dentry->d_inode`, right?
> >
> > It's sampled from there, yes. And in RCU mode there's nothing to
> > prevent a previously positive dentry from getting negative and/or
> > killed. ->link_inode (used to - it's gone these days) go with
> > ->seq, which had been sampled from dentry->d_seq before fetching
> > ->d_inode and then verified to have ->d_seq remain unchanged.
> > That gives you "dentry used to have this inode at the time it
> > had this d_seq", and that's what gets used to validate the sucker
> > when we switch to non-RCU mode (look at legitimize_links()).
> >
> > IOW, we know that
> > * at some point during the pathwalk that sucker had this inode
> > * the inode won't get freed until we drop out of RCU mode
> > * if we need to go to non-RCU (and thus grab dentry references)
> > while we still need that inode, we will verify that nothing has happened
> > to that link (same ->d_seq, so it still refers to the same inode) and
> > grab dentry reference, making sure it won't go away or become negative
> > under us. Or we'll fail (in case something _has_ happened to dentry)
> > and repeat the entire thing in non-RCU mode.
>
>
>
> --
> Best Regards,
> Haosdent Huang
--
Best Regards,
Haosdent Huang