Re: KASAN: use-after-free Read in path_lookupat

From: Al Viro
Date: Mon Mar 25 2019 - 15:43:40 EST


On Mon, Mar 25, 2019 at 11:36:01AM -0700, Linus Torvalds wrote:
> Right. Not just move the existing destroy_inode() - because as you
> say, people may not be able to to do that in RCU contect, but split it
> up, and add a "final_free_inode()" callback or something for the RCU
> phase.
>
> In fact, I suspect that *every* user of destroy_inode() already has to
> have its own RCU callback to another final freeing function anyway.

Nope - pipes do not, and for a good reason.

> Because they really shouldn't free the inode itself early. Maybe we
> can just make that be a generic thing?

Maybe... OTOH, we already have more methods on the destruction end
than I'm comfortable trying to document. Because we clearly *do*
need a clear set of rules along the lines of "this kind of thing belongs
in this method, this - in that one".

As it is, on the way to inode destruction we have

1) [kinda] ->drop_inode() - decide whether this inode is not worth
keeping around in icache once the refcount reaches zero. Predicate,
shouldn't change inode state at all. Common instances:
* default (encoded as NULL, available as generic_drop_inode()) -
"keep it around if it's still hashed and link count is non-zero".
* generic_delete_inode(): "don't retain that sucker"

However, 3 instances are more or less weird - f2fs, gfs and ocfs2.
gfs2 one is the least odd of those, but the other two...
What the hell is forced writeback doing in ocfs2_drop_inode()?
If they don't want to retain anything, fine, but then why do
the
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
write_inode_now(inode, 1);
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
dance in ->drop_inode()? It will be immediately followed by
->evict_inode(), and it feels like the damn thing would be
a lot more natural there... And what, for pity sake, f2fs is
doing with truncation in that predicate, of all places? The
comment in there is
/*
* This is to avoid a deadlock condition like below.
* writeback_single_inode(inode)
* - f2fs_write_data_page
* - f2fs_gc -> iput -> evict
* - inode_wait_for_writeback(inode)
*/
which looks... uninspiring, to put it mildly.

2) ->evict_inode() - called when we kick the inode out. Freeing
on-disk inodes, etc. belongs there. inode is still in icache
hash chain at that point, so any icache lookups for it will block
until that thing is done. If we have something non-trivial done
by iget... test() callbacks, we must keep the data structures needed
by those.

3) ->destroy_inode() - destructor. By that point all remaining
references to inode are (stale) RCU ones. The stuff that might
be reached via those has to have an RCU delay between the call
of ->destroy_inode() and freeing. Very commonly that's done
by call_rcu(), and more often than not it's the only thing in
->destroy_inode(). However, if we know that there'll be no
RCU accessors, we can do freeing immediately - pipes do just
that.

And the above is piss-poor as documentation goes - it doesn't
answer the "where should this go?" any better than "try to
see what similar filesystems are doing", which is asking for
cargo-culting ;-/