Re: [RFC][PATCHSET] non-recursive pathname resolution

From: Al Viro
Date: Tue May 05 2015 - 19:59:58 EST


On Tue, May 05, 2015 at 08:20:16AM -0700, Linus Torvalds wrote:
> On Mon, May 4, 2015 at 10:22 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > My apologies for the patchbomb from hell, but I really think
> > it needs a public review.
>
> Aside from the small nits I noted (and I guess I can even live with
> the bad label name as long as it never gets resurrected) it looks
> good. But I'd obviously like it to get as much testing/beating on as
> possible before merging: I'm assuming it will have gone at least
> through xfstests etc..

It is passing xfstests and LTP, plus some basic "create a twisted forest
of symlinks and walk it" tests, but yes, it obviously needs more beating.
I'll push everything up to #76 into -next tonight (with the changes you
asked for).

As for the next steps...
* we want _all_ failure exits from RCU mode to empty the stack
immediately - it will be too late to call ->put_link() once we dropped
rcu_read_lock().
* we want failure exits from complete_walk() to empty the stack,
for the sake of consistency (due to the above it will do so when called in
RCU mode).
* we want to postpone assignment to nd->seq until after we are
sure we haven't found a symlink to follow; I have that in local queue,
will post in the next batch for review.
* we want to have pick_link() store directly into nd->stack[...].link;
no point postponing that to get_link(). nd->link can die at that point.
Done in local queue.
* we want to store inode and seq along with path in nd->stack[...].
It means extra 4 words in struct nameidata (EMBEDDED_LEVELS * 2), i.e. not
too horrible wrt stack footprint. should_follow_link()/pick_link() get
those as additional arguments. Done in local queue.
* we need a variant of legitimize_mnt() that would _not_ leave RCU
mode. What I have in mind is
legitimize_mnt(m)
{
switch (__legitimize_mnt(m, seq)) {
case 0:
return true;
default:
rcu_read_unlock();
mntput(m);
rcu_read_lock();
case 1:
return false;
}
}
with the ability to call __legitimize_mnt(), remember the return value and,
in case of failure, do all pending ->put_link() before dropping rcu_read_lock()
We'll need that for unlazy_walk() et.al. - do this __legitimize_mnt()
on path->nd.mnt and all nd->stack[0..nd->depth - 1].mnt, and if any of them
fail, stop right there, do ->put_link() on everything, drop rcu_read_lock(),
mntput() everything on stack we'd already legitimized, reset nd->depth to 0 and bugger off the way we currently would. If all of __legitimize_mnt() succeed,
we go and acquire dentry references, checking them against nd->seq for
nd->path and nd->stack[i].seq for nd->stack[i].link. Any failure =>
->put_link() for everything on stack, rcu_read_unlock(), dput() everything
we'd managed to grab in the stack, mntput() everything in the stack,
reset nd->depth to 0 and bugger off the way we currently would.
complete_walk() is similar - it starts with could be
if (nd->flags & LOOKUP_RCU) {
if (!(nd->flags & LOOKUP_ROOT))
nd->root.mnt = NULL;
if (unlikely(unlazy_walk(nd, NULL))
return -ECHILD;
}
and perhaps it would make sense to try just that and see if unlazy_walk()
goes visibly higher in profiles... Note that it *can* be called with
non-empty stack and be required to legitimize its (only) element - e.g.
creat() on a dangling symlink will step into that. So we can't just
start with unconditionally emptying the stack.
terminate_walk() is simpler - if we are in RCU mode, just call
->put_link() on everything in stack and reset nd->depth; if we are not,
leave as in this patchset.
* ->put_link() should get inode + cookie instead of dentry + cookie;
sure, right now the only instance that wants more than cookie is in hppfs
and that's scheduled for removal, but we need the inode to find the damn
method anyway, so passing it is no extra burden. Should be RCU-safe, which
is actually true for all instances.
* ->follow_link() should take dentry + inode, with NULL/inode meaning
"we are in RCU mode, return -ECHILD instead of doing anything you can't do
in RCU pathwalk". get_link() should be basically
touch atime (might throw out of RCU mode)
do security_inode_follow_link(), fuck off if it fails
nd->last_type = LAST_BIND;
if (inode->i_link)
return it, we are done
if (in RCU mode) {
link = ...->follow_link(NULL, inode);
if (link == ERR_PTR(-ECHILD))
try to unlazy, return link if that fails
else
return link;
}
link = ...->follow_link(dentry, inode);
then as in this patchset.
* nd_alloc_stack() should do GFP_ATOMIC allocation in RCU mode and
treat a failure as "try to unlazy, fail with ECHILD if can't do that, proceed
to GFP_KERNEL allocation otherwise".
* may_follow_link() should, in case of failure, check if it's in
RCU mode and do terminate_walk()/fail with ECHILD instead of going into
audit crap. Sure, it'll cost us a restart that will almost certainly end
with hard error on the same symlink, but audit_log_link_denied() isn't
usable in RCU mode. We probably could just make it use GFP_ATOMIC
for allocations, but that's a separate story...
* we need set_root_rcu() to store ->d_seq of root in nameidata,
so that traversing an absolute symlink in RCU mode would be able to start
with correct nd->seq - fetch nd->root.dentry->d_inode, check
nd->root.dentry->d_seq against nd->root_seq, fail with -ECHILD on mismatch
and use inode and root_seq for nd->inode / nd->seq otherwise.
* put_link() should, as in Neil's patchset, have path_put() conditional
on non-RCU mode.
* unlazy_walk() is really unpleasant. In addition to the issues
around legitimizing many vfsmounts without being allowed to drop rcu_read_lock
(discussed above), we have three use cases:
1) unlazy_walk(nd, NULL). legitimize nd->path and everything already
in nd->stack, from 0 to nd->depth - 1. On failure do ->put_link() for
everything on stack *before* dropping rcu_read_lock().
2) unlazy_walk() in lookup_fast(). We want to give it correct seq number
(and I already have that done in local tree) and legitimize that dentry +
what case 1 would do.
3) unlazy between deciding it's a symlink to follow and succeeding with
->follow_link() in get_link(). It's this case that really sucks - we
want to legitimize everything we would in case 1 + path/dentry of our link.
And it _can_ have a different vfsmount - we can't just rely on getting
nd->path.mnt being equal to it. It's actually more like case 1 than case 2 -
"legitimize nd->path and everything on stack from 0 to nd->depth, without
->put_link() in case of failure done only from 0 to nd->depth - 1 (as in
case 1)". We probably should just do this:
m = mnt of pending link
res = __legitimize_mnt(m);
if (likely(!res)) {
d = dentry of pending link
try to legitimize d
if failed
res = 2;
}
if (unlikely(res)) {
terminate_walk(nd);
if (res != 1)
mntput(m);
return -ECHILD;
}
res = unlazy_walk(nd, NULL);
if (unlikely(res)) {
dput(d);
mntput(m);
}
return res;

AFAICS, the above should suffice, modulo teaching ->follow_link() instances
to fail with ECHILD when called in RCU mode and then teaching some of them
_not_ to. Note that fast symlinks would be fine without the last part and
that covers most of the actual use...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/