Re: [PATCH 12/14] d_path: prepend_path(): lift the inner loop into a new helper

From: Al Viro
Date: Wed May 19 2021 - 11:56:22 EST


On Wed, May 19, 2021 at 11:07:04AM +0300, Andy Shevchenko wrote:
> On Wed, May 19, 2021 at 12:48:59AM +0000, Al Viro wrote:
> > ... and leave the rename_lock/mount_lock handling in prepend_path()
> > itself
>
> ...
>
> > + if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
> > + return 1; // absolute root
> > + else
> > + return 2; // detached or not attached yet
>
> Would it be slightly better to read
>
> if (IS_ERR_OR_NULL(mnt_ns) || is_anon_ns(mnt_ns))
> return 2; // detached or not attached yet
> else
> return 1; // absolute root
>
> ?
>
> Oh, I have noticed that it's in the original piece of code (perhaps separate
> change if we ever need it?).

The real readability problem here is not the negations. There are 4 possible
states for vfsmount encoded via ->mnt_ns:
1) not attached to any tree, kept alive by refcount alone.
->mnt_ns == NULL.
2) long-term unattached. Not a part of any mount tree, but we have
a known holder for it and until that's gone (making ->mnt_ns NULL), refcount
is guaranteed to remain positive. pipe_mnt is an example of such.
->mnt_ns == MNT_NS_INTERNAL, which is encoded as ERR_PTR(-1), thus the use of
IS_ERR_OR_NULL here (something I'd normally taken out and shot - use of that
primitive is a sign of lousy API or of a cargo-culted "defensive programming").
3) part of a temporary mount tree; not in anyone's namespace.
->mnt_ns points the tree in question, ->mnt_ns->seq == 0.
4) belongs to someone's namespace. ->mnt_ns points to that,
->mnt_ns->seq != 0. That's what we are looking for here.

It's kludges all the way down ;-/ Note that temporary tree can't become
a normal one or vice versa - mounts can get transferred to normal namespace,
but they will see ->mnt_ns reassigned to that. IOW, ->mnt_ns->seq can't
get changed without a change to ->mnt_ns. I suspect that the right way
to handle that would be to have that state stored as explicit flags.

All mounts are created (and destroyed) in state (1); state changes:
commit_tree() - (1) or (3) to (3) or (4)
umount_tree() - (3) or (4) to (1)
clone_private_mount() - (1) to (2)
open_detached_copy() - (1) to (3)
copy_mnt_ns() - (1) to (4)
mount_subtree() - (1) to (3)
fsmount() - (1) to (3)
init_mount_tree() - (1) to (4)
kern_mount() - (1) to (2)
kern_unmount{,_array}() - (2) to (1)

commit_tree() has a pathological call chain that has it
attach stuff to temporary tree; that's basically automount by lookup in
temporary namespace. It can distinguish it from the usual (adding to
normal namespace) by looking at the state of mountpoint we are attaching
to - or simply describe all cases as "(1) or (3) to whatever state the
mountpoint is".

One really hot path where we check (1) vs. (2,3,4) is
mntput_no_expire(), which is the initial reason behind the current
representation. However, read from ->mnt_flags is just as cheap as
that from ->mnt_ns and the same reasons that make READ_ONCE()
legitimate there would apply to ->mnt_flags as well.

We can't reuse MNT_INTERNAL for that, more's the pity -
it's used to mark the mounts (kern_mount()-created, mostly) that
need to destroyed synchronously on the final mntput(), with no
task_work_add() allowed (think of module_init() failing halfway through,
with kern_unmount() done to destroy the internal mounts already created;
we *really* don't want to delay that filesystem shutdown until insmod(2)
heads out to userland). Another headache is in LSM shite, as usual...

Anyway, sorting that out is definitely a separate story.