Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

From: Al Viro
Date: Mon Mar 21 2022 - 22:41:02 EST


On Mon, Mar 21, 2022 at 09:20:06AM -1000, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 21, 2022 at 05:55:53PM +0000, Al Viro wrote:
> > Why bother with rwsem, when we don't need anything blocking under it?
> > DEFINE_RWLOCK instead of DEFINE_SPINLOCK and don't make it static.
>
> Oh I mean, in case the common readers get way too hot, percpu_rwsem is a
> relatively easy way to shift the burder from the readers to the writers. I
> doubt we'll need that.
>
> > kernfs_walk_ns() - this is fucking insane; on the surface, it needs to
> > be exclusive due to the use of the same static buffer. It uses that
> > buffer to generate a pathname, *THEN* walks over it with strsep().
> > That's an... interesting approach, for the lack of other printable
> > terms - we walk the chain of ancestors, concatenating their names
> > into a buffer and separating those names with slashes, then we walk
> > that buffer, searching for slashes... WTF?
>
> It takes the @parent to walk string @path from. Where does it generate the
> pathname?

Sorry, misread that thing - the reason it copies the damn thing at all is
the use of strsep(). Yecch... Rule of the thumb regarding strsep() use,
be it in kernel or in the userland: don't. It's almost never the right
primitive to use.

Lookups should use qstr; it has both the length and place for hash.
Switch kernfs_find_ns() to that (and lift the calculation of length
into the callers that do not have it - note that kernfs_iop_lookup()
does) and you don't need the strsep() shite (or copying) anymore.

That would allow for kernfs_walk_ns() to take kernfs_rename_lock shared.

HOWEVER, that's not the only lock needed there and this patchset is
broken in that respect - it locks the starting node, then walks the
path. Complete with lookups in rbtrees of children in the descendents
of that node and those are *not* locked.

> > Wait a sec; what happens if e.g. kernfs_path_from_node() races with
> > __kernfs_remove()? We do _not_ clear ->parent, but we do drop references
> > that used to pin what it used to point to, unless I'm misreading that
> > code... Or is it somehow prevented by drain-related logics? Seeing
> > that it seems to be possible to have kernfs_path_from_node() called from
> > an interrupt context, that could be delicate...
>
> kernfs_remove() is akin to freeing of the node and all its descendants. The
> caller shouldn't be racing that against any other operations in the subtree.

That's interesting... My impression had been that some of these functions
could be called from interrupt contexts (judging by the spin_lock_irqsave()
in there). What kind of async contexts those are, and what do you use to
make sure they don't leak into overlap with kernfs_remove()?

In particular, if you ever use those from RCU callbacks, you need to make
sure that you have a grace period somewhere; the wait_event() you have in
kernfs_drain() won't do it.

I've tried to track the callchains that could lead there, but it gets
hairy fast.