Re: [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount

From: Ian Kent
Date: Thu Apr 10 2025 - 10:07:55 EST



On 10/4/25 00:04, Christian Brauner wrote:
On Wed, Apr 09, 2025 at 04:25:10PM +0200, Sebastian Andrzej Siewior wrote:
On 2025-04-09 16:02:29 [+0200], Mateusz Guzik wrote:
On Wed, Apr 09, 2025 at 03:14:44PM +0200, Sebastian Andrzej Siewior wrote:
One question: do we need this lazy/MNT_DETACH case? Couldn't we handle
them all via queue_rcu_work()?
If so, couldn't we make deferred_free_mounts global and have two
release lists, say release_list and release_list_next_gp? The first one
will be used if queue_rcu_work() returns true, otherwise the second.
Then, once defer_free_mounts() is done and release_list_next_gp is not
empty, it would move release_list_next_gp -> release_list and invoke
queue_rcu_work().
This would avoid the kmalloc, the synchronize_rcu_expedited() and the
special sauce.
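
Roughly something like the below, as a completely untested sketch; the
names are adapted from the patch or simply made up, and free_mounts()
stands in for the existing mntput() loop. INIT_RCU_WORK() on
deferred_free_mounts is assumed to happen once at init time.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

static LIST_HEAD(release_list);		/* freed after the pending grace period */
static LIST_HEAD(release_list_next_gp);	/* has to wait for a later grace period */
static DEFINE_SPINLOCK(release_lock);
static struct rcu_work deferred_free_mounts;

static void defer_free_mounts(struct work_struct *work)
{
	LIST_HEAD(list);
	bool requeue;

	spin_lock(&release_lock);
	list_splice_init(&release_list, &list);
	/* Anything parked on the second list now waits for the next GP. */
	requeue = !list_empty(&release_list_next_gp);
	list_splice_init(&release_list_next_gp, &release_list);
	spin_unlock(&release_lock);

	free_mounts(&list);	/* the existing mntput() loop */

	if (requeue)
		queue_rcu_work(system_wq, &deferred_free_mounts);
}

static void queue_mounts_for_release(struct list_head *mounts)
{
	spin_lock(&release_lock);
	/*
	 * queue_rcu_work() returns false if the work is already pending,
	 * i.e. a grace period may already be underway for release_list;
	 * anything added now has to wait for a later grace period.
	 */
	if (queue_rcu_work(system_wq, &deferred_free_mounts))
		list_splice_tail_init(mounts, &release_list);
	else
		list_splice_tail_init(mounts, &release_list_next_gp);
	spin_unlock(&release_lock);
}
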

To my understanding it was preferred for non-lazy unmount consumers to
wait until the final mntput() before returning.

That aside, if I understood your approach correctly, it would de facto
serialize all of these?

As in, with the posted patches you can have different worker threads
make progress in parallel, as they all get a private list to iterate.

With your proposal only one of them can do any work at a time.

One has to assume that with sufficient mount/unmount traffic this can
eventually get into trouble.

Right, it would serialize them within the same worker thread. With one
worker for each put you would schedule multiple workers from the RCU
callback. Given system_wq, you will schedule them all on the CPU which
invokes the RCU callback. This kind of serializes it, too.

The mntput() callback uses a spinlock_t for locking and then it frees
resources. It does not look like it waits for anything or takes ages.
So it might not be necessary to split each put into its own worker on a
different CPU… One busy bee might be enough ;)
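
For reference, queue_rcu_work() is roughly the below (abridged from
kernel/workqueue.c; details such as the call_rcu() flavour differ
between versions). With a per-CPU workqueue like system_wq the work is
queued on whichever CPU happens to run the RCU callback:

/* Abridged from kernel/workqueue.c; newer kernels use a call_rcu()
 * variant, and comments are mine. */
static void rcu_work_rcufn(struct rcu_head *rcu)
{
	struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);

	/* Runs in the RCU callback, i.e. on the CPU handling that
	 * callback; for a per-CPU workqueue such as system_wq the work
	 * is therefore queued on this local CPU. */
	local_irq_disable();
	__queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
	local_irq_enable();
}

bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
{
	struct work_struct *work = &rwork->work;

	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		rwork->wq = wq;
		call_rcu(&rwork->rcu, rcu_work_rcufn);
		return true;
	}

	return false;
}
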
Unmounting can trigger a very large number of mounts to be unmounted. If
you're on a container-heavy system, or run services that all propagate
to each other in different mount namespaces, mount propagation will
generate a ton of umounts. So this cannot be underestimated.

If a mount tree is unmounted without MNT_DETACH it will pass UMOUNT_SYNC
to umount_tree(). That'll cause MNT_SYNC_UMOUNT to be raised on all
mounts during the unmount.

If a concurrent path lookup calls legitimize_mnt() on such a mount and
sees that MNT_SYNC_UMOUNT is set, it backs off, as it knows that the
concurrent unmounter holds the last reference; __legitimize_mnt() can
thus simply drop the reference count. The final mntput() will be done by
the umounter.
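
For reference, the relevant piece of __legitimize_mnt() looks roughly
like this (abridged from fs/namespace.c; exact code differs between
kernel versions):

	mnt_add_count(mnt, 1);
	smp_mb();			/* see mntput_no_expire() */
	if (likely(!read_seqretry(&mount_lock, seq)))
		return 0;		/* mount is still good, keep the ref */
	if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
		/* The umounter holds the last reference and will do the
		 * final mntput(); just undo our own increment and bail. */
		mnt_add_count(mnt, -1);
		return 1;
	}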

In umount_tree() it looks like the unmounted mount remains hashed (i.e.
disconnect_mount() returns false), so can't it still race with an
rcu-walk regardless of the synchronize_rcu()?
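
For reference, the check I have in mind, abridged from disconnect_mount()
in fs/namespace.c (I may well be misreading it):

static bool disconnect_mount(struct mount *mnt, enum umount_tree_flags how)
{
	/* Leaving mounts connected is only valid for lazy umounts */
	if (how & UMOUNT_SYNC)
		return true;

	/* ... parent checks elided ... */

	/* Has it been requested that the mount remain connected? */
	if (how & UMOUNT_CONNECTED)
		return false;

	/* Is the mount locked such that it needs to remain connected? */
	if (IS_MNT_LOCKED(mnt))
		return false;

	/* By default disconnect the mount */
	return true;
}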


Surely I'm missing something ...


Ian


The synchronize_rcu() call in namespace_unlock() takes care that the
last mntput() doesn't happen until path walking has dropped out of RCU
mode.
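
For reference, the tail of namespace_unlock() looks roughly like this
(abridged; current mainline actually uses the expedited variant):

	if (likely(hlist_empty(&head)))
		return;

	/* Wait for rcu-walk path lookups to drop out of RCU mode ... */
	synchronize_rcu_expedited();

	/* ... before issuing what may be the final mntput()s. */
	hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
		hlist_del(&m->mnt_umount);
		mntput(&m->mnt);
	}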

Without it, it's possible that a non-MNT_DETACH umounter gets a spurious
EBUSY error because a concurrent lazy path walk will suddenly put the
last reference via mntput().

I'm unclear how that's handled in whatever it is you're proposing.