Re: Orphan filesystems after mount namespace destruction and tmpfs "leak"

From: Al Viro

Date: Mon Feb 02 2026 - 13:43:27 EST

On Mon, Feb 02, 2026 at 05:50:30PM +0000, Kiryl Shutsemau wrote:

> In the Meta fleet, we saw a problem where destroying a container didn't
> lead to freeing the shmem memory attributed to a tmpfs mounted inside
> that container. It triggered an OOM when a new container attempted to
> start.
>
> Investigation has shown that this happened because a process outside of
> the container kept a file from the tmpfs mapped. The mapped file is
> small (4k), but it keeps all the contents of the tmpfs (~47GiB) from
> being freed.
>
> When a tmpfs filesystem is mounted inside a mount namespace (e.g., a
> container), and a process outside that namespace holds an open file
> descriptor to a file on that tmpfs, the tmpfs superblock remains in
> kernel memory indefinitely after:
>
> 1. All processes inside the mount namespace have exited.
> 2. The mount namespace has been destroyed.
> 3. The tmpfs is no longer visible in any mount namespace.

Yes? That's precisely what should happen as long as something is open
on that filesystem.

> The superblock persists with mnt_ns = NULL in its mount structures,
> keeping all tmpfs contents pinned in memory until the external file
> descriptor is closed.

Yes.

> The problem is not specific to tmpfs, but for filesystems with backing
> storage, the memory impact is not as severe since the page cache is
> reclaimable.
>
> The obvious solution to the problem is "Don't do that": the file should
> be unmapped/closed upon container destruction.

Or remove the junk there from time to time, if you don't want it to stay
until the filesystem shutdown...

> But I wonder if the kernel can/should do better here? Currently, this
> scenario is hard to diagnose. It looks like a leak of shmem pages.
>
> Also, I wonder if the current behavior can lead to data loss on a
> filesystem with backing storage:
> - The mount namespace where my USB stick was mounted is gone.
> - The USB stick is no longer mounted anywhere.
> - I can pull the USB stick out.
> - Oops, someone was writing there: corruption/data loss.
>
> I am not sure what a possible solution would be here. I can only think
> of blocking exit(2) for the last process in the namespace until all
> filesystems are cleanly unmounted, but that is not very informative
> either.

That's insane - if nothing else, the process that holds the sucker
opened may very well be waiting for the one you've blocked.

You are getting exactly what you asked for - same as you would on
lazy umount, for that matter.

A filesystem may be active without being attached to any namespace;
that's intentional behaviour. What's more, it _is_ visible to
ustat(2), as well as to lsof(1) and similar userland tools when an
open file is keeping it busy.
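
For illustration only, here is a rough sketch of the kind of check such userland tools can make. It is not taken from lsof(1) or from anything in this thread. It assumes a Linux /proc, and the function names (parse_mountinfo_devs, parse_maps_line, report) are made up for the example. The idea is to collect the device numbers listed in /proc/self/mountinfo and then flag file-backed mappings in /proc/<pid>/maps whose device is not among them.

```python
import os


def parse_mountinfo_devs(text):
    """Return the set of (major, minor) pairs listed in mountinfo text."""
    devs = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # Third field of a mountinfo line is "major:minor" in decimal.
        major, minor = fields[2].split(":")
        devs.add((int(major), int(minor)))
    return devs


def parse_maps_line(line):
    """Return ((major, minor), inode, path) for a file-backed mapping, else None."""
    fields = line.split(None, 5)
    if len(fields) < 5:
        return None
    inode = int(fields[4])
    if inode == 0:
        # Anonymous memory, [heap], [stack] and friends have inode 0.
        return None
    # Fourth field of a maps line is "major:minor" in hexadecimal.
    major, minor = (int(x, 16) for x in fields[3].split(":"))
    path = fields[5].rstrip() if len(fields) > 5 else ""
    return (major, minor), inode, path


def report(proc="/proc"):
    """List (pid, dev, inode, path) for mappings on devices not mounted here."""
    with open(os.path.join(proc, "self", "mountinfo")) as f:
        mounted = parse_mountinfo_devs(f.read())
    hits = []
    for pid in filter(str.isdigit, os.listdir(proc)):
        try:
            with open(os.path.join(proc, pid, "maps")) as f:
                lines = f.readlines()
        except OSError:
            # Process exited, or we lack permission to read its maps.
            continue
        for line in lines:
            parsed = parse_maps_line(line)
            if parsed and parsed[0] not in mounted:
                hits.append((int(pid),) + parsed)
    return hits
```

Calling report() as root from the host namespace returns candidates rather than a verdict. Shared anonymous memory and memfd mappings live on an internal shmem instance that never appears in mountinfo, and some filesystems report per-subvolume device numbers, so expect false positives. A filesystem mounted only in some other, still-live namespace will also be flagged.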