Re: procfs: mnt namespace behaviour with block devices (resend)

From: Stephen Brennan
Date: Mon May 09 2022 - 14:48:46 EST


Hi Craig,

On 5/9/22 03:20, Craig Small wrote:
> (resending as plain text as the first got bounced)
>
> Hi,
> I'm the maintainer of the psmisc package that provides system tools
> for things like fuser and killall. I am trying to establish if
> something I have found with the proc filesystem is as intended
> (knowing why would be nice) or if it's a strange corner-case bug.
>
> Apologies to the non-procfs maintainers but these two lists are what
> MAINTAINER said to go to. If you could CC me on replies that would be
> great.
>
> The proc file descriptor for a block device mounted in a different
> namespace will show the device id of that different namespace and not
> the device id of the process stat()ing the file.
>
> The issue came up in fuser not finding certain processes that were
> directly accessing a block device, see
> https://gitlab.com/psmisc/psmisc/-/issues/39 Programs such as lsof are
> caught by this too.
>
> My question is: When I am in the bash mount namespace (4026531840 below)
> then shouldn't all the device IDs be from that namespace? In other
> words, the device id of the dereferenced symlink and what it points to
> are the same (device id 5) and not symlink has 44 and /dev/dm-8 has 5.

I'm no expert here, but I think this is working as intended.
It's definitely confusing!

Consider a process in a separate mount namespace from the init
namespace, e.g. a container. Say I were to open python in that container
and then do `os.open("/etc/passwd", os.O_RDONLY)`. If I were to then
look at that process's file descriptors (from the host's perspective),
I'd see the following (pid 220854 is the python process in the
container):

$ ls -lah /proc/220854/fd/
total 0
dr-x------ 2 stepbren stepbren 0 May 9 11:06 .
dr-xr-xr-x 9 stepbren stepbren 0 May 9 11:06 ..
lrwx------ 1 stepbren stepbren 64 May 9 11:06 0 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May 9 11:06 1 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May 9 11:06 2 -> /dev/pts/0
lr-x------ 1 stepbren stepbren 64 May 9 11:06 3 -> /etc/passwd

$ cat /proc/220854/fd/3
<contents of container /etc/passwd>

$ cat /etc/passwd
<contents of host /etc/passwd>

$ stat -L /proc/220854/fd/3
File: /proc/220854/fd/3
Size: 900 Blocks: 8 IO Block: 4096 regular file
Device: 4eh/78d Inode: 5508982 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-10-27 10:24:28.000000000 -0700
Modify: 2020-10-27 10:24:28.000000000 -0700
Change: 2020-10-27 10:24:30.255374190 -0700
Birth: 2020-10-27 10:24:30.255374190 -0700

$ stat /etc/passwd
File: /etc/passwd
Size: 3216 Blocks: 8 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 24917416 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-05-08 15:06:18.837117765 -0700
Modify: 2021-11-30 09:08:45.163873193 -0800
Change: 2021-11-30 09:08:45.167873237 -0800
Birth: 2021-11-30 09:08:45.163873193 -0800

## INSIDE CONTAINER'S MOUNT NAMESPACE
$ stat /etc/passwd
File: /etc/passwd
Size: 900 Blocks: 8 IO Block: 4096 regular file
Device: 4eh/78d Inode: 5508982 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-10-27 17:24:28.000000000 +0000
Modify: 2020-10-27 17:24:28.000000000 +0000
Change: 2020-10-27 17:24:30.255374190 +0000
Birth: -

As you can see, it's the same behavior: the path /etc/passwd resolves to
a different inode in the init mount namespace compared to the
container's mount namespace. The secret sauce of the /proc/$pid/fd/$fd
files is that they don't behave like a normal symlink: instead of using
the file path to look up the target inode, they directly look up the
file and inode in the target process's file descriptor table.
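
To make that concrete, here is a minimal Python sketch. It is my own
illustration, not code from psmisc or the kernel, and fd_identity() and
demo() are hypothetical helper names. It assumes Linux with procfs
mounted at /proc. The process opens a temporary file, unlinks it, and
then stats its own /proc/<pid>/fd/<fd> entry. The path is gone, yet the
stat still succeeds and agrees with fstat() on the descriptor, which is
what you would expect if the lookup goes through the open file rather
than the path text.

```python
import os
import tempfile


def fd_identity(pid, fd):
    """Return (st_dev, st_ino) of the file behind /proc/<pid>/fd/<fd>.

    os.stat() follows the procfs link, which resolves to the open file
    itself instead of re-walking the path shown by readlink().
    """
    st = os.stat(f"/proc/{pid}/fd/{fd}")
    return (st.st_dev, st.st_ino)


def demo():
    # Create a temporary file, then unlink it so its path no longer exists.
    fd, path = tempfile.mkstemp()
    try:
        os.unlink(path)
        direct = os.fstat(fd)
        via_proc = fd_identity(os.getpid(), fd)
        # The link text is still readable, but it is only a description.
        link_text = os.readlink(f"/proc/{os.getpid()}/fd/{fd}")
        return (direct.st_dev, direct.st_ino), via_proc, link_text, os.path.exists(path)
    finally:
        os.close(fd)


if __name__ == "__main__":
    direct, via_proc, link_text, path_exists = demo()
    print("fstat:       ", direct)
    print("via /proc:   ", via_proc)
    print("link text:   ", link_text)
    print("path exists: ", path_exists)
```

The same reasoning applies across mount namespaces: stat -L on the
procfs entry reports the file the target process actually has open, and
the link text is not consulted.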

When you do a readlink(), the kernel has to create a path string, and it
has to do it from the perspective of the mount namespace of $pid, not
your monitoring command. The reason is that there may not even be a
corresponding path outside the mount namespace of $pid. Imagine I
created and opened "/etc/foobar" inside the container: that file may not
exist outside the container, so how could readlink() make a path
specific to your mount namespace?
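
A monitoring tool could at least detect when the link text comes from a
foreign mount namespace before trusting it as a local path. The
following is a rough sketch under my own assumptions, not what fuser or
lsof actually do: mount_namespace() and describe_fd() are hypothetical
helper names, and the tool needs permission to read the target's
/proc/<pid>/ns/mnt and /proc/<pid>/fd entries.

```python
import os


def mount_namespace(pid):
    """Return the mount namespace identifier of pid, e.g. 'mnt:[4026531840]'."""
    return os.readlink(f"/proc/{pid}/ns/mnt")


def describe_fd(pid, fd):
    """Return (link_text, same_ns) for one file descriptor of pid.

    same_ns is True when pid shares our mount namespace, in which case
    the link text can reasonably be treated as a path we could open too.
    Otherwise it is only meaningful inside the namespace of pid.
    """
    link_text = os.readlink(f"/proc/{pid}/fd/{fd}")
    same_ns = mount_namespace(pid) == mount_namespace("self")
    return link_text, same_ns


if __name__ == "__main__":
    # Inspect one of our own descriptors as a trivial example.
    fd = os.open("/dev/null", os.O_RDONLY)
    try:
        print(describe_fd(os.getpid(), fd))
    finally:
        os.close(fd)
```

The number inside the brackets is the same value that the MNTNS column
of ps shows in Craig's example.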

Hopefully this helps, but maybe I'm off base and missing the thrust of
your question. Let me know either way.

Stephen

>
> I get that if I could look at the device IDs in qemu or use nsenter to
> switch to its namespace, then the device should be 44 for the symlink
> and device (which it is and seems correct to me).
>
> How to replicate
> =============
> # uname -a
> Linux elmo 5.16.0-5-amd64 #1 SMP PREEMPT Debian 5.16.14-1 (2022-03-15)
> x86_64 GNU/Linux
>
> The easiest way to replicate this is to make a qemu virtual machine and
> have it mount a block device. I suspect there are other ways, but I
> don't have many things that mount a device and switch namespaces. The
> qemu process (here it is 136775) will have a different mount namespace.
>
> # ps -o pid,mntns,comm $$ 136775
> PID MNTNS COMMAND
> 136775 4026532762 qemu-system-x86
> 142359 4026531840 bash
>
> File descriptor 23 is what qemu is using to mount the block device
> # ls -l /proc/136775/fd/23
> lrwx------ 1 libvirt-qemu libvirt-qemu 64 Apr 12 16:34
> /proc/136775/fd/23 -> /dev/dm-8
>
> However, the dereferenced symlink and where the symlink points to show
> different data.
>
> # stat -L /proc/136775/fd/23
> File: /proc/136775/fd/23
> Size: 0 Blocks: 0 IO Block: 4096 block special file
> Device: 2ch/44d Inode: 9 Links: 1 Device type: fd,8
> Access: (0660/brw-rw----) Uid: (64055/libvirt-qemu) Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
> Birth: -
>
> # stat /dev/dm-8
> File: /dev/dm-8
> Size: 0 Blocks: 0 IO Block: 4096 block special file
> Device: 5h/5d Inode: 348 Links: 1 Device type: fd,8
> Access: (0660/brw-rw----) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2022-04-12 16:15:12.684434884 +1000
> Modify: 2022-04-12 16:15:12.684434884 +1000
> Change: 2022-04-12 16:15:12.684434884 +1000
> Birth: -
>
> If we change to the qemu process' mount namespace then we do see that
> /dev/dm-8 has the same device/inode as the symlink.
>
> # nsenter -m -t 136775 stat /dev/dm-8
> File: /dev/dm-8
> Size: 0 Blocks: 0 IO Block: 4096 block special file
> Device: 2ch/44d Inode: 9 Links: 1 Device type: fd,8
> Access: (0660/brw-rw----) Uid: (64055/libvirt-qemu) Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
> Birth: -
>
> Thanks for your time.
>
> - Craig