Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
From: Calvin Owens
Date: Tue Jun 09 2015 - 21:39:58 EST
On Tuesday 06/09 at 14:13 -0700, Andrew Morton wrote:
> On Mon, 8 Jun 2015 20:39:33 -0700 Calvin Owens <calvinowens@xxxxxx> wrote:
>
> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> >
> > This interface is very useful because it allows userspace to stat()
> > deleted files that are still mapped by some process, which enables a
> > much quicker and more accurate answer to the question "How much disk
> > space is being consumed by files that are deleted but still mapped?"
> > than is currently possible.
>
> Why is that information useful?
>
> I could perhaps think of some use for "How much disk space is being
> consumed by files that are deleted but still open", but to count the
> mmapped-then-unlinked files while excluding the opened-then-unlinked
> files seems damned peculiar.
Let's phrase the question a bit more generically:

    "How much disk space is being consumed by files that have been
    unlinked, but are still referenced by some process?"

There are two pieces to this problem:

    1) Unlinked files that are still open (whether mapped or not)
    2) Unlinked files that are not open, but are still mapped

You can track down everything in (1) using /proc/<pid>/fd/*, and you
can use stat() to figure out how much space they're using.

But directly measuring how much space (2) consumes is actually not
currently possible from userspace: there's no way to stat() the files.
You can get the inode number from /proc/<pid>/maps, but that still
doesn't get you anywhere because the file has been unlinked from the
filesystem.

So I'm not looking to measure (2) and exclude (1): I'm looking to have
a way to directly measure (2) at all.

The reason I say "directly", and I say "quicker and more accurate" in
the original message, is that there is a very ugly way to answer this
question right now: you sum up the number of blocks used by every file
on the disk and subtract it from what statfs() tells you. This
obviously stinks, and becomes untenable once your filesystem is large
enough.
> IOW, this changelog failed to explain the value of the patch. Bad
> changelog! Please sell it to us. Preferably with real-world use
> cases.
The real-world use case is catching long-lived processes that leak
references to temporary files and waste space on the disk. When such
processes leak file-backed mappings, this wasted space is especially
difficult to detect until it gets out of hand. The map_files/
interface eliminates this difficulty.
I've included a little test program at the end of this message to
illustrate what I'm getting at here. It creates a file at
/tmp/DELETEDFILE, maps it, closes the descriptor, and (when built with
-DDO_UNLINK) unlinks the file:
calvinowens@Haydn:~$ gcc test.c
calvinowens@Haydn:~$ ./a.out &
[1] 5832
Holding mapping at 0x7fe74d1ea000
calvinowens@Haydn:~$ lsof -p `pgrep a.out`
COMMAND  PID        USER   FD TYPE DEVICE SIZE/OFF    NODE NAME
a.out   5832 calvinowens  cwd  DIR  254,1     4096 3413033 /home/calvinowens
a.out   5832 calvinowens  rtd  DIR  254,1     4096       2 /
a.out   5832 calvinowens  txt  REG  254,1     7512 3408268 /home/calvinowens/a.out
a.out   5832 calvinowens  mem  REG  254,1  1729984 4456767 /lib/x86_64-linux-gnu/libc-2.19.so
a.out   5832 calvinowens  mem  REG  254,1   140928 4456619 /lib/x86_64-linux-gnu/ld-2.19.so
a.out   5832 calvinowens  mem  REG   0,32    32768  184946 /tmp/DELETEDFILE
a.out   5832 calvinowens   0u  CHR  136,2      0t0       5 /dev/pts/2
a.out   5832 calvinowens   1u  CHR  136,2      0t0       5 /dev/pts/2
a.out   5832 calvinowens   2u  CHR  136,2      0t0       5 /dev/pts/2
calvinowens@Haydn:~$ killall a.out
[1]+ Terminated ./a.out
calvinowens@Haydn:~$ gcc -DDO_UNLINK test.c
calvinowens@Haydn:~$ ./a.out &
[1] 5842
Holding mapping at 0x7fec8ae63000
calvinowens@Haydn:~$ lsof -p `pgrep a.out`
COMMAND  PID        USER   FD TYPE DEVICE SIZE/OFF    NODE NAME
a.out   5842 calvinowens  cwd  DIR  254,1     4096 3413033 /home/calvinowens
a.out   5842 calvinowens  rtd  DIR  254,1     4096       2 /
a.out   5842 calvinowens  txt  REG  254,1     7640 3408268 /home/calvinowens/a.out
a.out   5842 calvinowens  mem  REG  254,1  1729984 4456767 /lib/x86_64-linux-gnu/libc-2.19.so
a.out   5842 calvinowens  mem  REG  254,1   140928 4456619 /lib/x86_64-linux-gnu/ld-2.19.so
a.out   5842 calvinowens  DEL  REG   0,32           184946 /tmp/DELETEDFILE
a.out   5842 calvinowens   0u  CHR  136,2      0t0       5 /dev/pts/2
a.out   5842 calvinowens   1u  CHR  136,2      0t0       5 /dev/pts/2
a.out   5842 calvinowens   2u  CHR  136,2      0t0       5 /dev/pts/2
Notice the gap under "SIZE/OFF" in the 2nd output? This is because lsof
has no possible way to actually determine the leaked file's size.
That's the functionality "hole" I'm trying to fill with this patch.
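
With map_files/ readable, a tool could get at that size roughly as in the
sketch below. The helper stat_mapping() is hypothetical and only
illustrates the idea. It relies on the entries in /proc/<pid>/map_files/
being named <start>-<end> in hex, the same address range shown in
/proc/<pid>/maps, and on stat() following the entry to the mapped file.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: stat() the file behind the mapping [start, end) of
 * process pid through /proc/<pid>/map_files/. Returns 0 and fills *st on
 * success, -1 if the entry does not exist or is not accessible.
 */
int stat_mapping(pid_t pid, unsigned long start, unsigned long end,
		 struct stat *st)
{
	char path[128];

	/* Entry names are the hex address range, without a 0x prefix. */
	snprintf(path, sizeof(path), "/proc/%ld/map_files/%lx-%lx",
		 (long)pid, start, end);
	return stat(path, st);
}
```

For the second run above that would be stat_mapping(5842, 0x7fec8ae63000,
0x7fec8ae64000, &st), since the program maps 4096 bytes. On success,
st.st_nlink == 0 would flag the file as unlinked and st.st_blocks * 512
would be the space it still occupies. Without this patch the call needs
CAP_SYS_ADMIN and a kernel built with CONFIG_CHECKPOINT_RESTORE.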
Does that all seem sensible?
Thanks,
Calvin
--
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
	int ret, fd;
	void *map;

	fd = open("/tmp/DELETEDFILE", O_CREAT|O_TRUNC|O_RDWR, 0777);
	if (fd == -1)
		return -1;

	/* Give the file a non-zero size (32K) so there is something to report. */
	ret = ftruncate(fd, 32768);
	if (ret == -1)
		return -1;

	map = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
		   fd, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Drop the descriptor: from here on only the mapping pins the file. */
	close(fd);

#ifdef DO_UNLINK
	unlink("/tmp/DELETEDFILE");
#endif

	printf("Holding mapping at %p\n", map);

	/* Keep the mapping alive until the process is killed. */
	while (1)
		sleep(UINT_MAX);
}