Re: [PATCH v2 0/3] Implement /proc/<pid>/totmaps

From: Robert Foss
Date: Wed Aug 17 2016 - 12:49:11 EST

On 2016-08-17 09:03 AM, Michal Hocko wrote:
On Wed 17-08-16 11:31:25, Jann Horn wrote:
On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
On Tue 16-08-16 12:46:51, Robert Foss wrote:
[...]
$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
/^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
/proc/5025/smaps }"
[...]
Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
/^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
}"
User time (seconds): 0.37
System time (seconds): 0.45
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89

This is really unexpected. Where is the user time spent? Anyway, rather
than measuring some random processes I've tried to measure something
resembling the worst case. So I've created a simple program to mmap as
much as possible:

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
int main()
{
	/* keep mapping 4kB anonymous pages until mmap() fails */
	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
		;

	printf("pid:%d\n", getpid());
	pause();
	return 0;
}
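
For context, the number of mappings such a loop can create is capped by the vm.max_map_count sysctl (65530 by default), so on a default config this gives roughly 65k VMAs for smaps to walk. The current limit can be checked with:

$ sysctl vm.max_map_count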

Ah, nice, that's a reasonable test program. :)


So with a reasonable user space the parsing is really not all that time
consuming wrt. smaps handling. That being said, I am still very skeptical
about a dedicated proc file which accomplishes what userspace can do
in a trivial way.
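
For reference, such a userspace summation is only a few lines of plain C as well; a minimal sketch (illustrative only, assuming the target pid is passed as argv[1] and using a made-up file name):

$ cat sum_smaps.c
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long long rss = 0, pss = 0, val;
	FILE *f;

	if (argc < 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f)
		return 1;
	/* sum up the per-VMA Rss and Pss lines, mirroring the awk one-liner */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Rss: %llu", &val) == 1)
			rss += val;
		else if (sscanf(line, "Pss: %llu", &val) == 1)
			pss += val;
	}
	fclose(f);
	printf("rss:%llu pss:%llu\n", rss, pss);
	return 0;
}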

Now, since your numbers showed that all the time is spent in the kernel,
also create this test program to just read that file over and over again:

$ cat justreadloop.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <err.h>
#include <stdio.h>

char buf[1000000];

int main(int argc, char **argv) {
	printf("pid:%d\n", getpid());
	while (1) {
		int fd = open(argv[1], O_RDONLY);
		if (fd < 0) continue;
		if (read(fd, buf, sizeof(buf)) < 0)
			err(1, "read");
		close(fd);
	}
}
$ gcc -Wall -o justreadloop justreadloop.c
$

Now launch your test:

$ ./mapstuff
pid:29397

point justreadloop at it:

$ ./justreadloop /proc/29397/smaps
pid:32567

... and then check the performance stats of justreadloop:

# perf top -p 32567

This is what I see:

Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
Overhead Shared Object Symbol
30.43% [kernel] [k] format_decode
 9.12% [kernel] [k] number
 7.66% [kernel] [k] vsnprintf
 7.06% [kernel] [k] __lock_acquire
 3.23% [kernel] [k] lock_release
 2.85% [kernel] [k] debug_lockdep_rcu_enabled
 2.25% [kernel] [k] skip_atoi
 2.13% [kernel] [k] lock_acquire
 2.05% [kernel] [k] show_smap

This is a lot! I would expect the rmap walk to consume more, but it doesn't
even show up among the top consumers.

That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
time spent on evaluating format strings. The new interface
wouldn't have to spend that much time on format strings because there
isn't so much text to format.

Well, this is true of course, but I would much rather try to reduce the
overhead of the smaps file than add a new one. The following should help
already. I've measured a ~7% systime cut down. I guess there is still some
room for improvements, but I have to say I'm far from being convinced about
a new proc file just because we suck at dumping information to
userspace. If this was something like /proc/<pid>/stat, which is
essentially read all the time, then it would be a different question, but
are the rss and pss values going to be read all that often? If yes, why?
These are the questions which should be answered before we even start
considering the implementation.

@Sonny Rao: Maybe you can comment on how often and for how many processes this information is needed, and for which reasons it is useful.

---
From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxxx>
Date: Wed, 17 Aug 2016 14:00:13 +0200
Subject: [PATCH] proc, smaps: reduce printing overhead

seq_printf (used by show_smap) can be pretty expensive when dumping a
lot of numbers. Say we would like to get Rss and Pss from a particular
process. In order to measure a pathological case let's generate as many
mappings as possible:

$ cat max_mmap.c
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int main()
{
	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
		;

	printf("pid:%d\n", getpid());
	pause();
	return 0;
}

$ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps

would do the trick. The whole runtime is spent in kernel space, which is not
that unexpected because smaps is not the cheapest one (we have to
do the rmap walk etc.).

Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
User time (seconds): 0.01
System time (seconds): 0.44
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47

But perf says:
22.55% awk [kernel.kallsyms] [k] format_decode
14.65% awk [kernel.kallsyms] [k] vsnprintf
6.40% awk [kernel.kallsyms] [k] number
2.53% awk [kernel.kallsyms] [k] shmem_mapping
2.53% awk [kernel.kallsyms] [k] show_smap
1.81% awk [kernel.kallsyms] [k] lock_acquire

we are spending most of the time actually generating the output, which is
quite lame. Let's replace seq_printf by seq_puts and seq_put_decimal_ull.
This will give us:
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
User time (seconds): 0.00
System time (seconds): 0.41
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42

which is a ~7% improvement. Perf says:
28.87% awk [kernel.kallsyms] [k] seq_puts
5.30% awk [kernel.kallsyms] [k] vsnprintf
4.54% awk [kernel.kallsyms] [k] format_decode
3.73% awk [kernel.kallsyms] [k] show_smap
2.56% awk [kernel.kallsyms] [k] shmem_mapping
1.92% awk [kernel.kallsyms] [k] number
1.80% awk [kernel.kallsyms] [k] lock_acquire
1.75% awk [kernel.kallsyms] [k] print_name_value_kb

Reported-by: Jann Horn <jann@xxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
1 file changed, 25 insertions(+), 38 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84ef9de9..41c24c0811da 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
{
}

+static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
+{
+ seq_puts(m, name);
+ seq_put_decimal_ull(m, 0, val);
+ seq_puts(m, " kB\n");
+}
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
@@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)

show_map_vma(m, vma, is_pid);

- seq_printf(m,
- "Size: %8lu kB\n"
- "Rss: %8lu kB\n"
- "Pss: %8lu kB\n"
- "Shared_Clean: %8lu kB\n"
- "Shared_Dirty: %8lu kB\n"
- "Private_Clean: %8lu kB\n"
- "Private_Dirty: %8lu kB\n"
- "Referenced: %8lu kB\n"
- "Anonymous: %8lu kB\n"
- "AnonHugePages: %8lu kB\n"
- "ShmemPmdMapped: %8lu kB\n"
- "Shared_Hugetlb: %8lu kB\n"
- "Private_Hugetlb: %7lu kB\n"
- "Swap: %8lu kB\n"
- "SwapPss: %8lu kB\n"
- "KernelPageSize: %8lu kB\n"
- "MMUPageSize: %8lu kB\n"
- "Locked: %8lu kB\n",
- (vma->vm_end - vma->vm_start) >> 10,
- mss.resident >> 10,
- (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
- mss.shared_clean >> 10,
- mss.shared_dirty >> 10,
- mss.private_clean >> 10,
- mss.private_dirty >> 10,
- mss.referenced >> 10,
- mss.anonymous >> 10,
- mss.anonymous_thp >> 10,
- mss.shmem_thp >> 10,
- mss.shared_hugetlb >> 10,
- mss.private_hugetlb >> 10,
- mss.swap >> 10,
- (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
- vma_kernel_pagesize(vma) >> 10,
- vma_mmu_pagesize(vma) >> 10,
- (vma->vm_flags & VM_LOCKED) ?
+ print_name_value_kb(m, "Size: ", (vma->vm_end - vma->vm_start) >> 10);
+ print_name_value_kb(m, "Rss: ", mss.resident >> 10);
+ print_name_value_kb(m, "Pss: ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
+ print_name_value_kb(m, "Shared_Clean: ", mss.shared_clean >> 10);
+ print_name_value_kb(m, "Shared_Dirty: ", mss.shared_dirty >> 10);
+ print_name_value_kb(m, "Private_Clean: ", mss.private_clean >> 10);
+ print_name_value_kb(m, "Private_Dirty: ", mss.private_dirty >> 10);
+ print_name_value_kb(m, "Referenced: ", mss.referenced >> 10);
+ print_name_value_kb(m, "Anonymous: ", mss.anonymous >> 10);
+ print_name_value_kb(m, "AnonHugePages: ", mss.anonymous_thp >> 10);
+ print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
+ print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
+ print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
+ print_name_value_kb(m, "Swap: ", mss.swap >> 10);
+ print_name_value_kb(m, "SwapPss: ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
+ print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
+ print_name_value_kb(m, "MMUPageSize: ", vma_mmu_pagesize(vma) >> 10);
+ print_name_value_kb(m, "Locked: ", (vma->vm_flags & VM_LOCKED) ?
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
-
arch_show_smap(m, vma);
show_smap_vma_flags(m, vma);
m_cache_vma(m, vma);