Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

From: Michal Hocko
Date: Fri Sep 30 2016 - 05:49:22 EST


[CC Mike and Mel as they have seen some accounting oddities
when doing performance testing. They can share details but
essentially the system time just gets too high]

For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.GL23577@xxxxxxxxxxxxxx

I suspect this is mainly for short lived processes - like kernel compile
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.45
System time (seconds): 0.82
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.47
System time (seconds): 1.55
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache hot compile. I would expect most of
the time being spent in userspace.

$ perf report | grep kernel.vmlinux
2.01% as [kernel.vmlinux] [k] page_fault
0.59% cc1 [kernel.vmlinux] [k] page_fault
0.15% git [kernel.vmlinux] [k] page_fault
0.12% bash [kernel.vmlinux] [k] page_fault
0.11% sh [kernel.vmlinux] [k] page_fault
0.08% gcc [kernel.vmlinux] [k] page_fault
0.06% make [kernel.vmlinux] [k] page_fault
0.04% rm [kernel.vmlinux] [k] page_fault
0.03% ld [kernel.vmlinux] [k] page_fault
0.02% bash [kernel.vmlinux] [k] entry_SYSCALL_64
0.01% git [kernel.vmlinux] [k] entry_SYSCALL_64
0.01% cat [kernel.vmlinux] [k] page_fault
0.01% collect2 [kernel.vmlinux] [k] page_fault
0.00% sh [kernel.vmlinux] [k] entry_SYSCALL_64
0.00% rm [kernel.vmlinux] [k] entry_SYSCALL_64
0.00% grep [kernel.vmlinux] [k] page_fault

doesn't show anything unexpected.

Original Rik's reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular function and there were basically no changes
> > besides Riks ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> >
> > And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> >
> > root@test1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@test1:~# ./max_map 
> > pid:2990 maps:65515
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> >         User time (seconds): 0.00
> >         System time (seconds): 0.45
> >         Percent of CPU this job got: 98%
> >
>
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> >         User time (seconds): 0.18
> >         System time (seconds): 0.29
> >         Percent of CPU this job got: 97%
>
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
>
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
>
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
>
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
>
> Does the workload consistently get zero user
> time?
>
> If so, we need to dig further to see under
> what precise circumstances that happens.
>
> On my laptop, with kernel 4.6.3-300.fc24.x86_64
> I get this:
>
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
> User time (seconds): 0.64
> System time (seconds): 0.19
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83
>
> The main difference between your and my
> NO_HZ config seems to be that NO_HZ_FULL
> is set here. However, it is not enabled
> at run time, so both of our systems
> should only really get NO_HZ_IDLE
> effectively.
>
> Running tasks should get sampled with the
> regular timer tick, while they are running.
>
> In other words, vtime accounting should be
> disabled in both of our tests, for everything
> except the idle task.
>
> Do I need to do anything special to reproduce
> your bug, besides running the max mmap program
> and the awk script?
>
>

--
Michal Hocko
SUSE Labs