Re: [PATCH v1 00/19] Increase resolution of load weights

From: Ingo Molnar
Date: Thu May 12 2011 - 06:56:19 EST



* Nikhil Rao <ncrao@xxxxxxxxxx> wrote:

> On Tue, May 10, 2011 at 11:59 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
> >
> > * Nikhil Rao <ncrao@xxxxxxxxxx> wrote:
> >
> >> > Also, the above (and the other scale-adjustment changes) probably explains
> >> > why the instruction count went up on 64-bit.
> >>
> >> Yes, that makes sense. We see an increase in instruction count of about 2%
> >> with the new version of the patchset, down from 5.8% (will post the new
> >> patchset soon). Assuming 30% of the cost of pipe test is scheduling, that is
> >> an effective increase of approx. 6.7%. I'll post the data and some analysis
> >> along with the new version.
> >
> > An instruction count increase does not necessarily mean a linear slowdown: if
> > those instructions are cheaper or scheduled better by the CPU then often the
> > slowdown will be less.
> >
> > Sometimes a 1% increase in the instruction count can slow down a workload by
> > 5%, if the 1% increase does divisions, has complex data path dependencies or is
> > missing the branch-cache a lot.
> >
> > So you should keep an eye on the cycle count as well. Latest -tip's perf stat
> > can also measure 'stalled cycles':
> >
> > aldebaran:~/sched-tests> taskset 1 perf stat --repeat 3 ./pipe-test-1m
> >
> >  Performance counter stats for './pipe-test-1m' (3 runs):
> >
> >       6499.787926 task-clock               #    0.437 CPUs utilized            ( +-  0.41% )
> >         2,000,108 context-switches         #    0.308 M/sec                    ( +-  0.00% )
> >                 0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
> >               147 page-faults              #    0.000 M/sec                    ( +-  0.00% )
> >    14,226,565,939 cycles                   #    2.189 GHz                      ( +-  0.49% )
> >     6,897,331,129 stalled-cycles-frontend  #   48.48% frontend cycles idle     ( +-  0.90% )
> >     4,230,895,459 stalled-cycles-backend   #   29.74% backend  cycles idle     ( +-  1.31% )
> >    14,002,256,109 instructions             #    0.98  insns per cycle
> >                                            #    0.49  stalled cycles per insn  ( +-  0.02% )
> >     2,703,891,945 branches                 #  415.997 M/sec                    ( +-  0.02% )
> >        44,994,805 branch-misses            #    1.66% of all branches          ( +-  0.27% )
> >
> >       14.859234036  seconds time elapsed  ( +-  0.19% )
> >
> > The stalled-cycles frontend/backend metrics indicate whether a workload utilizes
> > the CPU's resources optimally. Looking at a 'perf record -e
> > stalled-cycles-frontend' and 'perf report' will show you the problem areas.
> >
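> > For example, something along these lines (the pipe-test binary name is just
> > illustrative here):
> >
> >   taskset 1 perf record -e stalled-cycles-frontend ./pipe-test-1m
> >   perf report
> >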
> > Most of the 'problem areas' will be unrelated to your code.
> >
> > A 'near perfectly utilized' CPU looks like this:
> >
> > aldebaran:~/opt> taskset 1 perf stat --repeat 10 ./fill_1b
> >
> >  Performance counter stats for './fill_1b' (10 runs):
> >
> >       1880.489837 task-clock               #    0.998 CPUs utilized            ( +-  0.15% )
> >                36 context-switches         #    0.000 M/sec                    ( +- 19.87% )
> >                 1 CPU-migrations           #    0.000 M/sec                    ( +- 59.63% )
> >                99 page-faults              #    0.000 M/sec                    ( +-  0.10% )
> >     6,027,432,226 cycles                   #    3.205 GHz                      ( +-  0.15% )
> >        22,138,455 stalled-cycles-frontend  #    0.37% frontend cycles idle     ( +- 36.56% )
> >        16,400,224 stalled-cycles-backend   #    0.27% backend  cycles idle     ( +- 38.12% )
> >    18,008,803,113 instructions             #    2.99  insns per cycle
> >                                            #    0.00  stalled cycles per insn  ( +-  0.00% )
> >     1,001,802,536 branches                 #  532.735 M/sec                    ( +-  0.01% )
> >            22,842 branch-misses            #    0.00% of all branches          ( +-  9.07% )
> >
> >        1.884595529  seconds time elapsed  ( +-  0.15% )
> >
> > Both stall counts are very low. This is pretty hard to achieve in general, so
> > before/after comparisons are used. For that there's 'perf diff' which you can
> > use to compare before/after profiles:
> >
> >  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
> >  [ perf record: Woken up 2 times to write data ]
> >  [ perf record: Captured and wrote 0.427 MB perf.data (~18677 samples) ]
> >  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
> >  [ perf record: Woken up 2 times to write data ]
> >  [ perf record: Captured and wrote 0.428 MB perf.data (~18685 samples) ]
> >  aldebaran:~/sched-tests> perf diff | head -10
> >  # Baseline  Delta          Shared Object                         Symbol
> >  # ........ ..........  .................  .............................
> >  #
> >     2.68%     +0.84%  [kernel.kallsyms]  [k] select_task_rq_fair
> >     3.28%     -0.17%  [kernel.kallsyms]  [k] fsnotify
> >     2.67%     +0.13%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >     2.46%     +0.11%  [kernel.kallsyms]  [k] pipe_read
> >     2.42%             [kernel.kallsyms]  [k] schedule
> >     2.11%     +0.28%  [kernel.kallsyms]  [k] copy_user_generic_string
> >     2.13%     +0.18%  [kernel.kallsyms]  [k] mutex_lock
> >
> >  ( Note: these were two short runs on the same kernel so the diff shows the
> >   natural noise of the profile of this workload. Longer runs are needed to
> >   measure effects smaller than 1%. )
> >
> > So there's a wide range of tools you can use to understand the precise
> > performance impact of your patch and in turn you can present to us what you
> > learned about it.
> >
> > Such analysis saves quite a bit of time for us scheduler maintainers and
> > makes performance-impacting patches a lot easier to apply :-)
> >
>
> Thanks for the info! I rebased the patchset against -tip and built
> perf from -tip. Here are the results from running pipe-test-100k bound
> to a single cpu with 100 repetitions.
>
> -tip (baseline):
>
> Performance counter stats for '/root/data/pipe-test-100k' (100 runs):
>
>        907,981,999 instructions             #    0.85  insns per cycle
>                                             #    0.34  stalled cycles per insn  ( +-  0.07% )
>      1,072,650,809 cycles                   #    0.000 GHz                      ( +-  0.13% )
>        305,678,413 stalled-cycles-backend   #   28.50% backend  cycles idle     ( +-  0.51% )
>        245,846,208 stalled-cycles-frontend  #   22.92% frontend cycles idle     ( +-  0.70% )
>
>        1.060303165 seconds time elapsed  ( +-  0.09% )
>
>
> -tip+patches:
>
> Performance counter stats for '/root/data/pipe-test-100k' (100 runs):
>
>        910,501,358 instructions             #    0.82  insns per cycle
>                                             #    0.36  stalled cycles per insn  ( +-  0.06% )
>      1,108,981,763 cycles                   #    0.000 GHz                      ( +-  0.17% )
>        328,816,295 stalled-cycles-backend   #   29.65% backend  cycles idle     ( +-  0.63% )
>        247,412,614 stalled-cycles-frontend  #   22.31% frontend cycles idle     ( +-  0.71% )
>
>        1.075497493 seconds time elapsed  ( +-  0.10% )
>
>
> From this latest run on -tip, the instruction count is about 0.28%
> higher and cycles are approx. 3.38% higher. From the stalled-cycle counts,
> it looks like most of this increase is coming from backend stalled
> cycles. It's not clear what type of stalls these are, but if I were to
> guess, I think it means stalls post-decode (i.e. functional units,
> load/store, etc.). Is that right?

Yeah, more functional work to be done, and probably a tad more expensive per
extra instruction executed.

How did branches and branch misses change?
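
Something like this should give directly comparable before/after numbers
(a sketch only, reusing the binary path from your runs):

  taskset 1 perf stat -e instructions,cycles,branches,branch-misses \
      --repeat 100 /root/data/pipe-test-100k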

> Another thing I noticed while running this on -tip is that low-weight
> task groups are poorly balanced on -tip (much worse than v2.6.39-rc7).
> I started bisecting between v2.6.39-rc7 and -tip to identify the
> source of this regression.

Ok, would be nice to figure out which commit did this.
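
A straight bisect between the two points should narrow it down, e.g. (ref
names are whatever your local -tip branch is called):

  git bisect start
  git bisect bad tip/master
  git bisect good v2.6.39-rc7
  # build + boot each step, re-run the group balance test, mark good/bad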

> I collected profiles from long runs of pipe-test (about 3m iterations)
> and tried running "perf diff" on the profiles. I cached the buildid
> from the two kernel images and associated test binary & libraries. The
> individual reports make sense, but I suspect something is wrong with
> the diff output.

Ok, i've Cc:-ed Arnaldo and Frederic, the perf diff output indeed looks
strange. (the perf diff output is repeated below.)

Thanks,

Ingo

> # perf buildid-cache -v -a boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV
> Adding 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
> boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV: Ok
> # perf buildid-cache -v -a boot.tip/vmlinux-2.6.39-tip-smp-DEV
> Adding 47737eb3efdd6cb789872311c354b106ec8e7477
> p/boot.tip/vmlinux-2.6.39-tip-smp-DEV: Ok
>
> # perf buildid-list -i perf.data | grep kernel
> 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1 [kernel.kallsyms]
>
> # perf buildid-list -i perf.data.old | grep kernel
> 47737eb3efdd6cb789872311c354b106ec8e7477 [kernel.kallsyms]
>
> # perf report -i perf.data.old -d [kernel.kallsyms] | head -n 10
> # dso: [kernel.kallsyms]
> # Events: 30K instructions
> #
> # Overhead       Command                       Symbol
> # ........  ............  ...........................
> #
>      5.55%  pipe-test-3m  [k] pipe_read
>      4.78%  pipe-test-3m  [k] schedule
>      3.68%  pipe-test-3m  [k] update_curr
>      3.52%  pipe-test-3m  [k] pipe_write
>
>
> # perf report -i perf.data -d [kernel.kallsyms] | head -n 10
> # dso: [kernel.kallsyms]
> # Events: 31K instructions
> #
> # Overhead       Command                                 Symbol
> # ........  ............  .....................................
> #
>      6.09%  pipe-test-3m  [k] pipe_read
>      4.86%  pipe-test-3m  [k] schedule
>      4.24%  pipe-test-3m  [k] update_curr
>      3.87%  pipe-test-3m  [k] find_next_bit
>
>
> # perf diff -v -d [kernel.kallsyms]
> build id event received for [kernel.kallsyms]:
> 47737eb3efdd6cb789872311c354b106ec8e7477
> ...
> build id event received for [kernel.kallsyms]:
> 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
> ...
> Looking at the vmlinux_path (6 entries long)
> Using /tmp/.debug/.build-id/47/737eb3efdd6cb789872311c354b106ec8e7477
> for symbols
> Looking at the vmlinux_path (6 entries long)
> Using /tmp/.debug/.build-id/17/b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
> for symbols
> # Baseline      Delta                                 Symbol
> # ........  .........  .....................................
> #
>      0.00%     +6.09%  0xffffffff8112a258 ! [k] pipe_read
>      0.00%     +4.86%  0xffffffff8141a206 ! [k] schedule
>      0.00%     +4.24%  0xffffffff810634d8 ! [k] update_curr
>      0.00%     +3.87%  0xffffffff8121f569 ! [k] find_next_bit
>      0.00%     +3.33%  0xffffffff81065cbf ! [k] enqueue_task_fair
>      0.00%     +3.25%  0xffffffff81065824 ! [k] dequeue_task_fair
>      0.00%     +2.77%  0xffffffff81129d10 ! [k] pipe_write
>      0.00%     +2.71%  0xffffffff8114ed97 ! [k] fsnotify
>
> The baseline numbers are showing up as zero and the deltas match the
> fractions from the -tip+patches report. Am I missing something here?
>