Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

From: Dave Chinner
Date: Mon Aug 15 2016 - 01:00:25 EST


On Sun, Aug 14, 2016 at 07:53:40PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2016 at 7:28 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >>
> >> Maybe your symbol table came from a old kernel, and functions moved
> >> around enough that the profile attributions ended up bogus.
> >
> > No, I don't think so. I don't install symbol tables on my test VMs,
> > I let /proc/kallsyms do that work for me. From an strace of 'perf
> > top -U -g":
>
> Ok. But something is definitely wrong with your profile.
>
> What does it say if you annotate that _raw_spin_unlock_irqrestore() function?
....

raw_spin_unlock_irqrestore /proc/kcore
¿
¿
¿
¿ Disassembly of section load0:
¿
¿ ffffffff81e628b0 <load0>:
¿ nop
¿ push %rbp
¿ mov %rsp,%rbp
¿ movb $0x0,(%rdi)
¿ nop
¿ mov %rsi,%rdi
¿ push %rdi
¿ popfq
99.35 ¿ nop
¿ decl %gs:0x7e1a9bc7(%rip)
0.65 ¿ ¿ je 25
¿ pop %rbp
¿ ¿ retq
¿25: callq 0xffffffff81002000
¿ pop %rbp
¿ ¿ retq


> I guarantee you that no, it's not spending 41% of time in
> spin_unlock_irqrestore. That just isn't a valid profile. There's
> something seriously wrong somewhere.
>
> The fact that you then get the same profile when you turn _off_
> CONFIG_DEBUG_SPINLOCK only proves there is something going on that is
> pure garbage.
>
> I suspect that what you did was to edit the .config file, remove
> DEBUG_SPINLOCK, and then do "make oldconfig" again.

Yes.

> And it got turned on again,

No. I'm not that stupid - I checked:

$ grep SPINLOCK .config
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_PARAVIRT_SPINLOCKS=y
# CONFIG_DEBUG_SPINLOCK is not set
$

> because you have one of the lock debugging
> options on that force spinlock debuggin on again:

> - DEBUG_WW_MUTEX_SLOWPATH
> - DEBUG_LOCK_ALLOC
> - PROVE_LOCKING

None of which are set:

$ grep 'DEBUG\|PROVE' .config |grep -v '#'
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_RODATA=y
CONFIG_SLUB_DEBUG=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PM_DEBUG=y
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_DEBUG_DEVRES=y
CONFIG_PNP_DEBUG_MESSAGES=y
CONFIG_XFS_DEBUG=y
CONFIG_OCFS2_DEBUG_MASKLOG=y
CONFIG_CIFS_DEBUG=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_LIST=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y
CONFIG_DEBUG_BOOT_PARAMS=y
$

> [ Light goes on ]
>
> Oh, no, I can see another possibility: you're not doing proper CPU
> profiles, you're doing some timer-irq profile, and the reason you get
> 41% on spin_unlock_irq_restore() is that that is where the interrupts
> are enabled again.
>
> Timer-interrupt based profiles are not useful either.

I've just been using whatever perf defaults to. Defaults are
supposed to be useful - if they aren't then perf needs to be fixed.
perf top reports this by default:

Samples: 118K of event 'cpu-clock', Event count (approx.): 793748915
Overhead Shared O Symbol ¿
34.48% [kernel] [k] _raw_spin_unlock_irqrestore ¿
7.89% [kernel] [k] copy_user_generic_string ¿
5.08% [kernel] [k] _raw_spin_unlock_irq

...

> Make sure you actually use "perf record -e cycles:pp" or something
> that uses PEBS to get real profiles using CPU performance counters.

WTF is PEBS? I'm not a CPU nerd, and I certainly don't expect to
have to learn all the intricacies of hardware performance counters
just to profile the kernel in a correct and sane manner. That's what
the *perf defaults* are supposed to do.


Anyway: `perf top -U -e cycles:pp`:

Samples: 301K of event 'cpu-clock:ppH', Event count (approx.): 69364814
Overhead Shared O Symbol ¿
30.89% [kernel] [k] _raw_spin_unlock_irqrestore ¿
7.04% [kernel] [k] _raw_spin_unlock_irq ¿
4.08% [kernel] [k] copy_user_generic_string ¿
2.44% [kernel] [k] get_page_from_freelist ¿
1.81% [kernel] [k] _raw_spin_lock

No change.

$ sudo perf record -e cycles:pp -a --all-kernel -- xfs_io -f -c "pwrite 0 47g" /mnt/scratch/fooey

# Samples: 2M of event 'cpu-clock:khppH'
# Event count (approx.): 588517250000
#
# Overhead Command Shared Object Symbol
# ........ ............... ................. ..........................................
#
83.09% swapper [kernel.kallsyms] [k] native_safe_halt
1.42% xfs_io [kernel.kallsyms] [k] copy_user_generic_string
1.26% kswapd3 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.24% kswapd1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.09% kswapd2 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.98% kswapd0 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.80% xfs_io [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.77% kworker/u34:2 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.73% xfs_io [kernel.kallsyms] [k] _raw_spin_unlock_irq
0.51% xfs_io [kernel.kallsyms] [k] get_page_from_freelist
0.39% xfs_io [kernel.kallsyms] [k] __block_commit_write.isra.29
0.16% xfs_io [kernel.kallsyms] [k] _raw_spin_lock
0.14% xfs_io [kernel.kallsyms] [k] up_write
0.14% kworker/u34:2 [kernel.kallsyms] [k] clear_page_dirty_for_io
0.14% kworker/u34:2 [kernel.kallsyms] [k] xfs_do_writepage
....

It's exactly the same profile, just reported as a percentage of 16
CPUs rather than normalised to a single CPU. From my ignorant
viewpoing, I'd say that's expected because perf is still using
"cpu-clock" event configuration.

The hardware event counters are undocumented in the perf man pages,
perf-list doesn't output a single "cpu" or "cycles" event counter,
or even what hardware event counters are available. Hence I've got
no idea if it's broken, why "cycles" (or "cpu-cycles") doesn't
apparently record "cycle" triggered events, or even what perf is
supposed to tell me is it's recording cycle triggered events.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx