Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

From: Dave Chinner
Date: Tue Aug 16 2016 - 18:03:35 EST


On Mon, Aug 15, 2016 at 06:51:42PM -0700, Linus Torvalds wrote:
> Anyway, including the direct reclaim call paths gets
> __remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to
> 0.26%. But perhaps more importlantly, looking at what __remove_mapping
> actually *does* (apart from the spinlock) gives us:
>
> - inside remove_mapping itself (0.11% on its own - flat cost, no
> child accounting)
>
> 48.50 â lock cmpxchg %edx,0x1c(%rbx)
>
> so that's about 0.05%
>
> - 0.40% __delete_from_page_cache (0.22%
> radix_tree_replace_clear_tags, 0.13%__radix_tree_lookup)
>
> - 0.06% workingset_eviction()
>
> so I'm not actually seeing anything *new* expensive in there. The
> __delete_from_page_cache() overhead may have changed a bit with the
> tagged tree changes, but this doesn't look like memcg.
>
> But we clearly have very different situations.
>
> What does your profile show for when you actually dig into
> __remove_mapping() itself?, Looking at your flat profile, I'm assuming
> you get

- 22.26% 0.93% [kernel] [k] __remove_mapping
- 3.86% __remove_mapping
- 18.35% _raw_spin_lock_irqsave
__pv_queued_spin_lock_slowpath
1.32% __delete_from_page_cache
- 0.92% _raw_spin_unlock_irqrestore
__raw_callee_save___pv_queued_spin_unlock

And the instruction level profile:

.....
ï xor %ecx,%ecx
ï mov %rax,%r15
0.39 ï mov $0x2,%eax
ï lock cmpxchg %ecx,0x1c(%rbx)
32.56 ï cmp $0x2,%eax
ï ï jne 12e
ï mov 0x20(%rbx),%rax
ï lea -0x1(%rax),%rdx
0.39 ï test $0x1,%al
ï cmove %rbx,%rdx
ï mov (%rdx),%rax
0.39 ï test $0x10,%al
ï ï jne 127
ï mov (%rbx),%rcx
ï shr $0xf,%rcx
ï and $0x1,%ecx
ï ï jne 14a
ï mov 0x68(%r14),%rax
36.03 ï xor %esi,%esi
ï test %r13b,%r13b
ï mov 0x50(%rax),%rdx
1.16 ï ï jne e8
0.96 ï a9: mov %rbx,%rdi
.....

Indicates most time on the cmpxchg for the page ref followed by the
grabbing on the ->freepage op vector:

freepage = mapping->a_ops->freepage;

> I come back to wondering whether maybe you're hitting some PV-lock problem.
>
> I know queued_spin_lock_slowpath() is ok. I'm not entirely sure
> __pv_queued_spin_lock_slowpath() is.

It's the same code AFAICT, except the pv version jumps straight to
the "queue" case.

> So I'd love to see you try the non-PV case, but I also think it might
> be interesting to see what the instruction profile for
> __pv_queued_spin_lock_slowpath() itself is. They share a lot of code
> (there's some interesting #include games going on to make
> queued_spin_lock_slowpath() actually *be*
> __pv_queued_spin_lock_slowpath() with some magic hooks), but there
> might be issues.

0.03 ï data16 data16 data16 xchg %ax,%ax
ï push %rbp
0.00 ï mov %rsp,%rbp
0.01 ï push %r15
ï push %r14
ï push %r13
0.01 ï push %r12
ï mov $0x18740,%r12
ï push %rbx
ï mov %rdi,%rbx
ï sub $0x10,%rsp
ï add %gs:0x7ef0d0e0(%rip),%r12
ï movslq 0xc(%r12),%rax
0.02 ï mov %gs:0x7ef0d0db(%rip),%r15d
ï add $0x1,%r15d
ï shl $0x12,%r15d
ï lea 0x1(%rax),%edx
0.01 ï mov %edx,0xc(%r12)
ï mov %eax,%edx
ï shl $0x4,%rax
ï add %rax,%r12
ï shl $0x10,%edx
ï movq $0x0,(%r12)
0.02 ï or %edx,%r15d
ï mov %gs:0x7ef0d0ad(%rip),%eax
0.00 ï movl $0x0,0x8(%r12)
0.01 ï mov %eax,0x40(%r12)
ï movb $0x0,0x44(%r12)
ï mov (%rdi),%eax
0.88 ï test %ax,%ax
ï ï jne 8f
0.02 ï mov $0x1,%edx
ï lock cmpxchg %dl,(%rdi)
0.38 ï test %al,%al
ï ï je 14a
0.02 ï 8f: mov %r15d,%eax
ï shr $0x10,%eax
ï xchg %ax,0x2(%rbx)
2.07 ï shl $0x10,%eax
ï test %eax,%eax
ï ï jne 171
ï movq $0x0,-0x30(%rbp)
0.02 ï ac: movzbl 0x44(%r12),%eax
0.97 ï mov $0x1,%r13d
ï mov $0x100,%r14d
ï cmp $0x2,%al
ï sete %al
ï movzbl %al,%eax
ï mov %rax,-0x38(%rbp)
0.00 ï ca: movb $0x0,0x44(%r12)
0.00 ï mov $0x8000,%edx
ï movb $0x1,0x1(%rbx)
ï ï jmp e6
0.04 ï db: pause
8.04 ï sub $0x1,%edx
ï ï je 229
ï e6: movzbl (%rbx),%eax
7.54 ï test %al,%al
ï ï jne db
0.10 ï mov %r14d,%eax
0.06 ï lock cmpxchg %r13w,(%rbx)
2.93 ? cmp $0x100,%ax
ï ï jne db
ï fc: mov (%rbx),%edx
0.37 ï mov $0x1,%ecx
ï or $0x1,%edx
ï ï jmp 114
0.01 ï108: mov %edx,%eax
ï lock cmpxchg %ecx,(%rbx)
0.26 ï cmp %edx,%eax
ï ï je 14a
ï mov %eax,%edx
ï114: mov %edx,%eax
0.00 ï xor %ax,%ax
ï cmp %r15d,%eax
ï ï je 108
0.01 ï cmpq $0x0,-0x30(%rbp)
? movb $0x1,(%rbx)
ï ï je 251
ï12c: mov -0x30(%rbp),%rsi
0.01 ï mov $0x1,%eax
ï mov $0x2,%edx
ï movl $0x1,0x8(%rsi)
0.11 ï lock cmpxchg %dl,0x44(%rsi)
2.34 ï cmp $0x1,%al
ï ï je 160
ï14a: decl %gs:0x7ef1b5bb(%rip)
0.02 ï add $0x10,%rsp
ï pop %rbx
ï pop %r12
0.00 ï pop %r13
ï pop %r14
ï pop %r15
ï pop %rbp
ï ï retq
ï160: mov -0x30(%rbp),%rsi
ï movb $0x3,(%rbx)
ï mov %rbx,%rdi
ï callq 0xffffffff810fcf90
ï ï jmp 14a
ï171: lea 0x44(%r12),%r14
ï mov %rax,%r13
ï shr $0x12,%eax
ï shr $0xc,%r13
ï sub $0x1,%eax
ï and $0x30,%r13d
ï cltq
ï add $0x18740,%r13
? add -0x7d8164c0(,%rax,8),%r13
0.03 ï mov %r12,0x0(%r13)
0.38 ï19c: mov $0x8000,%eax
ï ï jmp 1b7
0.04 ï1a3: test %al,%al
ï ï jne 1b0
ï movzbl 0x44(%r13),%edx
1.66 ï test %dl,%dl
ï ï jne 1f1
1.75 ï1b0: pause
64.57 ï sub $0x1
0.04 ï ï mov 0x8(%r12),%eax
0.03 ï ï test %eax,%eax
ï ïï jne 1d4
ï1c9:ï pause
ï ï mov 0x8(%r12),%eax
ï ï test %eax,%eax
ï ïï je 1c9
ï1d4:ï mov (%r12),%rax
ï ï test %rax,%rax
ï ï mov %rax,-0x30(%rbp)
0.05 ï ïï je ac
ï ï mov -0x30(%rbp),%rax
ï ï prefet (%rax)
0.25 ï ïï jmpq ac
ï1f1:ï mov $0x1,%eax
ï ï xchg %al,0x44(%r12)
ï ï mov 0x8(%r12),%eax
ï ï test %eax,%eax
ï ïï jne 213
ï ï mov %r14,%rdi
ï ï mov $0x1,%esi
ï ï callq 0xffffffff8109f7a0
ï ï xchg %ax,%ax
ï213:ï mov $0x1,%eax
ï ï xor %edi,%edi
ï ï lock cmpxchg %dil,(%r14)
ï ï mov 0x8(%r12),%eax
ï ïï jmpq 19c
ï229:ï cmpq $0x0,-0x38(%rbp)
ï ï movb $0x0,0x1(%rbx)
ï ïï je 276
ï234:ï movb $0x1,0x44(%r12)
ï ï mov $0x3,%esi
ï ï mov %rbx,%rdi
ï ï callq 0xffffffff8109f7a0
ï ï xchg %ax,%ax
ï ï movzbl (%rbx),%eax
ï ïï jmpq ca
ï251:ï mov (%r12),%rax
0.14 ï ï test %rax,%rax
ï ï mov %rax,-0x30(%rbp)
ï ïï jne 12c
ï262:ï pause
0.31 ï ï mov (%r12),%rax
ï ï test %rax,%rax
ï ïï je 262
ï ï mov %rax,-0x30(%rbp)
ï ïï jmpq 12c
ï276:? mov %r12,%rsi
ï ï mov %rbx,%rdi
ï ï callq 0xffffffff810fcf90
ï ï mov %rax,-0x38(%rbp)
ï ï mov $0x3,%eax
ï ï xchg %al,(%rbx)
ï ï test %al,%al
ï ïï jne 234
ï ï mov -0x38(%rbp),%rax
ï ï movb $0x1,(%rbx)
ï ï movq $0x0,(%rax)
ï ïïïjmpq fc

> For example, if you run a virtual 16-core system on a physical machine
> that then doesn't consistently give 16 cores to the virtual machine,
> you'll get no end of hiccups.

I learnt that lesson 6-7 years ago when I first started doing
baseline benchmarking to compare bare metal to virtualised IO
performance.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx