Re: [PATCH 3/3] ring_buffer: Use try_cmpxchg instead of cmpxchg

From: Uros Bizjak
Date: Wed Mar 01 2023 - 12:16:23 EST


On Wed, Mar 1, 2023 at 5:18 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Wed, 1 Mar 2023 10:37:28 +0100
> Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > > No, val should not be updated.
> >
> > Please see the definition of try_cmpxchg. The definition is written in
> > such a way that benefits loops as well as linear code and in the later
> > case depends on the compiler to eliminate assignment to val as a dead
> > assignment.
>
> I did read it ;-)
>
> >
> > The above change was done under the assumption that val is unused
> > after try_cmpxchg, and can be considered as a temporary
> > [Alternatively, the value could be copied to a local temporary and a
> > pointer to this local temporary could be passed to try_cmpxchg
> > instead. Compiler is smart enough to eliminate the assignment in any
> > case.]
> >
> > Even in the linear code, the change has considerable effect.
> > rb_head_page_replace is inlined in rb_get_reader_page and gcc-10.3.1
> > improves code from:
> >
> > ef8: 48 8b 0e mov (%rsi),%rcx
> > efb: 48 83 e1 fc and $0xfffffffffffffffc,%rcx
> > eff: 48 83 c9 01 or $0x1,%rcx
> > f03: 48 89 c8 mov %rcx,%rax
> > f06: f0 48 0f b1 3e lock cmpxchg %rdi,(%rsi)
> > f0b: 48 39 c1 cmp %rax,%rcx
> > f0e: 74 3b je f4b <rb_get_reader_page+0x13b>
> >
> > to:
> >
> > ed8: 48 8b 01 mov (%rcx),%rax
> > edb: 48 83 e0 fc and $0xfffffffffffffffc,%rax
> > edf: 48 83 c8 01 or $0x1,%rax
> > ee3: f0 48 0f b1 31 lock cmpxchg %rsi,(%rcx)
> > ee8: 74 3b je f25 <rb_get_reader_page+0x135>
>
> I'm using gcc 12.2.0 and have this;
>
> cmpxchg:
>
> 0000000000000e50 <rb_get_reader_page>:
> e50: 41 55 push %r13
> e52: 41 54 push %r12
> e54: 55 push %rbp
> e55: 53 push %rbx
> e56: 48 89 fb mov %rdi,%rbx
> e59: 9c pushf
> e5a: 5d pop %rbp
> e5b: fa cli
> e5c: 81 e5 00 02 00 00 and $0x200,%ebp
> e62: 0f 85 e6 01 00 00 jne 104e <rb_get_reader_page+0x1fe>
> e68: 48 8d 7b 1c lea 0x1c(%rbx),%rdi
> e6c: 31 c0 xor %eax,%eax
> e6e: ba 01 00 00 00 mov $0x1,%edx
> e73: f0 0f b1 53 1c lock cmpxchg %edx,0x1c(%rbx)
> e78: 0f 85 e5 01 00 00 jne 1063 <rb_get_reader_page+0x213>
> e7e: 41 bc 03 00 00 00 mov $0x3,%r12d
> e84: 4c 8b 6b 58 mov 0x58(%rbx),%r13
> e88: 49 8b 55 30 mov 0x30(%r13),%rdx
> e8c: 41 8b 45 18 mov 0x18(%r13),%eax
> e90: 48 8b 4a 08 mov 0x8(%rdx),%rcx
> e94: 39 c8 cmp %ecx,%eax
> e96: 0f 82 61 01 00 00 jb ffd <rb_get_reader_page+0x1ad>
> e9c: 48 8b 52 08 mov 0x8(%rdx),%rdx
>
>
> try_cmpxchg:
>
> 0000000000000e70 <rb_get_reader_page>:
> e70: 41 55 push %r13
> e72: 41 54 push %r12
> e74: 55 push %rbp
> e75: 53 push %rbx
> e76: 48 89 fb mov %rdi,%rbx
> e79: 9c pushf
> e7a: 5d pop %rbp
> e7b: fa cli
> e7c: 81 e5 00 02 00 00 and $0x200,%ebp
> e82: 0f 85 e0 01 00 00 jne 1068 <rb_get_reader_page+0x1f8>
> e88: 48 8d 7b 1c lea 0x1c(%rbx),%rdi
> e8c: 31 c0 xor %eax,%eax
> e8e: ba 01 00 00 00 mov $0x1,%edx
> e93: f0 0f b1 53 1c lock cmpxchg %edx,0x1c(%rbx)
> e98: 0f 85 df 01 00 00 jne 107d <rb_get_reader_page+0x20d>
> e9e: 41 bc 03 00 00 00 mov $0x3,%r12d
> ea4: 4c 8b 6b 58 mov 0x58(%rbx),%r13
> ea8: 49 8b 55 30 mov 0x30(%r13),%rdx
> eac: 41 8b 45 18 mov 0x18(%r13),%eax
> eb0: 48 8b 4a 08 mov 0x8(%rdx),%rcx
> eb4: 39 c8 cmp %ecx,%eax
> eb6: 0f 82 5b 01 00 00 jb 1017 <rb_get_reader_page+0x1a7>
> ebc: 48 8b 52 08 mov 0x8(%rdx),%rdx
>
> Which has no difference :-/

This lock cmpxchg belongs to some other locking primitive that were
already converted en masse to try_cmpxchg some time in the past. The
place we are looking for has a compare insn between lock cmpxchg and a
follow-up conditional jump in the original assembly code.

> > Again, even in linear code the change is able to eliminate the
> > assignment to a temporary reg and the compare. Please note that there
> > is no move *from* %rax register after cmpxchg, so the compiler
> > correctly eliminated unused assignment.
> >
> > >
> > > > }
> > > >
> > > > /*
> > > > @@ -2055,7 +2052,7 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer)
> > > > retries = 10;
> > > > success = false;
> > > > while (retries--) {
> > > > - struct list_head *head_page, *prev_page, *r;
> > > > + struct list_head *head_page, *prev_page;
> > > > struct list_head *last_page, *first_page;
> > > > struct list_head *head_page_with_bit;
> > > >
> > > > @@ -2073,9 +2070,8 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer)
> > > > last_page->next = head_page_with_bit;
> > > > first_page->prev = prev_page;
> > > >
> > > > - r = cmpxchg(&prev_page->next, head_page_with_bit, first_page);
> > > > -
> > > > - if (r == head_page_with_bit) {
> > > > + if (try_cmpxchg(&prev_page->next,
> > > > + &head_page_with_bit, first_page)) {
> > >
> > > No. head_page_with_bit should not be updated.
> >
> > As above, head_page_with_bit should be considered as a temporary, it
> > is initialized a couple of lines above cmpxchg and unused after. The
> > gcc-10.3.1 compiler even found some more optimization opportunities
> > and reordered the code from:
> >
> > 1364: 4d 8b 86 38 01 00 00 mov 0x138(%r14),%r8
> > 136b: 48 83 ce 01 or $0x1,%rsi
> > 136f: 48 89 f0 mov %rsi,%rax
> > 1372: 49 89 30 mov %rsi,(%r8)
> > 1375: 48 89 4f 08 mov %rcx,0x8(%rdi)
> > 1379: f0 48 0f b1 39 lock cmpxchg %rdi,(%rcx)
> > 137e: 48 39 c6 cmp %rax,%rsi
> > 1381: 74 78 je 13fb <rb_insert_pages+0xdb>
> >
> > to:
> >
> > 1343: 48 83 c8 01 or $0x1,%rax
> > 1347: 48 8b bb 38 01 00 00 mov 0x138(%rbx),%rdi
> > 134e: 48 89 07 mov %rax,(%rdi)
> > 1351: 48 89 4e 08 mov %rcx,0x8(%rsi)
> > 1355: f0 48 0f b1 31 lock cmpxchg %rsi,(%rcx)
> > 135a: 41 0f 94 c7 sete %r15b
> > 135e: 75 2f jne 138f <rb_insert_pages+0x8f>
> >
> > Please also note SETE insn in the above code, this is how the
> > "success" variable is handled in the loop. So, besides code size
> > improvement, other secondary improvements can be expected from the
> > change, too.
>
> For this gcc 12.2.0 did have a different result:
>
> cmpxchg:
>
> 1436: 49 89 c5 mov %rax,%r13
> 1439: eb 41 jmp 147c <rb_update_pages+0x7c>
> 143b: 48 89 df mov %rbx,%rdi
> 143e: e8 bd ed ff ff call 200 <rb_set_head_page>
> 1443: 48 89 c2 mov %rax,%rdx
> 1446: 48 85 c0 test %rax,%rax
> 1449: 74 37 je 1482 <rb_update_pages+0x82>
> 144b: 48 8b 48 08 mov 0x8(%rax),%rcx
> 144f: 48 8b bb 30 01 00 00 mov 0x130(%rbx),%rdi
> 1456: 48 89 c6 mov %rax,%rsi
> 1459: 4c 8b 83 38 01 00 00 mov 0x138(%rbx),%r8
> 1460: 48 83 ce 01 or $0x1,%rsi
> 1464: 48 89 f0 mov %rsi,%rax
> 1467: 49 89 30 mov %rsi,(%r8)
> 146a: 48 89 4f 08 mov %rcx,0x8(%rdi)
> 146e: f0 48 0f b1 39 lock cmpxchg %rdi,(%rcx)
> 1473: 48 39 c6 cmp %rax,%rsi
> 1476: 0f 84 97 01 00 00 je 1613 <rb_update_pages+0x213>
> 147c: 41 83 ee 01 sub $0x1,%r14d
> 1480: 75 b9 jne 143b <rb_update_pages+0x3b>
> 1482: 48 8b 43 10 mov 0x10(%rbx),%rax
> 1486: f0 ff 40 08 lock incl 0x8(%rax)
>
> try_cmpxchg:
>
> 1446: 49 89 c4 mov %rax,%r12
> 1449: 41 83 ee 01 sub $0x1,%r14d
> 144d: 0f 84 7b 01 00 00 je 15ce <rb_update_pages+0x1be>
> 1453: 48 89 df mov %rbx,%rdi
> 1456: e8 c5 ed ff ff call 220 <rb_set_head_page>
> 145b: 48 89 c2 mov %rax,%rdx
> 145e: 48 85 c0 test %rax,%rax
> 1461: 0f 84 67 01 00 00 je 15ce <rb_update_pages+0x1be>
> 1467: 48 8b 48 08 mov 0x8(%rax),%rcx
> 146b: 48 8b b3 30 01 00 00 mov 0x130(%rbx),%rsi
> 1472: 48 83 c8 01 or $0x1,%rax
> 1476: 48 8b bb 38 01 00 00 mov 0x138(%rbx),%rdi
> 147d: 48 89 07 mov %rax,(%rdi)
> 1480: 48 89 4e 08 mov %rcx,0x8(%rsi)
> 1484: f0 48 0f b1 31 lock cmpxchg %rsi,(%rcx)
> 1489: 75 be jne 1449 <rb_update_pages+0x39>
> 148b: 48 89 7a 08 mov %rdi,0x8(%rdx)
> 148f: 4c 89 e6 mov %r12,%rsi
> 1492: 48 89 ef mov %rbp,%rdi
> 1495: 4c 89 ab 30 01 00 00 mov %r13,0x130(%rbx)
> 149c: 4c 89 ab 38 01 00 00 mov %r13,0x138(%rbx)
> 14a3: e8 00 00 00 00 call 14a8 <rb_update_pages+0x98>
>
> It's different, but I'm not so sure it's beneficial.

This is the place we are looking for. Please see that a move at $1464
and a compare at $1473 is missing in the assembly from the patched
code. If it is beneficial ... well, we achieved the same result with
two instructions less in a loopy code.

Uros.

> >
> > I think that the above examples demonstrate various improvements that
> > can be achieved with effectively a one-line, almost mechanical change
> > to the code, even in linear code. It would be unfortunate to not
> > consider them.
>
> But with gcc 12.2.0 I don't really see the benefit. And I'm worried that
> the side effect of modifying the old variable could cause a bug in the
> future, if it is used after the try_cmpxchg(). At least for the second case.
>
> -- Steve
>