Re: [PATCH 3/3] ring_buffer: Use try_cmpxchg instead of cmpxchg

From: Uros Bizjak
Date: Wed Mar 01 2023 - 04:37:47 EST


On Tue, Feb 28, 2023 at 10:43 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Tue, 28 Feb 2023 18:59:29 +0100
> Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old.
> > x86 CMPXCHG instruction returns success in ZF flag, so this change
> > saves a compare after cmpxchg (and related move instruction in
> > front of cmpxchg).
> >
> > Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
> > fails. There is no need to re-read the value in the loop.
> >
> > No functional change intended.
>
> As I mentioned in the RCU thread, I have issues with some of the changes
> here.
>
> >
> > Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> > Cc: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
> > Signed-off-by: Uros Bizjak <ubizjak@xxxxxxxxx>
> > ---
> > kernel/trace/ring_buffer.c | 20 ++++++++------------
> > 1 file changed, 8 insertions(+), 12 deletions(-)
> >
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index 4188af7d4cfe..8f0ef7d12ddd 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -1493,14 +1493,11 @@ static bool rb_head_page_replace(struct buffer_page *old,
> > {
> > unsigned long *ptr = (unsigned long *)&old->list.prev->next;
> > unsigned long val;
> > - unsigned long ret;
> >
> > val = *ptr & ~RB_FLAG_MASK;
> > val |= RB_PAGE_HEAD;
> >
> > - ret = cmpxchg(ptr, val, (unsigned long)&new->list);
> > -
> > - return ret == val;
> > + return try_cmpxchg(ptr, &val, (unsigned long)&new->list);
>
> No, val should not be updated.

Please see the definition of try_cmpxchg. It is written in a way that
benefits loops as well as linear code; in the latter case it depends
on the compiler to eliminate the assignment to val as a dead
assignment.

The above change was made under the assumption that val is unused
after try_cmpxchg and can therefore be considered a temporary.
[Alternatively, the value could be copied to a local temporary and a
pointer to that local temporary passed to try_cmpxchg instead. The
compiler is smart enough to eliminate the assignment in either case.]

Even in linear code, the change has a considerable effect.
rb_head_page_replace is inlined into rb_get_reader_page, and
gcc-10.3.1 improves the code from:

ef8: 48 8b 0e mov (%rsi),%rcx
efb: 48 83 e1 fc and $0xfffffffffffffffc,%rcx
eff: 48 83 c9 01 or $0x1,%rcx
f03: 48 89 c8 mov %rcx,%rax
f06: f0 48 0f b1 3e lock cmpxchg %rdi,(%rsi)
f0b: 48 39 c1 cmp %rax,%rcx
f0e: 74 3b je f4b <rb_get_reader_page+0x13b>

to:

ed8: 48 8b 01 mov (%rcx),%rax
edb: 48 83 e0 fc and $0xfffffffffffffffc,%rax
edf: 48 83 c8 01 or $0x1,%rax
ee3: f0 48 0f b1 31 lock cmpxchg %rsi,(%rcx)
ee8: 74 3b je f25 <rb_get_reader_page+0x135>

Again, even in linear code the change eliminates the assignment to a
temporary register and the compare. Please note that there is no move
*from* the %rax register after cmpxchg, so the compiler correctly
eliminated the unused assignment.

>
> > }
> >
> > /*
> > @@ -2055,7 +2052,7 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer)
> > retries = 10;
> > success = false;
> > while (retries--) {
> > - struct list_head *head_page, *prev_page, *r;
> > + struct list_head *head_page, *prev_page;
> > struct list_head *last_page, *first_page;
> > struct list_head *head_page_with_bit;
> >
> > @@ -2073,9 +2070,8 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer)
> > last_page->next = head_page_with_bit;
> > first_page->prev = prev_page;
> >
> > - r = cmpxchg(&prev_page->next, head_page_with_bit, first_page);
> > -
> > - if (r == head_page_with_bit) {
> > + if (try_cmpxchg(&prev_page->next,
> > + &head_page_with_bit, first_page)) {
>
> No. head_page_with_bit should not be updated.

As above, head_page_with_bit should be considered a temporary: it is
initialized a couple of lines above the cmpxchg and is unused
afterwards. The gcc-10.3.1 compiler even found some further
optimization opportunities and reordered the code from:

1364: 4d 8b 86 38 01 00 00 mov 0x138(%r14),%r8
136b: 48 83 ce 01 or $0x1,%rsi
136f: 48 89 f0 mov %rsi,%rax
1372: 49 89 30 mov %rsi,(%r8)
1375: 48 89 4f 08 mov %rcx,0x8(%rdi)
1379: f0 48 0f b1 39 lock cmpxchg %rdi,(%rcx)
137e: 48 39 c6 cmp %rax,%rsi
1381: 74 78 je 13fb <rb_insert_pages+0xdb>

to:

1343: 48 83 c8 01 or $0x1,%rax
1347: 48 8b bb 38 01 00 00 mov 0x138(%rbx),%rdi
134e: 48 89 07 mov %rax,(%rdi)
1351: 48 89 4e 08 mov %rcx,0x8(%rsi)
1355: f0 48 0f b1 31 lock cmpxchg %rsi,(%rcx)
135a: 41 0f 94 c7 sete %r15b
135e: 75 2f jne 138f <rb_insert_pages+0x8f>

Please also note the SETE insn in the above code; this is how the
"success" variable is handled in the loop. So, besides the code size
improvement, other secondary improvements can be expected from the
change as well.

I think the above examples demonstrate the various improvements that
can be achieved with an effectively one-line, almost mechanical change
to the code, even in linear code. It would be unfortunate not to
consider them.

Uros.