Re: [PATCH 0/3] Improve trace/ring_buffer.c

From: Uros Bizjak
Date: Wed Mar 01 2023 - 03:36:03 EST


On Tue, Feb 28, 2023 at 8:35 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Tue, 28 Feb 2023 18:59:26 +0100
> Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > This series improves ring_buffer.c by changing the type of some
> > static functions to void or bool and
>
> Well, it's not really an improvement unless it has a functional change. But
> I'm OK with taking these.

"No functional changes intended" claim should be taken in the sense
that the same functionality is achieved in a more optimal way. As a
trivial example, changing some functions to bool eliminates many
unnecessary zero-extensions and results in smaller code:

size ring_buffer-*.o
   text    data     bss     dec     hex filename
  25599    1762       4   27365    6ae5 ring_buffer-vanilla.o
  25551    1762       4   27317    6ab5 ring_buffer-bool.o

Please also note that setting the output value with "SETcc r8" causes
a partial register stall on x86 when the full 32-bit register is read
afterwards. The compiler knows how to handle this issue (it emits a
dependency-breaking XOR of the full register in front of the SETcc),
but it is better to avoid the problem altogether. So, besides the
smaller code size, the above "No functional changes intended" change
also brings secondary benefits.
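
To illustrate, a minimal sketch of the kind of change involved (the
predicate below is hypothetical, not one of the actual ring_buffer.c
helpers):

#include <linux/types.h>

/* With an int return type, the compiler has to widen the 8-bit SETcc
 * result to 32 bits, typically with a MOVZBL zero-extension, or it
 * breaks the partial-register dependency with an XOR in front of the
 * SETcc. */
static int rb_flag_is_set_int(unsigned long flags, unsigned long mask)
{
        return (flags & mask) != 0;
}

/* With a bool return type, the 8-bit SETcc result is used as-is, so
 * neither the zero-extension nor the dependency-breaking XOR is
 * needed. */
static bool rb_flag_is_set_bool(unsigned long flags, unsigned long mask)
{
        return (flags & mask) != 0;
}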

> > uses try_cmpxchg instead of
> > cmpxchg (*ptr, old, new) == old where appropriate.
>
> Now, the try_cmpxchg() could be an improvement on x86.

True. A concrete example is ring_buffer_record_off(), where the
cmpxchg loop improves from:

78: 8b 57 08 mov 0x8(%rdi),%edx
7b: 89 d6 mov %edx,%esi
7d: 89 d0 mov %edx,%eax
7f: 81 ce 00 00 10 00 or $0x100000,%esi
85: f0 0f b1 31 lock cmpxchg %esi,(%rcx)
89: 39 d0 cmp %edx,%eax
8b: 75 eb jne 78 <ring_buffer_record_off+0x8>

to:

1bb: 89 c2 mov %eax,%edx
1bd: 81 ca 00 00 10 00 or $0x100000,%edx
1c3: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
1c7: 75 f2 jne 1bb <ring_buffer_record_off+0xb>

As demonstrated above, both the move to a temporary register and the
comparison against that temporary are eliminated.
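
At the source level, the transformation looks roughly like this (a
sketch using a hypothetical atomic_t flag setter, not the actual
ring_buffer_record_off() body):

#include <linux/atomic.h>

/* Classic pattern: reload, compute, then compare the returned value. */
static void rb_set_flag_cmpxchg(atomic_t *v, int flag)
{
        int old, new;

        do {
                old = atomic_read(v);
                new = old | flag;
        } while (atomic_cmpxchg(v, old, new) != old);
}

/* try_cmpxchg pattern: on failure, 'old' is updated with the current
 * value, so the loop needs no reload and no separate comparison. */
static void rb_set_flag_try_cmpxchg(atomic_t *v, int flag)
{
        int new, old = atomic_read(v);

        do {
                new = old | flag;
        } while (!atomic_try_cmpxchg(v, &old, new));
}

On x86 this maps directly to the dumps above: CMPXCHG already sets ZF
and leaves the current value in %eax on failure, so both the temporary
copy and the explicit CMP disappear.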

The assembly dumps above were generated with gcc-10.3.1 to show the
direct benefits of the change. gcc-12 and later can additionally use
the likely/unlikely annotations inside the try_cmpxchg() definition
and generate fast paths through the code (as you mentioned in the RCU
patch-set thread).

Please also note that try_cmpxchg() also generates better code on
targets that do not define an architecture-specific version and fall
back to the generic implementation, as reported in [1] and the
follow-up messages.
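
The generic fallback is essentially this pattern (a simplified sketch,
not the exact macro from the kernel's atomic fallback headers):

#define try_cmpxchg_sketch(ptr, oldp, new)                           \
({                                                                   \
        typeof(*(ptr)) __old = *(oldp);                              \
        typeof(*(ptr)) __ret = cmpxchg((ptr), __old, (new));         \
        if (unlikely(__ret != __old))                                \
                *(oldp) = __ret; /* write back current value */      \
        likely(__ret == __old);                                      \
})

The comparison result is computed once and reused for both the return
value and the write-back, and in a loop the updated 'old' is carried
over in a register instead of being re-read from memory.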

The API of try_cmpxchg() is written in a way that benefits both loops
and linear code; see the sketch below. I will discuss this further in
the reply to PATCH 3/3.
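
As a minimal hypothetical illustration of the linear (non-loop) case:

#include <linux/atomic.h>
#include <linux/errno.h>

/* Single-shot update, no retry. The function name and semantics are
 * made up for illustration only. */
static int rb_claim_slot(atomic_t *v, int free, int busy)
{
        int old = free;

        if (!atomic_try_cmpxchg(v, &old, busy)) {
                /* 'old' already holds the current value, so no extra
                 * atomic_read() is needed to inspect it. */
                return old == busy ? -EBUSY : -EINVAL;
        }

        return 0;
}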

[1] https://lore.kernel.org/lkml/871qwgmqws.fsf@xxxxxxxxxxxxxxxxxx/

Uros.