Re: [PATCH v5 6/7] r8169: Coalesce mac ocp write and modify for 8125 and 8125B start to reduce spinlocks

From: Akira Yokosawa
Date: Tue Oct 31 2023 - 04:24:00 EST


Hello Mirsad,

[most CCs dropped]

I'm responding to your comment quoted below. It caught eyes of me
who happens to be a reviewer of LKMM and a LaTeX advisor to perfbook.

On Mon, 30 Oct 2023 16:02:28 +0100, Mirsad Todorovac wrote:
> On 10/30/23 15:02, Heiner Kallweit wrote:
[...]
>>
>> All this manual locking and unlocking makes the code harder
>> to read and more error-prone. Maybe, as a rule of thumb:
>> If you can replace a block with more than 10 mac ocp ops,
>> then fine with me
> As I worked with another German developer, Mr. Frank Heckenbach from the GNU Pascal project,
> I know that Germans are pedantic and reliable :-)
>
> If this rtl_hw_start_8125_common() is called only once, then maybe every memory bus cycle
> isn't worth saving, and then maybe the additional complexity isn't worth adding (but it
> was fun doing, and it works with my NIC).
>
> AFAIK, a spin_lock_irqsave()/spin_unlock_irqrestore() isn't a free lunch as you know, and I read
> from the manuals that on modern CPUs a locked ADD $0, -128(%esp) or something takes about 50
> clock cycles, in which all cores have to wait.

Do you mean, while one of x86 CPUs is executing "lock; addl $0, -4(%esp)"
aka smp_mb(), bus locking prevents all the other CPUs in the system
connected to the bus from doing any memory accesses ???

If it is the case, I believe you are missing the state of the art
optimization of x86 memory system architecture, where most of atomic
operations are done using cache locking. Bus locking is used only
when it is unavoidable.

Hint: A short introduction can be found at stackoverflow.com [1].
Quote of (then) section 7.1.4 from Intel's SDM vol 3A in the answer
should suffice.

[1]: https://stackoverflow.com/questions/3339141/x86-lock-question-on-multi-core-cpus

A reachable link to Intel SDM should be somewhere in perfbook's bibliography.
The relevant section in Vol 3A is now "2.8.5 Controlling the Processor".

Hope this helps,
Akira

>
> Doing that in storm of 10 lock/unlock pairs amounts to 500 cycles or 125 ns in the best case
> on a 4 GHz CPU.
>
> But I trust that you as the maintainer have the big picture and greater insight in the actual hw.