Re: [PATCH v5 6/7] r8169: Coalesce mac ocp write and modify for 8125 and 8125B start to reduce spinlocks

From: Heiner Kallweit
Date: Mon Oct 30 2023 - 11:54:07 EST


On 30.10.2023 16:02, Mirsad Todorovac wrote:
>
>
> On 10/30/23 15:02, Heiner Kallweit wrote:
>> On 29.10.2023 19:36, Mirsad Goran Todorovac wrote:
>>> Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
>>> the startup of 8125 and 8125B involve implicit spin_lock_irqsave() and
>>> spin_unlock_irqrestore() on each invocation.
>>>
>>> Coalesced with the corresponding helpers r8168_mac_ocp_write_seq() and
>>> r8168_mac_ocp_modify_seq() into sequential write or modidy with a sinqle
>>> pair of spin_lock_irqsave() and spin_unlock_irqrestore(), these calls
>>> reduce overall lock contention.
>>>
>>> Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
>>> Fixes: 0439297be9511 ("r8169: add support for RTL8125B")
>>> Cc: Heiner Kallweit <hkallweit1@xxxxxxxxx>
>>> Cc: Marco Elver <elver@xxxxxxxxxx>
>>> Cc: nic_swsd@xxxxxxxxxxx
>>> Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
>>> Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
>>> Cc: Jakub Kicinski <kuba@xxxxxxxxxx>
>>> Cc: Paolo Abeni <pabeni@xxxxxxxxxx>
>>> Cc: netdev@xxxxxxxxxxxxxxx
>>> Cc: linux-kernel@xxxxxxxxxxxxxxx
>>> Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@xxxxxxxxxxxx/
>>> Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@xxxxxxxxxxxx/
>>> Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@xxxxxxxxxxxx>
>>> ---
>>> v5:
>>>   added unlocked primitives to allow mac ocs modify grouping
>>>   applied coalescing of mac ocp writes/modifies for 8168ep and 8117
>>>   some formatting fixes to please checkpatch.pl
>>>
>>> v4:
>>>   fixed complaints as advised by Heiner and checkpatch.pl
>>>   split the patch into five sections to be more easily manipulated and reviewed
>>>   introduced r8168_mac_ocp_write_seq()
>>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>>
>>> v3:
>>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>>
>>>   drivers/net/ethernet/realtek/r8169_main.c | 75 +++++++++++++----------
>>>   1 file changed, 44 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>> index 50fbacb05953..0778cd0ba2e0 100644
>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>> @@ -3553,6 +3553,28 @@ DECLARE_RTL_COND(rtl_mac_ocp_e00e_cond)
>>>     static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>>>   {
>>> +    static const struct e_info_regmaskset e_info_8125_common_1[] = {
>>> +        { 0xd3e2, 0x0fff, 0x03a9 },
>>> +        { 0xd3e4, 0x00ff, 0x0000 },
>>> +        { 0xe860, 0x0000, 0x0080 },
>>> +    };
>>> +
>>> +    static const struct e_info_regmaskset e_info_8125_common_2[] = {
>>> +        { 0xc0b4, 0x0000, 0x000c },
>>> +        { 0xeb6a, 0x00ff, 0x0033 },
>>> +        { 0xeb50, 0x03e0, 0x0040 },
>>> +        { 0xe056, 0x00f0, 0x0030 },
>>> +        { 0xe040, 0x1000, 0x0000 },
>>> +        { 0xea1c, 0x0003, 0x0001 },
>>> +        { 0xe0c0, 0x4f0f, 0x4403 },
>>> +        { 0xe052, 0x0080, 0x0068 },
>>> +        { 0xd430, 0x0fff, 0x047f },
>>> +        { 0xea1c, 0x0004, 0x0000 },
>>> +        { 0xeb54, 0x0000, 0x0001 },
>>> +    };
>>> +
>>> +    unsigned long flags;
>>> +
>>>       rtl_pcie_state_l2l3_disable(tp);
>>>         RTL_W16(tp, 0x382, 0x221b);
>>> @@ -3560,47 +3582,38 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>>>       RTL_W16(tp, 0x4800, 0);
>>>         /* disable UPS */
>>> -    r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
>>> +
>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>> +    __r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
>>>         RTL_W8(tp, Config1, RTL_R8(tp, Config1) & ~0x10);
>>>   -    r8168_mac_ocp_write(tp, 0xc140, 0xffff);
>>> -    r8168_mac_ocp_write(tp, 0xc142, 0xffff);
>>> +    __r8168_mac_ocp_write(tp, 0xc140, 0xffff);
>>> +    __r8168_mac_ocp_write(tp, 0xc142, 0xffff);
>>>   -    r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
>>> -    r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
>>> -    r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);
>>> +    __r8168_mac_ocp_modify_seq(tp, e_info_8125_common_1);
>>>         /* disable new tx descriptor format */
>>> -    r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
>>> +    __r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
>>>   -    if (tp->mac_version == RTL_GIGA_MAC_VER_63)
>>> -        r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
>>> -    else
>>> -        r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
>>> +    if (tp->mac_version == RTL_GIGA_MAC_VER_63) {
>>> +        __r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
>>> +        __r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
>>> +    } else {
>>> +        __r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
>>> +        __r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
>>> +    }
>>> +
>>> +    __r8168_mac_ocp_modify_seq(tp, e_info_8125_common_2);
>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>   -    if (tp->mac_version == RTL_GIGA_MAC_VER_63)
>>> -        r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
>>> -    else
>>> -        r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
>>> -
>>> -    r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
>>> -    r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
>>> -    r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
>>> -    r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
>>> -    r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
>>> -    r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
>>> -    r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
>>> -    r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
>>> -    r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);
>>> -
>>> -    r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
>>> -    r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
>
> [1] I think we have a candidate for a squeeze here in this driver.
>
>>>       udelay(1);
>>> -    r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
>>> -    RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
>>>   -    r8168_mac_ocp_write(tp, 0xe098, 0xc302);
>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>> +    __r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
>>> +    RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
>>> +    __r8168_mac_ocp_write(tp, 0xe098, 0xc302);
>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>         rtl_loop_wait_low(tp, &rtl_mac_ocp_e00e_cond, 1000, 10);
>>>  
>>
>> All this manual locking and unlocking makes the code harder
>> to read and more error-prone. Maybe, as a rule of thumb:
>> If you can replace a block with more than 10 mac ocp ops,
>> then fine with me
> As I worked with another German developer, Mr. Frank Heckenbach from the GNU Pascal project,
> I know that Germans are pedantic and reliable :-)
>
> If this rtl_hw_start_8125_common() is called only once, then maybe every memory bus cycle
> isn't worth saving, and then maybe the additional complexity isn't worth adding (but it
> was fun doing, and it works with my NIC).
>
> AFAIK, a spin_lock_irqsave()/spin_unlock_irqrestore() isn't a free lunch as you know, and I read
> from the manuals that on modern CPUs a locked ADD $0, -128(%esp) or something takes about 50
> clock cycles, in which all cores have to wait.
>
> Doing that in storm of 10 lock/unlock pairs amounts to 500 cycles or 125 ns in the best case
> on a 4 GHz CPU.
>
> But I trust that you as the maintainer have the big picture and greater insight in the actual hw.
>
Big picture: maybe, as I've been working on r8169 for about 5 yrs
Greater insight in the actual hw: not really, because Realtek doesn't publish datasheets and
errata information. I just know the history of a lot of needed quirks in r8169, and I can recall
what we tried to improve and had to roll back because users had problems.
Most famous example is ASPM L1 + sub-state support. ASPM with r8169-driven NIC versions has a
long history of users facing slow network connectivity / tx timeouts / complete NIC stalls.
And it's still not solved with the newest NIC versions. But maybe not only Realtek is to blame.
Basically every consumer mainboard comes with a Realtek NIC, chipset issues and BIOS bugs
contribute to the mess.

> Elon Musk said that the greatest error in engineering is optimising stuff that never should
> have been written.
>
> As I like my RTL 8125b NIC to be fast (TM), I am not giving up on it.
>
Does this mean that you're not achieving the 2500Mbps line rate with r8169 as-is?

> Just recalled that you mentioned that it supports RSS. Is it feasible to add it to the r8169
> driver?
>
Feasible: definitely, as you "just" have to take the RSS-related code from r8125 and adjust it
to meet mainline code quality criteria
Question is which actual benefit it brings, and whether it's worth the additional complexity.
As a starting point you could compare performance KPI's when using r8169 vs. r8125.

> It is OK for me that we have a formal and less formal mode of communication and switch between
> them.
>
> Thank you for all your quick reviews, much obliged, thank you for partially accepting the patches,
> but I think I need to sleep over it before submitting a new version.
>
> Regards,
> Mirsad Todorovac