Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention

From: Heiner Kallweit
Date: Sun Nov 05 2023 - 04:26:02 EST

Next message: kernel test robot: "drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c:957:12: warning: stack frame size (2064) exceeds limit (2048) in 'vcn_v4_0_5_start'"
Previous message: Thomas Weißschuh: "[PATCH] selftests/nolibc: add script to run testsuite"
In reply to: Mirsad Todorovac: "Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention"
Next in thread: Mirsad Todorovac: "Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 05.11.2023 01:15, Mirsad Todorovac wrote:
>
>
> On 11/4/23 23:37, Heiner Kallweit wrote:
>> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>>> The motivation for these helpers was the locking overhead of 130 consecutive
>>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>>> if the PHY is powered-down.
>>>
>>> To quote Heiner:
>>>
>>>      On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>      This was reported in [0] and confirmed by Realtek. Realtek provided
>>>      a sequence to fix the RX unit after PHY wakeup.
>>>
>>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>>> spin_unlock_irqrestore().
>>>
>>> Each mac ocp write is made of:
>>>
>>>      static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>                        u32 data)
>>>      {
>>>          if (rtl_ocp_reg_failure(reg))
>>>              return;
>>>
>>>          RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>      }
>>>
>>>      static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>                      u32 data)
>>>      {
>>>          unsigned long flags;
>>>
>>>          raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>          __r8168_mac_ocp_write(tp, reg, data);
>>>          raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>      }
>>>
>>> Register programming is done through RTL_W32() macro which expands into
>>>
>>>      #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>
>>> which is further (on Alpha):
>>>
>>>      extern inline void writel(u32 b, volatile void __iomem *addr)
>>>      {
>>>          mb();
>>>          __raw_writel(b, addr);
>>>      }
>>>
>>> or on i386/x86_64:
>>>
>>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>>      static inline void name(type val, volatile void __iomem *addr) \
>>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>>      "m" (*(volatile type __force *)addr) barrier); }
>>>
>>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>
>>> This obviously involves iat least a compiler barrier.
>>>
>>> mb() expands into something like this i.e. on x86_64:
>>>
>>>      #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>
>>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>
>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>>> for certain time I/O takes to finish.
>>>
>>> In a sequential case of RTL register programming, the writes to RTL registers
>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>> number of bus stalls in a multicore or multi-CPU system.
>>>
>>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>>> provided to reduce lock contention:
>>>
>>>      static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>      {
>>>
>>>          ...
>>>
>>>          /* The following Realtek-provided magic fixes an issue with the RX unit
>>>           * getting confused after the PHY having been powered-down.
>>>           */
>>>
>>>          static const struct recover_8411b_info init_zero_seq[] = {
>>>              { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>>>              ...
>>>          };
>>>
>>>          ...
>>>
>>>          r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>
>>>          ...
>>>
>>>      }
>>>
>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>> functions that only changed the function names and the ending of the line, so the actual
>>> hex data is unchanged.
>>>
>>> To repeat, the reason for the introduction of the original commit
>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>> the programming sequence to reach RTL NIC registers.
>>>
>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>
>>> v6:
>>> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>>> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>>
>>> v5:
>>> attempted some new optimisations, which were rejected, but not all and not completely.
>>>
>>> v4:
>>> fixed complaints as advised by Heiner and checkpatch.pl.
>>> split the patch into five sections to be more easily manipulated and reviewed
>>> introduced r8168_mac_ocp_write_seq()
>>> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>>
>>> v3:
>>> removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>> avoided duplication of RTL_W32() call code as advised by Heiner.
>>>
>>> Mirsad Goran Todorovac (5):
>>>    r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>>>      stalls
>>>    r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>>>      spinlock stalls
>>>    r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>>>      spinlocks
>>>    r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>>>      spinlock contention
>>>    r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>>>      spinlocks
>>>
>>> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>>> 1 file changed, 150 insertions(+), 154 deletions(-)
>>>
>
> Hi, Mr. Kallweit,
>
> So good to hear so soon from you. I'm encouraged that you are positive about improving
> the speed and reducing the size of the Realtek drivers.
>
>> You still write:
>> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
>> even though you were informed that that's not the case.
>
> I was not convinced. There is no such thing as a free lunch, and there is no locking
> without affecting other cores, or locking would not make sense.
>
>> There's no actual problem, therefore your Fixes tags are incorrect.
>
> Mea culpa - my mistake, I will fix that in the next version.
>
>> Also net-next is closed at the moment.
>
> There is no problem with that, as these are only optimisation fixes, not zero day
> exploits. I am a patient person.
>
>> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
>> patch 2 is worth adding all the helpers in patch 1.
>
> I merely followed and mimed driver style from the constructions like this one:
>
>         static const struct ephy_info e_info_8168e_1[] = {
>                 { 0x00, 0x0200, 0x0100 },
>                 { 0x00, 0x0000, 0x0004 },
>                 { 0x06, 0x0002, 0x0001 },
>                 { 0x06, 0x0000, 0x0030 },
>                 { 0x07, 0x0000, 0x2000 },
>                 { 0x00, 0x0000, 0x0020 },
>                 { 0x03, 0x5800, 0x2000 },
>                 { 0x03, 0x0000, 0x0001 },
>                 { 0x01, 0x0800, 0x1000 },
>                 { 0x07, 0x0000, 0x4000 },
>                 { 0x1e, 0x0000, 0x2000 },
>                 { 0x19, 0xffff, 0xfe6c },
>                 { 0x0a, 0x0000, 0x0040 }
>         };
>
>         rtl_set_def_aspm_entry_latency(tp);
>
>         rtl_ephy_init(tp, e_info_8168e_1);
>
> Here you did not think that introducing an array reduced code readability.
>
> My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock
> on each RTL_W32() write. I am convinced that a driver with less
> raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better
> with more NICs and more cores.
>
Then please focus on hot paths where it actually could make a difference,
and provide numbers instead of a purely theoretical discussion.

> You said nothing to convinced me otherwise.
>
> But I am merely defending my point, this by no means implies disrespect or overlooking
> your contribution to the source as a coder and a a maintainer.
>
> Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous,
> and it is logical to use less locking, as locking is expensive. "barrier" in writev()
> guarantees sequential orders of write, and locking and unlocking on each read/modify/write
> is unnecessary overhead, IMHO.
>
> As the conclusion, I would like to emphasise that improving lock contention for the code
> is by no means a personal attack on the maintainer or a breach of the Code of Conduct.
>
> If you are so much against the changes which Mr. Jacob Keller from Intel reviewed,
> maybe we can cool emotions and start thinking rationally.
>
> Additionally, I would like to "inline" many functions, as I think that call/return
> sequences with stack frame generation /destruction are more expensive than inlining the
> small one liners.
>

Mainline standard is to let the compiler decide on inlining.

> But I will certainly respect your opinion on the matter as a maintainer.
>
> What I realise that I might be optimising the cold paths of the code, but from your emails
> it seems like nothing is worth optimising in this driver, and with all due respect Sir,
> I think that is dead wrong.
>

Nobody ever said that, and if you look at the history of the driver you'll see a lot of
optimizations that have been added over time. Ideally an optimization improves both:
performance and code readability
Code readability is important for maintainability and weighs higher for me than a minor
performance optimization in a code path that is very rarely used.

> Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded
> in the spirit that this is exactly what the guys in Chernobyl did while maintaining the
> reactor that malfunctioned: they did not dare to question the authority telling them that
> everything is alright.
>
> Have a nice evening, and please do not take these words as a breach of the Code or a
> personal attack. I believe we are on the same side, and that is making this driver better.
>
> The Linux kernel developer community was my last hope that this human race has a force
> to improve the mankind and make it worth surviving.
>
> But sometimes it is more honourable to go down with the ship and preserve the honour.
>
> Best regards,
> Mirsad Todorovac

Next message: kernel test robot: "drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c:957:12: warning: stack frame size (2064) exceeds limit (2048) in 'vcn_v4_0_5_start'"
Previous message: Thomas Weißschuh: "[PATCH] selftests/nolibc: add script to run testsuite"
In reply to: Mirsad Todorovac: "Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention"
Next in thread: Mirsad Todorovac: "Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]