Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls

From: Mirsad Todorovac
Date: Mon Oct 30 2023 - 23:35:38 EST

Next message: Florian Fainelli: "Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size"
Previous message: Chen-Yu Tsai: "Re: [PATCH 1/4] drm/panfrost: Implement ability to turn on/off GPU clocks in suspend"
In reply to: Andrew Lunn: "Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls"
Next in thread: Andrew Lunn: "Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 10/31/23 02:21, Andrew Lunn wrote:

I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
in each single one of the drivers could amount to some degrading of overall performance and
latency in a multicore system.

For optimisations, we like to see benchmark results which show some
improvements. Do you have any numbers?

Hi, Andrew,

Thank you for your interest in RTL NIC driver optimisations.

My knowledge about the timing costs of synchronisation is mostly theoretical.

By the table below, the cost of LOCK CHPXCHG m,616/32/64 is 52 cycles (in case the
memory location is already in the L3 cache I suppose, cache miss can makes things worse,
but we are hitting the same lock repeatedly).

So, we have (in the best case) 52 clocks for lock, 52 for unlock (for the uncontentended
case).

This means 104 cycle x 130 sequential lock+unlock pairs = 13520 clk ~ 3.38 us on a 4 GHz
machine (5.633 us on a 2.4 GHz CPU) while the multicore system does nothing but locking
and unclocking after therecovery from the loss of PHY.

We are talking about the RTl8411b.

The other case it the rather new 8125/8125b.

static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
{
rtl_pcie_state_l2l3_disable(tp);

RTL_W16(tp, 0x382, 0x221b);
RTL_W8(tp, 0x4500, 0);
RTL_W16(tp, 0x4800, 0);

/* disable UPS */
r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);

RTL_W8(tp, Config1, RTL_R8(tp, Config1) & ~0x10);

r8168_mac_ocp_write(tp, 0xc140, 0xffff);
r8168_mac_ocp_write(tp, 0xc142, 0xffff);

r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);

/* disable new tx descriptor format */
r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);

if (tp->mac_version == RTL_GIGA_MAC_VER_63)
r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
else
r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);

if (tp->mac_version == RTL_GIGA_MAC_VER_63)
r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
else
r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);

r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);

r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
udelay(1);
r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);

r8168_mac_ocp_write(tp, 0xe098, 0xc302);

rtl_loop_wait_low(tp, &rtl_mac_ocp_e00e_cond, 1000, 10);

if (tp->mac_version == RTL_GIGA_MAC_VER_63)
rtl8125b_config_eee_mac(tp);
else
rtl8125a_config_eee_mac(tp);

rtl_disable_rxdvgate(tp);
}

There are 22 calls to mac ocp write/modify, each having a raw_spin_lock_irqsave()/
raw_spin_unlock_irqrestore() pair.

The minimal timing cost is 22x104 = 2288 cycles or a total of 0.572 us on a 4 GHz CPU,
preventing all cores from the memory access. (0.953 us on a 2.4 GHz system).

This does happen only at the startup apparently, but also at each rtl_reset_work(tp)
and rtl8169_up(tp), rtl_open() at driver load, and rtl8169_runtime_resume() called by
the rtl8169_resume().

The revised version does only 2 pairs of raw_spin_lock_irqsave()/
raw_sping_unlock_irqrestore(), but I admit it might be harder to read and maintain.

My knowledge of the Linux kernel network stack isn't that deep, and I can't tell how
often rtl8169_resume() is called in real life and how to reproduce and benchmark that
call.

This is a sort of a "loophole optimisation", I don't have the numbers on the impact
on the entire Linux on i.e. 16 core/32 threads Ryzen 9 7950X CPU.

Probably things would be worse on a 64 core CPU, because the 5.633 us and 0.953 us
would have to be multiplied by the number of blocked cores, amounting to estimated
total CPU cost of 360 us and 61 us respectively. Per NIC reset.

However, I am relatively new to the r8169 driver, I came across while analysing
KCSAN reports since the beginning of August. So ATM I don't have any benchmarks
to confirm the theoretical findings.

I realise that maintainers have to cherry pick patches or the kernel of +35M lines
would become way too cluttered and unmaintainable.

NOTE: The numbers are only valid for a x86_64 system.

Regards,
Mirsad

SOURCE AND REFERENCES:

Synchronization (Source: [1])
===============================================
instruction operands ops latency
===============================================
LOCK ADD m,r 1 ~55
XADD m,r 4 10
LOCK XADD m,r 4 ~51
CMPXCHG m8,r8 5 15
LOCK CMPXCHG m8,r8 5 ~51
CMPXCHG m,r16/32/64 6 14
LOCK CMPXCHG m,r16/32/64 6 ~52
CMPXCHG8B m64 18 15
LOCK CMPXCHG8B m64 18 ~53
CMPXCHG16B m128 22 52
LOCK CMPXCHG16B m128 22 ~94
===============================================

Explanation of column headings:
Instruction:
Instruction name. cc means any condition code. For example, Jcc can be JB, JNE,
etc.

Operands:
i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit
mmx register, x = 128 bit xmm register, y = 256 bit ymm register, m = any memory
operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops:
Number of macro-operations issued from instruction decoder to schedulers. In-
structions with more than 2 macro-operations use microcode.
Latency:
This is the delay that the instruction generates in a dependency chain. The num-
bers are minimum values. Cache misses, misalignment, and exceptions may in-
crease the clock counts considerably. Floating point operands are presumed to be
normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the
delays. The latency listed does not include the memory operand where the listing
for register and memory operand are joined (r/m).

[1] TL;DR https://www.agner.org/optimize/instruction_tables.pdf
[2] TL;DR http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

Next message: Florian Fainelli: "Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size"
Previous message: Chen-Yu Tsai: "Re: [PATCH 1/4] drm/panfrost: Implement ability to turn on/off GPU clocks in suspend"
In reply to: Andrew Lunn: "Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls"
Next in thread: Andrew Lunn: "Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]