Re: [PATCH net-next 10/11] net: macb: use context swapping in .set_ringparam()
From: Théo Lebrun
Date: Fri Apr 03 2026 - 05:10:05 EST
On Thu Apr 2, 2026 at 6:31 PM CEST, Théo Lebrun wrote:
> On Thu Apr 2, 2026 at 1:29 PM CEST, Nicolai Buchwitz wrote:
>> On 1.4.2026 18:39, Théo Lebrun wrote:
>>> ethtool_ops.set_ringparam() is implemented using the primitive close /
>>> update ring size / reopen sequence. Under memory pressure this does not
>>> fly: we free our buffers at close and cannot reallocate new ones at
>>> open. Also, it triggers a slow PHY reinit.
>>>
>>> Instead, exploit the new context mechanism and improve our sequence to:
>>> - allocate a new context (including buffers) first
>>> - if it fails, early return without any impact to the interface
>>> - stop interface
>>> - update global state (bp, netdev, etc)
>>> - pass buffer pointers to the hardware
>>> - start interface
>>> - free old context.
>>>
>>> The HW disable sequence is inspired by macb_reset_hw() but avoids
>>> (1) setting NCR bit CLRSTAT and (2) clearing register PBUFRXCUT.
>>>
>>> The HW re-enable sequence is inspired by macb_mac_link_up(), skipping
>>> over register writes which would be redundant (because values have not
>>> changed).
>>>
>>> The generic context swapping parts are isolated into helper functions
>>> macb_context_swap_start|end(), reusable by other operations
>>> (change_mtu, set_channels, etc).
>>>
>>> Signed-off-by: Théo Lebrun <theo.lebrun@xxxxxxxxxxx>
>>> ---
>>> drivers/net/ethernet/cadence/macb_main.c | 89 +++++++++++++++++++++++++++++---
>>> 1 file changed, 82 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>> index 42b19b969f3e..543356554c11 100644
>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>> @@ -2905,6 +2905,76 @@ static struct macb_context *macb_context_alloc(struct macb *bp,
>>> return ctx;
>>> }
>>>
>>> +static void macb_context_swap_start(struct macb *bp)
>>> +{
>>> + struct macb_queue *queue;
>>> + unsigned int q;
>>> + u32 ctrl;
>>> +
>>> + /* Disable software Tx, disable HW Tx/Rx and disable NAPI. */
>>> +
>>> + netif_tx_disable(bp->netdev);
>>> +
>>> + ctrl = macb_readl(bp, NCR);
>>> + macb_writel(bp, NCR, ctrl & ~(MACB_BIT(RE) | MACB_BIT(TE)));
>>> +
>>> + macb_writel(bp, TSR, -1);
>>> + macb_writel(bp, RSR, -1);
>>> +
>>> + for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>>> + queue_writel(queue, IDR, -1);
>>> + queue_readl(queue, ISR);
>>> + if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
>>> + queue_writel(queue, ISR, -1);
>>> + }
>>> +
>>> + for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>>> + napi_disable(&queue->napi_rx);
>>> + napi_disable(&queue->napi_tx);
>>> + }
>>
>> tx_error_task, hresp_err_bh_work, and tx_lpi_work all dereference
>> bp->ctx and could race with the pointer swap in swap_end.
>> macb_close() cancels at least tx_lpi_work here. Should these be
>> flushed too?
>
> This is a large topic! While trying to find a solution as part of this
> series I noticed many race conditions. With this context series we
> worsen some (by introducing possible bp->ctx NULL pointer dereferences).
>
> Let's start by identifying all schedule-able contexts involved:
> - #1 any request from userspace, too many callbacks to list
> - #2 NAPI softirq or kthread context, macb_{rx,tx}_poll()
> - #3 bp->hresp_err_bh_work / macb_hresp_error_task()
> - #4 bp->tx_lpi_work / macb_tx_lpi_work_fn()
> - #5 queue->tx_error_task / macb_tx_error_task()
> - #6 IRQ context, macb_interrupt()
>
> Some race conditions:
>
> - #1 macb_close() doesn't cancel & wait for #3 hresp_err_bh_work.
> They could race, especially as #3 doesn't grab bp->lock. One race
> example: #3 (triggered by a HRESP bus error) restarts the interface
> after it has been closed and buffers freed. RBQP/TBQP are not reset
> so MACB would cause memory corruption on Rx and transmit stale
> memory contents on Tx.
>
> - #1 macb_close() doesn't cancel & wait for #5 tx_error_task. #5 does
> grab bp->lock but that doesn't make it much safer. One race example:
> same as above, restart of the interface with ghost ring buffers.
>
> - #3 hresp_err_bh_work could collide with anything as it does no
> locking, especially #1 (xmit for example) or #2 (NAPI). It is less
> likely to collide with #6 IRQ because it starts by disabling
> interrupts, but the IRQ might already have fired, with
> macb_interrupt() running in parallel with macb_hresp_error_task().
>
> - #5 queue->tx_error_task writes to the Tx head/tail inside bp->lock.
> #1 macb_start_xmit() modifies those too, but inside
> queue->tx_ptr_lock. Oops. There are probably other places modifying
> head/tail or other Tx queue state without queue->tx_ptr_lock.
>
> - #5 macb_tx_error_task() tries to gently disable Tx but if it
> times out then it uses the global switch (TE field in the NCR
> register). That sounds racy with #2 NAPI, which doesn't grab
> bp->lock and would probably break if the interface is shut down
> under its feet.
>
> I don't see much more. To fix all that, someone ought to exhaustively
> go through all tasks (#1-6 above) and all shared data and reason about
> them one by one. Who will be that someone? ;-) But that sounds pretty
> unrelated to the series at hand, no?
>
> I'd agree that some locking of bp->lock around the swap operation would
> improve the series, and I'll add that in V2 for sure!
After some sleep, I feel like my message was a bit rough. To clarify
what I plan for V2:
- grab bp->lock on swap to protect us against some of #1 userspace and
  all of #6 IRQ.
- disabling #2 NAPI on swap is already done.
- disable all three BH work items on swap.
That will not fix everything listed above.

On top, we should:
- check/revise our locking strategy for almost all codepaths,
- check that all BH work items are disabled and flushed in the right
  codepaths,
- in many bp->lock critical sections, early exit if !bp->ctx.
>
>>
>>> +}
>>> +
>>> +static void macb_context_swap_end(struct macb *bp,
>>> + struct macb_context *new_ctx)
>>> +{
>>> + struct macb_context *old_ctx;
>>> + struct macb_queue *queue;
>>> + unsigned int q;
>>> + u32 ctrl;
>>> +
>>> + /* Swap contexts & give buffer pointers to HW. */
>>> +
>>> + old_ctx = bp->ctx;
>>> + bp->ctx = new_ctx;
>>> + macb_init_buffers(bp);
>>> +
>>> + /* Start NAPI, HW Tx/Rx and software Tx. */
>>> +
>>> + for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>>> + napi_enable(&queue->napi_rx);
>>> + napi_enable(&queue->napi_tx);
>>> + }
>>> +
>>> + if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
>>> + for (q = 0, queue = bp->queues; q < bp->num_queues;
>>> + ++q, ++queue) {
>>> + queue_writel(queue, IER,
>>> + bp->rx_intr_mask |
>>> + MACB_TX_INT_FLAGS |
>>> + MACB_BIT(HRESP));
>>> + }
>>> + }
>>> +
>>> + ctrl = macb_readl(bp, NCR);
>>> + macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));
>>> +
>>> + netif_tx_start_all_queues(bp->netdev);
>>> +
>>> + /* Free old context. */
>>> +
>>> + macb_free_consistent(old_ctx);
>>
>> 1. kfree(old_ctx) is missing. The context struct itself leaks on
>> every swap.
>
> Agreed.
>
>> 2. macb_close() calls netdev_tx_reset_queue() for each queue.
>> Shouldn't the swap do the same? BQL accounting will be stale
>> after switching to a fresh context.
>
> I explicitly left that out as I thought DQL would benefit from keeping
> past knowledge of the traffic. But indeed, as we start afresh from a
> new set of buffers, we should reset DQL. fbnic, pointed out as a good
> example by Jakub recently, does that.
>
>>
>> 3. macb_configure_dma() is not called after the swap. For
>> set_ringparam this is probably fine since rx_buffer_size
>> does not change, but this becomes a problem in patch 11.
>
> Indeed, I had missed that it takes bp->ctx->rx_buffer_size as a
> parameter. Will fix.
Thanks,
--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com