Re: [PATCH net v2 2/2] net: ethernet: oa_tc6: fix tx skb race condition between reference pointers

From: Paolo Abeni
Date: Wed Nov 27 2024 - 07:11:23 EST


On 11/27/24 11:49, Parthiban.Veerasooran@xxxxxxxxxxxxx wrote:
> On 26/11/24 4:11 pm, Paolo Abeni wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> On 11/22/24 11:21, Parthiban Veerasooran wrote:
>>> There are two skb pointers to manage tx skb's enqueued from n/w stack.
>>> waiting_tx_skb pointer points to the tx skb which needs to be processed
>>> and ongoing_tx_skb pointer points to the tx skb which is being processed.
>>>
>>> SPI thread prepares the tx data chunks from the tx skb pointed by the
>>> ongoing_tx_skb pointer. When the tx skb pointed by the ongoing_tx_skb is
>>> processed, the tx skb pointed by the waiting_tx_skb is assigned to
>>> ongoing_tx_skb and the waiting_tx_skb pointer is assigned with NULL.
>>> Whenever there is a new tx skb from n/w stack, it will be assigned to
>>> waiting_tx_skb pointer if it is NULL. Enqueuing and processing of a tx skb
>>> handled in two different threads.
>>>
>>> Consider a scenario where the SPI thread processed an ongoing_tx_skb and
>>> it moves next tx skb from waiting_tx_skb pointer to ongoing_tx_skb pointer
>>> without doing any NULL check. At this time, if the waiting_tx_skb pointer
>>> is NULL then ongoing_tx_skb pointer is also assigned with NULL. After
>>> that, if a new tx skb is assigned to waiting_tx_skb pointer by the n/w
>>> stack and there is a chance to overwrite the tx skb pointer with NULL in
>>> the SPI thread. Finally one of the tx skb will be left as unhandled,
>>> resulting packet missing and memory leak.
>>> To overcome the above issue, protect the moving of tx skb reference from
>>> waiting_tx_skb pointer to ongoing_tx_skb pointer so that the other thread
>>> can't access the waiting_tx_skb pointer until the current thread completes
>>> moving the tx skb reference safely.
>>
>> A mutex looks overkill. Why don't you use a spinlock? why locking only
>> one side (the writer) would be enough?
> Ah my bad, missed to protect tc6->waiting_tx_skb = skb. So that it will
> become like below,
>
> mutex_lock(&tc6->tx_skb_lock);
> tc6->waiting_tx_skb = skb;
> mutex_unlock(&tc6->tx_skb_lock);
>
> As both are not called from atomic context and they are allowed to
> sleep, I used mutex rather than spinlock.
>>
>> Could you please report the exact sequence of events in a time diagram
>> leading to the bug, something alike the following?
>>
>> CPU0 CPU1
>> oa_tc6_start_xmit
>> ...
>> oa_tc6_spi_thread_handler
>> ...
> Good case:
> ----------
> Consider waiting_tx_skb is NULL.
>
> Thread1 (oa_tc6_start_xmit) Thread2 (oa_tc6_spi_thread_handler)
> --------------------------- -----------------------------------
> - if waiting_tx_skb is NULL
> - waiting_tx_skb = skb
> - if ongoing_tx_skb is NULL
> - ongoing_tx_skb = waiting_tx_skb
> - waiting_tx_skb = NULL
> ...
> - ongoing_tx_skb = NULL
> - if waiting_tx_skb is NULL
> - waiting_tx_skb = skb
> - if ongoing_tx_skb is NULL
> - ongoing_tx_skb = waiting_tx_skb
> - waiting_tx_skb = NULL
> ...
> - ongoing_tx_skb = NULL
> ....
>
> Bad case:
> ---------
> Consider waiting_tx_skb is NULL.
>
> Thread1 (oa_tc6_start_xmit) Thread2 (oa_tc6_spi_thread_handler)
> --------------------------- -----------------------------------
> - if waiting_tx_skb is NULL
> - waiting_tx_skb = skb
> - if ongoing_tx_skb is NULL

AFAICS, if 'waiting_tx_skb == NULL and Thread2 is in
oa_tc6_spi_thread_handler()/oa_tc6_prepare_spi_tx_buf_for_tx_skbs()
then ongoing_tx_skb can not be NULL, due to the previous check in:

https://elixir.bootlin.com/linux/v6.12/source/drivers/net/ethernet/oa_tc6.c#L1064

This looks like a single reader/single write scenarios that does not
need any lock to ensure consistency.

Do you observe any memory leak in real life scenarios?

BTW it looks like both oa_tc6_start_xmit and oa_tc6_spi_thread_handler
are possibly lacking memory barriers to avoid missing wake-ups.

Cheers,

Paolo