Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local

From: Yunsheng Lin
Date: Fri Dec 06 2024 - 07:29:56 EST


On 2024/12/6 8:42, Jakub Kicinski wrote:
> On Thu, 5 Dec 2024 19:43:25 +0800 Yunsheng Lin wrote:
>> It depends on what the caller is trying to protect by calling
>> page_pool_disable_direct_recycling().
>>
>> It seems the use case for the only user of the API, the bnxt driver,
>> is about reusing the same NAPI for different page_pool instances.
>>
>> According to the steps in netdev_rx_queue.c:
>> 1. allocate new queue memory & create page_pool
>> 2. stop old rx queue.
>> 3. start new rx queue with new page_pool
>> 4. free old queue memory + destroy page_pool.
>>
>> The page_pool_disable_direct_recycling() is called in step 2, I am
>> not sure how napi_enable() & napi_disable() are called in the above
>> flow, but it seems there is no use-after-free problem this patch is
>> trying to fix for the above flow.
>>
>> There doesn't seem to be any concurrent access problem if napi->list_owner
>> is set to -1 before napi_disable() returns and the napi_enable() for the
>> new queue is called after page_pool_disable_direct_recycling() is called
>> in step 2.
>
> The fix presupposes there is a long delay between fetching of
> the NAPI pointer and its access. The concern is that NAPI gets
> restarted in step 3 after we already READ_ONCE()'ed the pointer,
> then we access it and judge it to be running on the same core.
> Then we put the page into the fast cache which will never get
> flushed.

It seems that napi_disable() is called before netdev_rx_queue_restart(),
and that napi_enable() and ____napi_schedule() are called after it, since
no napi API is called in the bnxt driver's implementation of
'netdev_queue_mgmt_ops'. Is that right?

If yes, napi->list_owner is set to -1 before step 1 and is only set to
a valid CPU again in step 6, as below:
1. napi_disable()
2. allocate new queue memory & create new page_pool.
3. stop old rx queue.
4. start new rx queue with new page_pool.
5. free old queue memory + destroy old page_pool.
6. napi_enable() & ____napi_schedule()
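
To make the ordering above concrete, here is a minimal userspace sketch
(not kernel code). All the model_* names are hypothetical, and it simply
encodes the assumptions stated above: list_owner is -1 by the time
napi_disable() returns, steps 2-5 call no napi API, and only step 6
makes list_owner valid again.

```c
#define MODEL_NO_OWNER (-1)
#define MODEL_NR_STEPS 6

/* Hypothetical stand-in for the one napi_struct field discussed here. */
struct model_napi {
	int list_owner;
};

/* Assumption: the poll has completed by the time napi_disable() returns. */
static void model_napi_disable(struct model_napi *napi)
{
	napi->list_owner = MODEL_NO_OWNER;
}

/* Assumption: scheduling on 'cpu' is what makes list_owner valid again. */
static void model_napi_enable_and_schedule(struct model_napi *napi, int cpu)
{
	napi->list_owner = cpu;
}

/*
 * Walk steps 1-6 and record list_owner after each step in owner_after[].
 * Steps 2-5 are the queue restart and are assumed not to touch the napi.
 */
static void model_restart(struct model_napi *napi, int cpu,
			  int owner_after[MODEL_NR_STEPS])
{
	for (int step = 1; step <= MODEL_NR_STEPS; step++) {
		if (step == 1)
			model_napi_disable(napi);
		else if (step == MODEL_NR_STEPS)
			model_napi_enable_and_schedule(napi, cpu);
		owner_after[step - 1] = napi->list_owner;
	}
}
```

Under these assumptions, owner_after[] holds -1 for steps 1 to 5 and the
scheduling CPU only for step 6. If any of the steps 2-5 did call a napi
API in a real driver, the model would no longer apply.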

And there are at least three flows involved here:
flow 1: calling napi_complete_done(), which sets napi->list_owner to -1.
flow 2: calling netdev_rx_queue_restart().
flow 3: calling skb_defer_free_flush() with the page belonging to the old
page_pool.

The only case I can think of where page_pool_napi_local() returns true
in flow 3 is when flow 1 and flow 3 run in softirq context on the same
CPU and flow 3 runs before flow 1.

It seems impossible for page_pool_napi_local() to return true between
step 1 and step 6: either flow 1 and flow 3 both run in softirq context
on the same CPU, in which case flow 3 always sees the updated
napi->list_owner, or napi->list_owner != the CPU running flow 3. That
also seems to be an implicit assumption for the case of napi being
scheduled across different CPUs.
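
As a rough illustration of the reasoning above, here is a simplified
userspace model of the locality check. It is only a sketch: the struct
and function names are hypothetical, the CPU id and softirq state are
passed in as plain arguments, and any other condition the real
page_pool_napi_local() may test is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fields discussed in this thread. */
struct model_napi {
	int list_owner;		/* -1 when no CPU owns the napi */
};

struct model_pool {
	const struct model_napi *napi;
};

/*
 * Simplified model: the pool is treated as napi-local only when the
 * caller is in softirq context and runs on the CPU recorded in
 * list_owner.
 */
static bool model_napi_local(const struct model_pool *pool, int this_cpu,
			     bool in_softirq)
{
	const struct model_napi *napi;

	if (!in_softirq)
		return false;

	napi = pool->napi;

	return napi != NULL && napi->list_owner == this_cpu;
}
```

In this model, a caller on CPU 2 gets true only while list_owner is
still 2, which corresponds to flow 3 running before flow 1 on the same
CPU. Once list_owner is -1, or is some other CPU, the check returns
false.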

Also, the old page_pool is destroyed in step 5. I am not sure it is
necessary to call page_pool_disable_direct_recycling() in step 3 if
page_pool_destroy() already does a synchronize_rcu() in step 5, before
napi is enabled.

If not, maybe I am missing something here. It would be good to be more
specific about the timing window in which page_pool_napi_local() returns
true for the old page_pool.
