Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
From: Joe Damato
Date: Thu Sep 26 2024 - 16:06:32 EST
On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
> A page_pool page may be freed from skb_defer_free_flush() in
> softirq context, which can cause a concurrent access problem for
> the pool->alloc cache due to the time window below, where both
> CPU0 and CPU1 may access the pool->alloc cache concurrently in
> page_pool_empty_alloc_cache_once() and
> page_pool_recycle_in_cache():
>
> CPU 0                                     CPU1
> page_pool_destroy()                       skb_defer_free_flush()
>          .                                         .
>          .                                page_pool_put_unrefed_page()
>          .                                         .
>          .                                allow_direct = page_pool_napi_local()
>          .                                         .
> page_pool_disable_direct_recycling()               .
>          .                                         .
> page_pool_empty_alloc_cache_once()        page_pool_recycle_in_cache()
>
> Use the RCU mechanism to avoid the above concurrent access problem.
>
> Note: the above was found during code review of how to fix
> the problem in [1].
>
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@xxxxxxxxxx/T/
>
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
> CC: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
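
For anyone following along, here is a rough sketch (my reading of the
idea only, not the actual patch) of how an RCU grace period in
page_pool_disable_direct_recycling() could close the window described
above, assuming the softirq recycling path runs inside an RCU
read-side critical section and using the existing pool->cpuid and
pool->p.napi fields:

void page_pool_disable_direct_recycling(struct page_pool *pool)
{
	/* Stop new callers from treating the pool as napi-local. */
	WRITE_ONCE(pool->cpuid, -1);

	/* ... existing NAPI state sanity checks omitted ... */
	WRITE_ONCE(pool->p.napi, NULL);

	/*
	 * Sketch: wait for any concurrent softirq user that has already
	 * sampled pool->p.napi in page_pool_napi_local() and may still
	 * call page_pool_recycle_in_cache(), so it is done before
	 * page_pool_empty_alloc_cache_once() touches pool->alloc.
	 */
	synchronize_rcu();
}

After synchronize_rcu() returns, no softirq-context caller can still be
in the window between the page_pool_napi_local() check and the recycle,
so emptying the alloc cache in page_pool_destroy() should be safe.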
Sorry for the noise, but I hit an assert in page_pool_unref_netmem
and I am trying to figure out whether it is related to what you all
are debugging. I thought it might be, but if not, my apologies.
Just in case it is, I've put the backtrace on GitHub [1]. I
triggered this while testing an RFC [2] I've been working on. Please
note that the RFC posted publicly does not currently apply cleanly to
net-next and has some bugs I've fixed in my v4. I had planned to
send the v4 early next week and mention the page pool issue I am
hitting.
After triggering the assert in [1], I tried applying the patches in
this series and retesting the RFC v4 I have queued locally. When I
did that, I hit a new assertion in page_pool_destroy [3].
There are a few possibilities:
1. I am hitting the same issue you are hitting
2. I am hitting a different issue caused by a bug I introduced
3. I am hitting a different page pool issue entirely
In cases 2 and 3, my apologies for the noise.
In case 1: if you think I am hitting the same issue you are
trying to solve, I can reliably reproduce this with my RFC v4 and
would be happy to test any patches meant to fix it.
[1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt
[2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@xxxxxxxxxx/#r
[3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt