On Wed, Feb 12, 2025 at 1:34 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
[...]
Networking drivers with page_pool support may hand pages that
still have a DMA mapping over to the network stack, and reuse
those pages once the network stack is done with them and returns
them to the page_pool, to avoid the penalty of DMA
mapping/unmapping. With all the caching in the network stack,
some pages may be held there without being returned to the
page_pool soon enough, and when a VF disable causes the driver
to unbind, the page_pool does not stop the driver from doing its
unbinding work. Instead, the page_pool uses a workqueue to
periodically check whether pages have come back from the network
stack; if any have, it does the DMA-unmapping related cleanup
work.
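For reference, a simplified sketch of that deferred-release path as
it exists in the current page_pool code (the periodic-warning
bookkeeping is omitted here):

/* Simplified sketch: keep re-checking whether all inflight pages have
 * come back before the pool (and its DMA mappings) can be torn down.
 */
static void page_pool_release_retry(struct work_struct *wq)
{
	struct delayed_work *dwq = to_delayed_work(wq);
	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);

	/* page_pool_release() unmaps/frees whatever has been returned so
	 * far and reports how many pages are still outstanding.
	 */
	if (!page_pool_release(pool))
		return;

	/* Pages are still held somewhere in the stack: check again later. */
	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
}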
As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.
Currently there seem to be at least two cases where pages are
not released fast enough, causing DMA unmapping to happen after
the driver has already unbound:
1. ipv4 packet defragmentation timeout: this seems to cause
   delays of up to 30 secs.
2. skb_defer_free_flush(): this may cause an infinite delay if
   there is nothing to trigger net_rx_action().
In order to avoid calling DMA APIs to do DMA unmapping after the
driver has already unbound, and to avoid stalling the unloading
of the networking driver, use pre-allocated item blocks to
record inflight pages, including the ones handed over to the
network stack, so that the page_pool can do the DMA unmapping
for those pages when page_pool_destroy() is called. As the
pre-allocated item blocks need to be large enough to avoid
performance degradation, add an 'item_fast_empty' stat to
indicate when the pre-allocated item blocks are exhausted.
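To make the 'item_fast_empty' accounting concrete, here is a rough
sketch of the kind of fast/slow split being described (the helper
names are hypothetical, not taken from the patch):

/* Hypothetical sketch: hand out a tracking item from the
 * pre-allocated blocks, and fall back to a slow-path allocation,
 * bumping the new stat, when the pre-allocated items run out.
 */
static struct page_pool_item *page_pool_item_get(struct page_pool *pool)
{
	struct page_pool_item *item;

	item = page_pool_hold_item_get(pool);	/* fast: pool->hold_items */
	if (likely(item))
		return item;

	/* Pre-allocated items exhausted: record it so the block count
	 * can be sized to keep the fast path hot in steady state.
	 */
	alloc_stat_inc(pool, item_fast_empty);

	return page_pool_slow_item_get(pool);	/* slow: pool->slow_items */
}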
By using the 'struct page_pool_item' referenced by page->pp_item,
the page_pool is not only able to keep track of inflight pages in
order to do DMA unmapping if some pages are still held in the
networking stack when page_pool_destroy() is called, but the
networking stack is also able to find the page_pool owning a page
when returning pages back to the page_pool:
1. When a page is added to the page_pool, an item is taken from
   pool->hold_items, and item->pp_netmem and item->state are set
   accordingly in order to keep track of that page;
   pool->hold_items is refilled from pool->release_items when it
   is empty, and an item from pool->slow_items is used when the
   fast items run out.
2. When a page is released from the page_pool, the page_pool it
   belongs to can be found by masking off the lower bits of the
   'struct page_pool_item *item' pointer, as the 'struct
   page_pool_item_block' is stored at the beginning of the page
   containing the items. After clearing item->state, the item for
   the released page is added back to pool->release_items so that
   it can be reused for new pages, or simply freed if it came
   from pool->slow_items.
3. When page_pool_destroy() is called, item->state is used to
   tell whether a specific item is in use/DMA mapped by scanning
   all the item blocks in pool->item_blocks, and item->pp_netmem
   can then be used to do the DMA unmapping if the corresponding
   inflight page is DMA mapped (sketched below).
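A sketch of what the destroy-time scan in point 3 amounts to (names
approximate the patch rather than quote it; locking and the exact
state encoding are elided):

/* Sketch: walk every pre-allocated item block and DMA-unmap any page
 * still marked inflight, before the driver's DMA device goes away.
 * The pages themselves stay out in the stack until they are
 * eventually returned.
 */
static void page_pool_item_unmap_inflight(struct page_pool *pool)
{
	struct page_pool_item_block *block;

	list_for_each_entry(block, &pool->item_blocks, list) {
		unsigned int i;

		for (i = 0; i < ITEMS_PER_BLOCK; i++) {
			struct page_pool_item *item = &block->items[i];

			if (!page_pool_item_is_mapped(item))	/* item->state */
				continue;

			page_pool_dma_unmap_netmem(pool, item->pp_netmem);
		}
	}
}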
The overhead of tracking inflight pages is about 10ns~20ns,
which causes about 10% performance degradation for the
time_bench_page_pool03_slow() test case in [2].
Note that the devmem patchset seems to make the bug harder to
fix, and may make backporting harder too. As there is no actual
user of devmem yet and the fix for devmem is unclear for now,
this patch does not attempt to fix the devmem case.
1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@xxxxxxxxxx/T/
2. https://github.com/netoptimizer/prototype-kernel
CC: Robin Murphy <robin.murphy@xxxxxxx>
CC: Alexander Duyck <alexander.duyck@xxxxxxxxx>
CC: IOMMU <iommu@xxxxxxxxxxxxxxx>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
Tested-by: Yonglong Liu <liuyonglong@xxxxxxxxxx>
+
+/* The size of item_block is always PAGE_SIZE, so that the address of item_block
+ * for a specific item can be calculated using 'item & PAGE_MASK'
+ */
+struct page_pool_item_block {
+	struct page_pool *pp;
+	struct list_head list;
+	struct page_pool_item items[];
+};
+
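Given that layout, finding the owning pool from a returned page boils
down to something like this (a sketch, relying on the item blocks
being page-sized and page-aligned as the comment above says):

/* Sketch: because every item block occupies exactly one page-aligned
 * PAGE_SIZE block, the block header (and hence the owning page_pool)
 * can be recovered from any item pointer with a simple mask.
 */
static inline struct page_pool *page_pool_item_to_pp(struct page_pool_item *item)
{
	struct page_pool_item_block *block;

	block = (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
	return block->pp;
}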
I think this feedback was mentioned in earlier iterations of the series:
Can we not hold a struct list_head in the page_pool that keeps track
of inflight netmems that we need to dma-unmap on page_pool_destroy?
Why do we have to modify the pp entry in the struct page and struct
net_iov?
The decision to modify pp entry in struct page and struct net_iov is
making this patchset bigger and harder to review IMO.
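To make that suggestion concrete, something along the lines of the
rough sketch below (types and fields here are hypothetical; how a
returned netmem would find its tracking node again is the open
question):

/* Hypothetical sketch of the list-based alternative: track each
 * inflight netmem on a list owned by the page_pool itself, instead
 * of repurposing page->pp / net_iov->pp.
 */
struct pp_inflight_node {
	struct list_head list;
	netmem_ref netmem;
	dma_addr_t dma;
};

static void page_pool_unmap_inflight_list(struct page_pool *pool)
{
	struct pp_inflight_node *node;

	/* On page_pool_destroy(): unmap everything still outstanding. */
	list_for_each_entry(node, &pool->inflight, list)
		dma_unmap_page_attrs(pool->p.dev, node->dma,
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);
}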