Re:Re: [PATCH v2] mm: page_frag: Warn_on when frag_alloc size is bigger than PAGE_SIZE

From: 愚树
Date: Wed Jun 01 2022 - 08:48:48 EST


At 2022-06-01 01:28:59, "Alexander Duyck" <alexander.duyck@xxxxxxxxx> wrote:
>On Tue, May 31, 2022 at 8:47 AM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>>
>> On Tue, 31 May 2022 23:36:22 +0800 Chen Lin wrote:
>> > At 2022-05-31 22:14:12, "Jakub Kicinski" <kuba@xxxxxxxxxx> wrote:
>> > >On Tue, 31 May 2022 22:41:12 +0800 Chen Lin wrote:
>> > >> The sample code above cannot completely solve the current problem.
>> > >> For example, when fragsz is greater than PAGE_FRAG_CACHE_MAX_SIZE(32768),
>> > >> __page_frag_cache_refill will return a memory of only 32768 bytes, so
>> > >> should we continue to expand the PAGE_FRAG_CACHE_MAX_SIZE? Maybe more
>> > >> work needs to be done
>> > >
>> > >Right, but I can think of two drivers off the top of my head which will
>> > >allocate <=32k frags but none which will allocate more.
>> >
>> > In fact, it is rare to apply for more than one page, so is it necessary to
>> > change it to support?
>>
>> I don't really care if it's supported TBH, but I dislike adding
>> a branch to the fast path just to catch one or two esoteric bad
>> callers.
>>
>> Maybe you can wrap the check with some debug CONFIG_ so it won't
>> run on production builds?
>
>Also the example used here to define what is triggering the behavior
>is seriously flawed. The code itself is meant to allow for order0 page
>reuse, and the 32K page was just an optimization. So the assumption
>that you could request more than 4k is a bad assumption in the driver
>that is making this call.
>
>So I am in agreement with Kuba. We shouldn't be needing to add code in
>the fast path to tell users not to shoot themselves in the foot.
>
>We already have code in place in __netdev_alloc_skb that is calling
>the slab allocator if "len > SKB_WITH_OVERHEAD(PAGE_SIZE)". We could
>probably just add a DEBUG wrapped BUG_ON to capture those cases where
>a driver is making that mistake with __netdev_alloc_frag_align.

Thanks for the clear explanation.
The reality is that it is not easy to capture the drivers that make such mistake.
Because memory corruption usually leads to errors on other unrelated modules.
Not long ago, we have spent a lot of time and effort to locate a issue that
occasionally occurs in different kernel modules, and finally find the root cause is
the improper use of this netdev_alloc_frag interface in DPAA net driver from NXP.
It's a miserable process.

I also found that some net drivers in the latest Linux version have this issue.
Like:
1. netdev_alloc_frag "len" may larger than PAGE_SIZE
#elif (PAGE_SIZE >= E1000_RXBUFFER_4096)
adapter->rx_buffer_len = PAGE_SIZE;
#endif

static unsigned int e1000_frag_len(const struct e1000_adapter *a)
{
return SKB_DATA_ALIGN(a->rx_buffer_len + E1000_HEADROOM) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
}

static void *e1000_alloc_frag(const struct e1000_adapter *a)
{
unsigned int len = e1000_frag_len(a);
u8 *data = netdev_alloc_frag(len);
}
"./drivers/net/ethernet/intel/e1000/e1000_main.c" 5316 --38%--

2. netdev_alloc_frag "ring->frag_size" may larger than (4096 * 3)

#define MTK_MAX_LRO_RX_LENGTH (4096 * 3)
if (rx_flag == MTK_RX_FLAGS_HWLRO) {
rx_data_len = MTK_MAX_LRO_RX_LENGTH;
rx_dma_size = MTK_HW_LRO_DMA_SIZE;
} else {
rx_data_len = ETH_DATA_LEN;
rx_dma_size = MTK_DMA_SIZE;
}

ring->frag_size = mtk_max_frag_size(rx_data_len);

for (i = 0; i < rx_dma_size; i++) {
ring->data[i] = netdev_alloc_frag(ring->frag_size);
if (!ring->data[i])
return -ENOMEM;
}
"drivers/net/ethernet/mediatek/mtk_eth_soc.c" 3344 --50%--

I will try to fix these drivers later.

Even experienced driver engineers may use this netdev_alloc_frag
interface incorrectly.
So I thought it is best to provide some prompt information of usage
error inside the netdev_alloc_frag, or it's OK to report such mistake
during system running which may caused by fragsz varies(exceeded page size).

Now, as you and Kuba mentioned earlier, "do not add code in fast path".

Can we just add code to the relatively slow path to capture the mistake
before it lead to memory corruption?
Like:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6f211d..ac60a97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5580,6 +5580,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
offset = size - fragsz;
+ BUG_ON(offset < 0);
}

nc->pagecnt_bias--;


Additional, we may modify document to clearly indicate the limits of the
input parameter fragsz.
Like:
diff --git a/Documentation/vm/page_frags.rst b/Documentation/vm/page_frags.rst
index 7d6f938..61b2805 100644
--- a/Documentation/vm/page_frags.rst
+++ b/Documentation/vm/page_frags.rst
@@ -4,7 +4,7 @@
Page fragments
==============

-A page fragment is an arbitrary-length arbitrary-offset area of memory
+A page fragment is an arbitrary-length(must <= PAGE_SIZE) arbitrary-offset area of memory
which resides within a 0 or higher order compound page.

Thanks