Re: [PATCH net-next v2 05/15] mm: page_frag: use initial zero offset for page_frag_alloc_align()

From: Yunsheng Lin
Date: Wed Apr 17 2024 - 09:18:06 EST


On 2024/4/16 23:51, Alexander H Duyck wrote:
> On Tue, 2024-04-16 at 21:11 +0800, Yunsheng Lin wrote:
>> On 2024/4/16 7:55, Alexander H Duyck wrote:
>>> On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote:
>>>> We are about to use the page_frag_alloc_*() API to not just
>>>> allocate memory for skb->data, but also to do the memory
>>>> allocation for skb frags. Currently the implementation of
>>>> page_frag in the mm subsystem runs the offset as a countdown
>>>> rather than a count-up value; there may be several advantages
>>>> to that as mentioned in [1], but it may have some disadvantages,
>>>> for example, it may disable skb frag coalescing and more correct
>>>> cache prefetching.
>>>>
>>>> We have a trade-off to make in order to have a unified
>>>> implementation and API for page_frag, so use an initial zero
>>>> offset in this patch, and the following patch will try to make
>>>> some optimizations to avoid the disadvantages as much as
>>>> possible.
>>>>
>>>> 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@xxxxxxxxx/
>>>>
>>>> CC: Alexander Duyck <alexander.duyck@xxxxxxxxx>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
>>>> ---
>>>> mm/page_frag_cache.c | 31 ++++++++++++++-----------------
>>>> 1 file changed, 14 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
>>>> index 64993b5d1243..dc864ee09536 100644
>>>> --- a/mm/page_frag_cache.c
>>>> +++ b/mm/page_frag_cache.c
>>>> @@ -65,9 +65,8 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
>>>> unsigned int fragsz, gfp_t gfp_mask,
>>>> unsigned int align_mask)
>>>> {
>>>> - unsigned int size = PAGE_SIZE;
>>>> + unsigned int size, offset;
>>>> struct page *page;
>>>> - int offset;
>>>>
>>>> if (unlikely(!nc->va)) {
>>>> refill:
>>>> @@ -75,10 +74,6 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
>>>> if (!page)
>>>> return NULL;
>>>>
>>>> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> - /* if size can vary use size else just use PAGE_SIZE */
>>>> - size = nc->size;
>>>> -#endif
>>>> /* Even if we own the page, we do not use atomic_set().
>>>> * This would break get_page_unless_zero() users.
>>>> */
>>>> @@ -87,11 +82,18 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
>>>> /* reset page count bias and offset to start of new frag */
>>>> nc->pfmemalloc = page_is_pfmemalloc(page);
>>>> nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>>>> - nc->offset = size;
>>>> + nc->offset = 0;
>>>> }
>>>>
>>>> - offset = nc->offset - fragsz;
>>>> - if (unlikely(offset < 0)) {
>>>> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> + /* if size can vary use size else just use PAGE_SIZE */
>>>> + size = nc->size;
>>>> +#else
>>>> + size = PAGE_SIZE;
>>>> +#endif
>>>> +
>>>> + offset = ALIGN(nc->offset, -align_mask);
>>>
>>> I am not sure if using -align_mask here with the ALIGN macro is really
>>> to your benefit. I would be curious what the compiler is generating.
>>>
>>> Again, I think you would be much better off with:
>>> offset = __ALIGN_KERNEL_MASK(nc->offset, ~align_mask);
>>>
>>> That will save you a number of conversions as the use of the ALIGN
>>> macro gives you:
>>> offset = (nc->offset + (-align_mask - 1)) & ~(-align_mask - 1);
>>>
>>> whereas what I am suggesting gives you:
>>> offset = (nc->offset + ~align_mask) & ~(~align_mask);
>>>
>>> My main concern is that I am not sure the compiler will optimize around
>>> the combination of bit operations and arithmetic operations. It seems
>>> much cleaner to me to stick to the bitwise operations for the alignment
>>> than to force this into the vhost approach which requires a power of 2
>>> aligned mask.
>>
>> My argument about the above is in [1]. But since you do not seem
>> to have worked through the next patch yet, I might just do it as
>> you suggested in the next version so that I don't have to repeat
>> my argument again :(
>>
>> 1. https://lore.kernel.org/all/df826acf-8867-7eb6-e7f0-962c106bc28b@xxxxxxxxxx/
>
> Sorry, I didn't have time to go digging through the mailing list to
> review all the patches from the last set. I was only Cced on a few of

I thought adding 'CC: Alexander Duyck <alexander.duyck@xxxxxxxxx>' in
the cover letter would make git send-email send all the patches to
that address, but apparently it did not. And it seems I only added
that in the RFC and v1, but forgot to add it in the newest v2
version :(

> them as I recall. As you know I have the fbnic patches I also have been
> trying to get pushed out so that was my primary focus the last couple
> weeks.

Understood.

>
> That said, this just goes into my earlier complaints. You are now
> optimizing for the non-aligned paths. There are few callers that are
> asking for this to provide non-aligned segments. In most cases they are

I suppose that 'optimizing for the non-aligned paths' refers to patch 6
doing the data alignment in an inline helper for the aligned API
callers, and avoiding the data alignment for the non-aligned API
callers?

For the existing users, it seems there are more callers of the
non-aligned API than of the aligned API:

Referenced in 13 files:
https://elixir.bootlin.com/linux/v6.7-rc8/A/ident/napi_alloc_frag

Referenced in 2 files:
https://elixir.bootlin.com/linux/v6.7-rc8/A/ident/napi_alloc_frag_align

Referenced in 15 files:
https://elixir.bootlin.com/linux/v6.7-rc8/A/ident/netdev_alloc_frag

No references found in the database
https://elixir.bootlin.com/linux/v6.7-rc8/A/ident/netdev_alloc_frag_align

Referenced in 6 files:
https://elixir.bootlin.com/linux/v6.7-rc8/A/ident/page_frag_alloc

Referenced in 3 files:
https://elixir.bootlin.com/linux/v6.7-rc8/A/ident/page_frag_alloc_align

And the new users we are adding in patch 13 mostly ask for non-aligned
segments.

I would argue that it is not about optimizing for the non-aligned
paths; it is about avoiding the data alignment operation for the
non-aligned API entirely, something like the sketch below.
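
A rough sketch of the idea (not the exact patch 6 code; it assumes
page_frag_alloc() is the common core that never does any aligning
itself):

	/* aligned API: fix up the cached offset in an inline helper so
	 * that only aligned callers pay for the alignment operation
	 */
	static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
						    unsigned int fragsz,
						    gfp_t gfp_mask,
						    unsigned int align_mask)
	{
		nc->offset = ALIGN(nc->offset, -align_mask);

		return page_frag_alloc(nc, fragsz, gfp_mask);
	}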

> at least cache aligned. Specifically the __netdev_alloc_frag_align and
> __napi_alloc_frag_align are aligning things at a minimum to
> SMP_CACHE_BYTES by aligning the fragsz argument using SKB_DATA_ALIGN.

It seems the above only does the aligning operation for fragsz; most
callers end up calling __netdev_alloc_frag_align() and
__napi_alloc_frag_align() with align_mask being ~0u.
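
For reference, the current wrappers look roughly like below (quoting
the shape from memory, see include/linux/skbuff.h and
net/core/skbuff.c):

	static inline void *napi_alloc_frag(unsigned int fragsz)
	{
		return __napi_alloc_frag_align(fragsz, ~0u);
	}

	void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
	{
		struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

		/* fragsz is rounded up to SMP_CACHE_BYTES here,
		 * independent of whatever align_mask the caller passed
		 */
		fragsz = SKB_DATA_ALIGN(fragsz);

		return __page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC,
					       align_mask);
	}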

> Perhaps it would be better to actually incorporate that alignment
> guarantee into the calls themselves by doing an &= with the align_mask
> request for those two functions to make this more transparent.

Did you mean doing something like below for fragsz too?
fragsz = __ALIGN_KERNEL_MASK(fragsz, ~align_mask);
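
In __napi_alloc_frag_align(), for example, that would be something
like below (just a sketch of my understanding, untested):

	-	fragsz = SKB_DATA_ALIGN(fragsz);
	+	fragsz = __ALIGN_KERNEL_MASK(fragsz, ~align_mask);

so that fragsz and offset are guaranteed to be multiples of the same
alignment.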

>
>>>
>>> Also the old code was aligning on the combination of offset AND fragsz.
>>> This new logic is aligning on offset only. Do we run the risk of
>>> overwriting blocks of neighbouring fragments if two users of
>>> napi_alloc_frag_align end up passing arguments that have different
>>> alignment values?
>>
>> I am not sure I understand the question here.
>> As I understand it, both the old code and the new code align on
>> the offset, and both might have space reserved before the offset
>> due to aligning. The memory returned to the caller is in the range
>> of [offset, offset + fragsz). Am I missing something obvious here?
>
> My main concern is that by aligning offset - fragsz by the alignment
> mask we were taking care of all our variables immediately ourselves. If
> we didn't provide a correct value it was all traceable to one call as
> the assumption was that fragsz would be a multiple of the alignment
> value.
>
> With your change the alignment is done in the following call. So now it
> splits up the alignment of fragsz from the alignment of the offset. As
> such we will probably need to add additional overhead to guarantee
> fragsz is a multiple of the alignment.

I have not yet thought through how the above will affect the API
callers. If different callers pass different alignments for the same
'page_frag_cache' instance, does that cause some cache bouncing, or a
DMA issue if the memory is used for DMA?

I suppose it depends on what alignment semantics we are providing
here:
1. Ensure alignment for both offset and fragsz.
2. Ensure alignment for offset only.
3. Ensure alignment for fragsz only.

It seems you are in favour of option 1?
I suppose it is a balance between performance and API flexibility.
Would it be possible to enforce that callers use the same alignment
for the same 'page_frag_cache' instance, and give a warning if they do
not, as in the sketch below? Then we would only need to ensure
alignment for either offset or fragsz, but not both of them.
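
A rough sketch of that check (the 'align_mask' field here is
hypothetical and not in the current struct):

	#ifdef CONFIG_DEBUG_VM
		/* record the first align_mask used for this cache and
		 * warn once if a later caller passes a different one
		 */
		if (!nc->align_mask)
			nc->align_mask = align_mask;
		else
			WARN_ON_ONCE(nc->align_mask != align_mask);
	#endif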

I am not sure there is a strong use case for supporting alignment of
both offset and fragsz; we could create a new API for it if such a use
case shows up?
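
If it does show up, such an API could look something like below
(purely hypothetical naming, mirroring the existing wrapper shape):

	/* hypothetical: ensure alignment for both offset and fragsz */
	static inline void *page_frag_alloc_align_both(struct page_frag_cache *nc,
						       unsigned int fragsz,
						       gfp_t gfp_mask,
						       unsigned int align)
	{
		WARN_ON_ONCE(!is_power_of_2(align));
		fragsz = ALIGN(fragsz, align);

		return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
	}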