Re: [PATCH] net: Optimize flush calculation in inet_gro_receive()
From: Helge Deller
Date: Tue Apr 14 2026 - 03:49:28 EST
Hi Kuniyuki and David,
On 4/11/26 14:09, David Laight wrote:
On Sat, 11 Apr 2026 05:19:35 +0000
Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> wrote:
From: Helge Deller <deller@xxxxxxxxxx>
Date: Fri, 10 Apr 2026 16:43:54 +0200
For the calculation of the flush variable, use the get_unaligned_xxx() helpers
to access only the relevant fields of the IP header.
Note: Since I don't know the network details, I'm not sure if "& ~IP_DF"
(& ~0x4000) is correct, or if "& IP_OFFSET" (& 0x1FFF) should be used instead
~IP_DF is correct (MF bit needs to be checked), see
Ok, Thanks for checking!
commit db8caf3dbc77599dc90f4ea0a803cd1d97116f30
Author: Eric Dumazet <edumazet@xxxxxxxxxx>
Date: Fri May 31 11:18:10 2013
gro: should aggregate frames without DF
(which I believe would be more correct). Instead of possibly breaking things I
left it as is, but maybe some expert can check?
Signed-off-by: Helge Deller <deller@xxxxxx>
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c7731e300a44..58cad2687c2c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1479,7 +1479,7 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
struct sk_buff *p;
unsigned int hlen;
unsigned int off;
- int flush = 1;
+ u16 flush = 1;
int proto;
off = skb_gro_offset(skb);
@@ -1504,7 +1504,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
goto out;
NAPI_GRO_CB(skb)->proto = proto;
- flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (ntohl(*(__be32 *)&iph->id) & ~IP_DF));
+ flush = (get_unaligned_be16(&iph->tot_len) ^ skb_gro_len(skb)) |
+ (get_unaligned_be16(&iph->frag_off) & ~IP_DF);
I think here we intentionally use 32-bit loads:
commit
Author: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Date: Tue May 26 18:50:29 2009
ipv4: Use 32-bit loads for ID and length in GRO
I see, this patch is exactly the opposite of mine.
Before your patch, a 32-bit load + bswap is used, while
after the change it is a 16-bit load + a rotate by 8.
I feel the 4-byte-aligned load + bswap is faster than a
misaligned access + a rotate by 8 (is this internally
optimised, like xchg, for a single word size?).
Do you have some numbers?
No, I don't have any.
In the end it's very platform-specific anyway.
Check on some architecture that doesn't support misaligned loads.
Actually, aren't the accesses aligned?
The reason I touched this code at all is that I hit unaligned
accesses in that function on parisc.
But those unaligned accesses were triggered by parisc-specific
inline assembly, and not by this code here.
So, I believe the accesses here are aligned, and the get_unaligned_XX()
helpers make the code more readable, but are NOT necessary.
That said, I suggest dropping my patch.
It makes the code more readable, but probably will not improve speed.
Thanks for your help!
Helge
Also on ones without a 32-bit byteswap instruction (some do have
byte-swapping memory reads).
Also you may not want to change 'flush' to u16.
On non-x86 it may force the compiler to add extra masking instructions.
David
Before:
flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb))
mov edx,DWORD PTR [rcx]
bswap edx
return skb->len - NAPI_GRO_CB(skb)->data_offset;
mov r8d,DWORD PTR [rsi+0x38]
mov r9d,DWORD PTR [rsi+0x70]
sub r9d,r8d
xor r9d,edx
| (ntohl(*(__be32 *)&iph->id) & ~IP_DF));
mov ebp,0xffbfffff
and ebp,DWORD PTR [rcx+0x4]
bswap ebp
or ebp,r9d
After:
flush = (get_unaligned_be16(&iph->tot_len) ^ skb_gro_len(skb))
movzx edx,WORD PTR [rcx+0x2]
rol dx,0x8
return skb->len - NAPI_GRO_CB(skb)->data_offset;
mov r8d,DWORD PTR [rsi+0x38]
mov r9d,DWORD PTR [rsi+0x70]
sub r9d,r8d
xor r9d,edx
| (get_unaligned_be16(&iph->frag_off) & ~IP_DF);
movzx ebp,WORD PTR [rcx+0x6]
and ebp,0xffffffbf
rol bp,0x8
or ebp,r9d