Re: [PATCH v3] bpf, sockmap: keep sk_msg copy state in sync
From: bot+bpf-ci
Date: Wed May 20 2026 - 07:12:58 EST
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9590877b0714..223d78708486 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
[ ... ]
> @@ -2794,6 +2835,8 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
> {
> struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge;
> u32 new, i = 0, l = 0, space, copy = 0, offset = 0;
> + bool sge_copy = false, nsge_copy = false, nnsge_copy = false;
> + bool rsge_copy = false;
> u8 *raw, *to, *from;
> struct page *page;
>
> @@ -2866,6 +2909,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
> sk_msg_iter_var_prev(i);
> psge = sk_msg_elem(msg, i);
> rsge = sk_msg_elem_cpy(msg, i);
> + rsge_copy = sk_msg_elem_is_copy(msg, i);
>
> psge->length = start - offset;
> rsge.length -= psge->length;

Could copy + len overflow and cause an out-of-bounds write? In
bpf_msg_push_data(), len is a u32 passed from the bpf program with no
upper bound. The allocation size copy + len is computed in 32 bits and
can wrap:

    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
                       get_order(copy + len));

If copy + len wraps around, alloc_pages() allocates an undersized
buffer. Later, the destination pointer for shifting the tail data is
calculated as:

    to = raw + front + len;

Because len is large, front + len yields a large positive 32-bit offset
(zero-extended to 64 bits). The subsequent memcpy(to, from, back) then
performs an out-of-bounds write, at an offset controlled by the bpf
program, well past the allocated page.
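
To make the wraparound concrete, here is a minimal userspace sketch, not kernel code. Both function names are hypothetical, and __builtin_add_overflow() is the GCC/Clang builtin; in the kernel, check_add_overflow() from <linux/overflow.h> would play a similar role. Whether rejecting the request is the right fix for bpf_msg_push_data() is left to the patch author.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the unchecked computation: the sum is truncated to 32 bits. */
static uint32_t alloc_size_unchecked(uint32_t copy, uint32_t len)
{
    return copy + len;
}

/*
 * Checked variant (hypothetical helper): returns false when copy + len
 * does not fit in 32 bits, so the caller can reject the request instead
 * of allocating an undersized buffer.
 */
static bool alloc_size_checked(uint32_t copy, uint32_t len, uint32_t *size)
{
    return !__builtin_add_overflow(copy, len, size);
}
```

With copy = 64 and len = 0xFFFFFFF0, the unchecked sum truncates to 48 bytes, while the checked variant reports the overflow.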

Does bpf_msg_push_data() leak uninitialized kernel memory? The function
allocates a new page:

    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
                       get_order(copy + len));

Since the allocation does not pass __GFP_ZERO, the len bytes inserted
into the message are never zeroed or otherwise initialized. If the bpf
program does not overwrite these bytes and the message is then
transmitted over a socket, uninitialized kernel memory could leak to the
network or to userspace.
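
The concern can be modeled in userspace as follows. This is only a sketch under stated assumptions: push_gap_zeroed() is a hypothetical helper, malloc() stands in for alloc_pages() without __GFP_ZERO, and the memset() marks the initialization step that the review finds missing.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build a new buffer holding src[0..front), a gap of
 * len bytes, then src[front..src_len). The gap is zeroed explicitly so
 * that no stale heap contents are exposed to whoever reads the result.
 * Returns NULL on invalid arguments or allocation failure.
 */
static unsigned char *push_gap_zeroed(const unsigned char *src, size_t src_len,
                                      size_t front, size_t len)
{
    unsigned char *dst;

    if (front > src_len || len > SIZE_MAX - src_len)
        return NULL;

    dst = malloc(src_len + len);    /* contents are indeterminate here */
    if (!dst)
        return NULL;

    memcpy(dst, src, front);
    memset(dst + front, 0, len);    /* initialize the inserted bytes */
    memcpy(dst + front + len, src + front, src_len - front);
    return dst;
}
```

Without the memset(), the len bytes in the middle would hold whatever the allocator returned, which is the userspace analogue of the leak described above.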

Could the ring buffer state become corrupted when data is appended exactly
at the end of the message while the ring buffer is full?

If data is appended exactly at the end of the message (start == offset +
l), the loop finishes with i pointing to the unused slot msg->sg.end. If
the ring buffer is full (!space), the coalescing fallback logic reads:

    copy = msg->sg.data[i].length;

Because i points to an unused slot, its length is 0, so copy == 0 and
the fallback coalescing block is skipped. The function then
unconditionally calls:

    sk_msg_iter_next(msg, end);

which increments msg->sg.end. Since the ring buffer was already full,
advancing msg->sg.end makes it wrap around and equal msg->sg.start.
This could corrupt the ring buffer state and overwrite the first element
without calling put_page(), leaking its page.

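
The index arithmetic behind this scenario can be illustrated with a small userspace model. Everything here is hypothetical (struct ring, RING_SLOTS, and the helper names); it assumes a ring that keeps one spare slot so that start == end means "empty", and it is not the sk_msg implementation.

```c
#include <stdbool.h>

/* Hypothetical ring with one spare slot: holds at most RING_SLOTS - 1. */
#define RING_SLOTS 4u

struct ring {
    unsigned int start;   /* index of the first used slot */
    unsigned int end;     /* index of the first unused slot */
};

static unsigned int ring_next(unsigned int i)
{
    return (i + 1) % RING_SLOTS;
}

static unsigned int ring_used(const struct ring *r)
{
    return (r->end + RING_SLOTS - r->start) % RING_SLOTS;
}

static bool ring_full(const struct ring *r)
{
    return ring_used(r) == RING_SLOTS - 1;
}

/* Guarded advance: refuse to move end when the ring is already full. */
static bool ring_push(struct ring *r)
{
    if (ring_full(r))
        return false;
    r->end = ring_next(r->end);
    return true;
}
```

In this model, a full ring with start = 1 and end = 0 has ring_next(end) == start, so an unguarded advance would make the ring look empty and leave the first element open to being overwritten.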
[ ... ]
> @@ -2891,23 +2935,34 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
> /* Shift one or two slots as needed */
> sge = sk_msg_elem_cpy(msg, new);
> sg_unmark_end(&sge);
> + sge_copy = sk_msg_elem_is_copy(msg, new);
>
> nsge = sk_msg_elem_cpy(msg, i);
> + nsge_copy = sk_msg_elem_is_copy(msg, i);
> if (rsge.length) {
> sk_msg_iter_var_next(i);
> nnsge = sk_msg_elem_cpy(msg, i);
> + nnsge_copy = sk_msg_elem_is_copy(msg, i);
> sk_msg_iter_next(msg, end);
> }

Should rsge.offset be adjusted after splitting? When rsge is created
from a split entry, the code does not include:

    rsge.offset += start - offset;

to adjust the offset. rsge describes the tail portion of the split entry
but retains the original offset, which appears incorrect. This could
cause the residual entry to point to the wrong data within the page.

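
The expected bookkeeping for a split can be sketched in userspace like this. struct seg and seg_split() are hypothetical, simplified stand-ins for a scatterlist entry and the split logic, and the at parameter plays the role of start - offset in the review.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a scatterlist entry. */
struct seg {
    uint32_t offset;   /* byte offset of the data within its page */
    uint32_t length;   /* number of bytes described by the entry */
};

/*
 * Split *head at `at` bytes from its beginning and return the tail.
 * Precondition: at <= head->length.
 */
static struct seg seg_split(struct seg *head, uint32_t at)
{
    struct seg tail = *head;

    head->length = at;
    tail.length -= at;
    tail.offset += at;   /* without this, tail would alias head's bytes */
    return tail;
}
```

Splitting an entry with offset 100 and length 50 at 10 bytes gives a head of (100, 10) and a tail of (110, 40), so the tail begins where the head ends.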
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26156967283