Re: [PATCH net 0/2] Fix NPE discovered by running bpf kselftest

From: Levi Zim
Date: Wed Dec 04 2024 - 01:49:35 EST


On 2024-12-04 09:01, Cong Wang wrote:
On Sun, Dec 01, 2024 at 09:42:08AM +0800, Levi Zim wrote:
On 2024-11-30 21:38, Levi Zim via B4 Relay wrote:
I found that bpf kselftest sockhash::test_txmsg_cork_hangs in
test_sockmap.c triggers a kernel NULL pointer dereference:
Interesting, I also ran this test recently and I didn't see such a
crash.

I am also curious about why other people or the CI didn't hit such a crash.

I just did a search and found only one mention of this bug:
https://lore.kernel.org/bpf/20241020110345.1468595-1-zijianzhang@xxxxxxxxxxxxx/

Personally, when trying to run test_sockmap on the Arch Linux 6.12.1 kernel, I get an RCU stall instead of this NPE:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:         Tasks blocked on level-0 rcu_node (CPUs 0-11): P3378
rcu:         (detected by 0, t=18002 jiffies, g=9525, q=28619 ncpus=12)
task:test_sockmap    state:R  running task     stack:0 pid:3378  tgid:3378  ppid:1168   flags:0x00004006
Call Trace:
 <TASK>
 ? __schedule+0x3b8/0x12b0
 ? get_page_from_freelist+0x366/0x1730
 ? sysvec_apic_timer_interrupt+0xe/0x90
 ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
 ? bpf_msg_pop_data+0x41e/0x690
 ? mem_cgroup_charge_skmem+0x40/0x60
 ? bpf_prog_1fca1a523ce93f38_bpf_prog4+0x23d/0x248
 ? sk_psock_msg_verdict+0x99/0x1e0
 ? tcp_bpf_sendmsg+0x42d/0x9f0
 ? sock_sendmsg+0x109/0x130
 ? splice_to_socket+0x359/0x4f0
 ? shmem_file_splice_read+0x2cd/0x300
 ? direct_splice_actor+0x51/0x130
 ? splice_direct_to_actor+0xf0/0x260
 ? __pfx_direct_splice_actor+0x10/0x10
 ? do_splice_direct+0x77/0xc0
 ? __pfx_direct_file_splice_eof+0x10/0x10
 ? do_sendfile+0x382/0x440
 ? __x64_sys_sendfile64+0xb3/0xd0
 ? do_syscall_64+0x82/0x190
 ? find_next_iomem_res+0xbe/0x130
 ? __pfx_pagerange_is_ram_callback+0x10/0x10
 ? walk_system_ram_range+0xa6/0x100
 ? __pte_offset_map+0x1b/0x180
 ? __pte_offset_map_lock+0x9e/0x130
 ? set_ptes.isra.0+0x41/0x90
 ? insert_pfn+0xba/0x210
 ? vmf_insert_pfn_prot+0x85/0xd0
 ? __do_fault+0x30/0x170
 ? do_fault+0x303/0x4c0
 ? __handle_mm_fault+0x7c2/0xfa0
 ? shmem_file_write_iter+0x5b/0x90
 ? __count_memcg_events+0x53/0xf0
 ? count_memcg_events.constprop.0+0x1a/0x30
 ? handle_mm_fault+0x1bb/0x2c0
 ? do_user_addr_fault+0x17f/0x620
 ? clear_bhb_loop+0x25/0x80
 ? clear_bhb_loop+0x25/0x80
 ? clear_bhb_loop+0x25/0x80
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>

BUG: kernel NULL pointer dereference, address: 0000000000000008
? __die_body+0x6e/0xb0
? __die+0x8b/0xa0
? page_fault_oops+0x358/0x3c0
? local_clock+0x19/0x30
? lock_release+0x11b/0x440
? kernelmode_fixup_or_oops+0x54/0x60
? __bad_area_nosemaphore+0x4f/0x210
? mmap_read_unlock+0x13/0x30
? bad_area_nosemaphore+0x16/0x20
? do_user_addr_fault+0x6fd/0x740
? prb_read_valid+0x1d/0x30
? exc_page_fault+0x55/0xd0
? asm_exc_page_fault+0x2b/0x30
? splice_to_socket+0x52e/0x630
? shmem_file_splice_read+0x2b1/0x310
direct_splice_actor+0x47/0x70
splice_direct_to_actor+0x133/0x300
? do_splice_direct+0x90/0x90
do_splice_direct+0x64/0x90
? __ia32_sys_tee+0x30/0x30
do_sendfile+0x214/0x300
__se_sys_sendfile64+0x8e/0xb0
__x64_sys_sendfile64+0x25/0x30
x64_sys_call+0xb82/0x2840
do_syscall_64+0x75/0x110
entry_SYSCALL_64_after_hwframe+0x4b/0x53

This is caused by tcp_bpf_sendmsg() returning a larger value (12289) than
size (8192), which causes the while loop in splice_to_socket() to release
an uninitialized pipe buf.

The underlying cause is that this code assumes sk_msg_memcopy_from_iter()
will copy all bytes upon success but it actually might only copy part of
it.
I am not sure what Fixes tag I should put. Git blame leads me to a refactor
commit and I am not familiar with this part of code base. Any suggestions?
I think it is the following commit which introduced memcopy_from_iter()
(which was renamed to sk_msg_memcopy_from_iter() later):

commit 4f738adba30a7cfc006f605707e7aee847ffefa0
Author: John Fastabend <john.fastabend@xxxxxxxxx>
Date: Sun Mar 18 12:57:10 2018 -0700

bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

Please double check.

Thanks.
Thanks for your help. I will double check it.