Re: [syzbot] [can?] KCSAN: data-race in can_send / can_send (5)

From: Vincent Mailhol
Date: Mon Mar 10 2025 - 05:29:31 EST


On Mon. 10 Mar 2025 at 03:59, Oliver Hartkopp <socketcan@xxxxxxxxxxxx> wrote:
> Hello Marc,
>
> On 09.03.25 11:46, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: 0f52fd4f67c6 Merge tag 'bcachefs-2025-03-06' of git://evil..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=12d12a54580000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=523b0e2f15224775
> > dashboard link: https://syzkaller.appspot.com/bug?extid=78ce4489b812515d5e4d
> > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/eb0d7b540c67/disk-0f52fd4f.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/51c261332ad9/vmlinux-0f52fd4f.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/38914a4790c8/bzImage-0f52fd4f.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+78ce4489b812515d5e4d@xxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > ==================================================================
> > BUG: KCSAN: data-race in can_send / can_send
> >
> > read-write to 0xffff888117566290 of 8 bytes by interrupt on cpu 0:
> > can_send+0x5a2/0x6d0 net/can/af_can.c:290
> > bcm_can_tx+0x314/0x420 net/can/bcm.c:314
> > bcm_tx_timeout_handler+0xea/0x280
> > __run_hrtimer kernel/time/hrtimer.c:1801 [inline]
> > __hrtimer_run_queues+0x20d/0x5e0 kernel/time/hrtimer.c:1865
> > hrtimer_run_softirq+0xe4/0x2c0 kernel/time/hrtimer.c:1882
> > handle_softirqs+0xbf/0x280 kernel/softirq.c:561
> > run_ksoftirqd+0x1c/0x30 kernel/softirq.c:950
> > smpboot_thread_fn+0x31c/0x4c0 kernel/smpboot.c:164
> > kthread+0x4ae/0x520 kernel/kthread.c:464
> > ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:148
> > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> >
> > read-write to 0xffff888117566290 of 8 bytes by interrupt on cpu 1:
> > can_send+0x5a2/0x6d0 net/can/af_can.c:290
> > bcm_can_tx+0x314/0x420 net/can/bcm.c:314
> > bcm_tx_timeout_handler+0xea/0x280
> > __run_hrtimer kernel/time/hrtimer.c:1801 [inline]
> > __hrtimer_run_queues+0x20d/0x5e0 kernel/time/hrtimer.c:1865
> > hrtimer_run_softirq+0xe4/0x2c0 kernel/time/hrtimer.c:1882
> > handle_softirqs+0xbf/0x280 kernel/softirq.c:561
> > run_ksoftirqd+0x1c/0x30 kernel/softirq.c:950
> > smpboot_thread_fn+0x31c/0x4c0 kernel/smpboot.c:164
> > kthread+0x4ae/0x520 kernel/kthread.c:464
> > ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:148
> > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
> >
> > value changed: 0x0000000000002b9d -> 0x0000000000002b9e
> >
>
> Increased by '1' ...
>
> I assume this problem is caused by increasing the per-netdevice statistic in
>
> https://elixir.bootlin.com/linux/v6.13.6/source/net/can/af_can.c#L289
>
> pkg_stats->tx_frames++;
> pkg_stats->tx_frames_delta++;
>
> We update the statistics for the device and in this specific case the
> hrtimer fired on two CPUs resulting in a can_send() to the same netdevice.
>
> Do you agree with this quick analysis?

Ack. Same conclusion here.

> Isn't there some lock-less per-cpu safe statistic handling within netdev
> we might pick for our use-case?

I see two solutions. Either we use lock_sock(skb->sk) and
release_sock(skb->sk) or we can change the types of
can_pkg_stats->tx_frames and can_pkg_stats->tx_frames_delta from long
to atomic_long_t.

The atomic_long_t is the closest solution to a lock-less. But my
preference goes to the lock_sock() which looks more natural in this
context. And look_sock() is just a spinlock which under the hood is
also an atomic, so no big penalty either.


Yours sincerely,
Vincent Mailhol