Re: [syzbot] [v9fs?] KASAN: slab-use-after-free Write in v9fs_free_request

From: asmadeus
Date: Mon May 20 2024 - 03:33:04 EST


+To David as I need help with netfs

syzbot wrote on Sun, May 12, 2024 at 12:42:33PM -0700:
> UAF in
> Workqueue: events_unbound v9fs_upload_to_server_worker
> refcount_dec_and_test include/linux/refcount.h:325 [inline]
> p9_fid_put include/net/9p/client.h:275 [inline]
> v9fs_free_request+0x5f/0xe0 fs/9p/vfs_addr.c:128
> netfs_free_request+0x246/0x600 fs/netfs/objects.c:97
> v9fs_upload_to_server fs/9p/vfs_addr.c:36 [inline]
> v9fs_upload_to_server_worker+0x200/0x3e0 fs/9p/vfs_addr.c:44
> process_one_work kernel/workqueue.c:3267 [inline]

> Freed by task 32641:
> p9_fid_destroy net/9p/client.c:889 [inline]
> p9_client_destroy+0x1fb/0x660 net/9p/client.c:1070
> v9fs_session_close+0x51/0x210 fs/9p/v9fs.c:506
> v9fs_kill_super+0x5c/0x90 fs/9p/vfs_super.c:196
> deactivate_locked_super+0xc6/0x130 fs/super.c:472
> cleanup_mnt+0x426/0x4c0 fs/namespace.c:1267

That's a tough one: netfs took a ref in v9fs_init_request (the netfs
op's init_request()) and expects to be able to use it until
v9fs_free_request (the netfs op's free_request()), but the fs was
unmounted first. We kill the kmem cache at that point, so we
aggressively drop any dangling ref there, as there's no way of waiting.
(This is corroborated by "9pnet: Found fid 1 not clunked" in dmesg in
the syzkaller logs.)
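
To make the ordering concrete, here is a rough userspace model of it.
This is only a sketch, not kernel code: struct model_fid and every
model_* function are made-up names, and a "destroyed" flag stands in
for the memory actually going back to the slab.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a fid; not the real struct p9_fid. */
struct model_fid {
	int refcount;
	bool destroyed;	/* models "memory already returned to the slab" */
};

/* Models v9fs_init_request(): netfs takes a ref for the request's lifetime. */
static void model_init_request(struct model_fid *fid)
{
	fid->refcount++;
}

/*
 * Models the fid cleanup in p9_client_destroy(): any fid still around is
 * dropped regardless of its refcount, since umount cannot wait here.
 */
static void model_client_destroy(struct model_fid *fid)
{
	fid->destroyed = true;
}

/*
 * Models v9fs_free_request(): returns false at the point where the real
 * code would decrement a refcount living in freed memory.
 */
static bool model_free_request(struct model_fid *fid)
{
	if (fid->destroyed)
		return false;
	assert(fid->refcount > 0);
	fid->refcount--;
	return true;
}

/* Replays the order seen in the report; true means the UAF would occur. */
static bool model_replay_report(void)
{
	struct model_fid fid = { .refcount = 1, .destroyed = false };

	model_init_request(&fid);		/* writeback starts */
	model_client_destroy(&fid);		/* umount gets there first */
	return !model_free_request(&fid);	/* worker finishes late */
}
```

Calling model_replay_report() walks through init_request, then the
client destroy, then free_request, which is the order the two stack
traces above suggest.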

The other two recent KASAN errors are similar:
https://lkml.kernel.org/r/000000000000b86c5e06130da9c6@xxxxxxxxxx
is pretty much the same (it's just that the decrement here hit 0 as
umount was in the middle of doing it?), and
https://lkml.kernel.org/r/000000000000041f960618206d7e@xxxxxxxxxx
is yet another step faster (netfs dropped the last ref while the cache
was being emptied and destroyed the fid first, which is possible because
we're not taking the client lock at this point, as we weren't expecting
any other access after umount).

David, got an idea on how we could wait for these async writebacks?


Notes:
- David removed v9fs_upload_to_server in 2df86547b23d ("netfs: Cut
over to using new writeback code") (and c245868524cc ("netfs: Remove the
old writeback code")) in master, but the problem is still present
conceptually.
- layering-wise, 9p (fs) depends on 9pnet, so 9pnet cannot call into the
fs code; the wait has to be in v9fs_session_close(), before calling
p9_client_destroy(), or earlier.
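
For the kind of wait I have in mind, a minimal userspace sketch with
pthreads follows. Everything in it is an assumption for illustration:
the per-session "inflight" counter and the model_* helpers are
hypothetical, nothing like this exists in 9p or netfs as far as I know,
and the kernel version would use its own primitives rather than
pthreads.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical per-session bookkeeping; not an existing 9p structure. */
struct model_session {
	pthread_mutex_t lock;
	pthread_cond_t idle;
	int inflight;	/* requests between init_request and free_request */
};

/* Would be called where init_request takes its fid ref. */
static void model_request_start(struct model_session *s)
{
	pthread_mutex_lock(&s->lock);
	s->inflight++;
	pthread_mutex_unlock(&s->lock);
}

/* Would be called where free_request drops its fid ref. */
static void model_request_end(struct model_session *s)
{
	pthread_mutex_lock(&s->lock);
	assert(s->inflight > 0);
	if (--s->inflight == 0)
		pthread_cond_broadcast(&s->idle);
	pthread_mutex_unlock(&s->lock);
}

/*
 * Models the wait in session close: block until no request is in flight,
 * then return how many are left at the point the client would be destroyed.
 */
static int model_session_close(struct model_session *s)
{
	int leftover;

	pthread_mutex_lock(&s->lock);
	while (s->inflight > 0)
		pthread_cond_wait(&s->idle, &s->lock);
	leftover = s->inflight;
	pthread_mutex_unlock(&s->lock);
	return leftover;
}

/* Stands in for the async writeback worker finishing late. */
static void *model_worker(void *arg)
{
	model_request_end(arg);
	return NULL;
}

/* One request, completed from another thread while close is waiting. */
static int model_run(void)
{
	struct model_session s = { .inflight = 0 };
	pthread_t worker;
	int leftover;

	pthread_mutex_init(&s.lock, NULL);
	pthread_cond_init(&s.idle, NULL);

	model_request_start(&s);
	if (pthread_create(&worker, NULL, model_worker, &s) != 0)
		return -1;
	leftover = model_session_close(&s);
	pthread_join(worker, NULL);

	pthread_cond_destroy(&s.idle);
	pthread_mutex_destroy(&s.lock);
	return leftover;
}
```

model_run() returns the number of requests still in flight when the
destroy step would run. The open question is where such a counter
could live given the layering constraint above.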


Thanks,
--
Dominique Martinet | Asmadeus