[HELP] FUSE writeback performance bottleneck

From: Jingbo Xu
Date: Mon Jun 03 2024 - 02:17:58 EST


Hi, Miklos,

We spotted a performance bottleneck in FUSE writeback where the
writeback kworker consumes nearly 100% CPU, of which about 40% is
spent in copy_page():

fuse_writepages_fill
    alloc tmp_page
    copy_highpage

This stems from the FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), which allocates a new temp page for
each dirty page to be written back, copies the content of the dirty
page into the temp page, and then writes back the temp page instead.
This special design is intentional, to avoid potential deadlocks caused
by a buggy or even malicious FUSE user daemon.
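For reference, the copy path in fuse_writepages_fill() (fs/fuse/file.c)
looks roughly like the following simplified excerpt (error handling and
bookkeeping trimmed):

```
	/* Shadow each dirty page with a newly allocated tmp page ... */
	tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);

	/* ... copy the dirty data over ... */
	copy_highpage(tmp_page, page);

	/* ... and queue the tmp page for writeback, so the original page
	 * can be released without waiting on the userspace daemon. */
	ap->pages[ap->num_pages] = tmp_page;
	data->orig_pages[ap->num_pages] = page;
```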

There was a proposal to remove this constraint for virtiofs [1], which
is reasonable since virtiofs users and the virtiofs daemon don't run on
the same OS, and the virtiofs daemon is usually provided by cloud
vendors and should not be malicious. For the normal /dev/fuse
interface, however, I don't think removing the constraint is
acceptable.


Coming back to the writeback performance bottleneck: another important
factor is that (IIUC) only one kworker at a time is allowed to do
writeback for each filesystem instance (when cgroup writeback is not
enabled). The kworker is scheduled upon sb->s_bdi->wb.dwork, and the
workqueue infrastructure guarantees that at most one *running* worker
is allowed for one specific work item (sb->s_bdi->wb.dwork) at any
time. Thus writeback is constrained to one CPU for each filesystem
instance.
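
That funneling can be seen in fs/fs-writeback.c; a simplified excerpt
(the exact locking may differ by kernel version):

```
/* fs/fs-writeback.c (simplified): every wakeup of the flusher for a
 * given bdi_writeback re-queues the same wb->dwork item, and the
 * workqueue layer never runs one work item on two workers at once,
 * so at most one kworker writes back a given bdi at any time. */
static void wb_wakeup(struct bdi_writeback *wb)
{
	spin_lock_irq(&wb->work_lock);
	if (test_bit(WB_registered, &wb->state))
		mod_delayed_work(bdi_wq, &wb->dwork, 0);
	spin_unlock_irq(&wb->work_lock);
}
```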

I'm not sure whether it is a good idea to offload the page copying,
and then the sending of the FUSE requests, to another worker (once a
batch of dirty pages has been collected), e.g.

```
fuse_writepages_fill:
    if fuse_writepage_need_send:
        # schedule a work

# the worker:
for each dirty page in ap->pages[]:
    copy_page
fuse_writepages_send
```
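
A minimal sketch of what that could look like, assuming
fuse_fill_wb_data (or something with a similar lifetime) grows a
work_struct; the ->work member and fuse_writepages_copy_send() below
are hypothetical, while fuse_fill_wb_data, orig_pages[], ap->pages[]
and fuse_writepages_send() already exist in fs/fuse/file.c:

```
#include <linux/workqueue.h>
#include <linux/highmem.h>

/* Hypothetical worker: perform the dirty page -> tmp page copies and
 * send the requests off the flusher kworker.  fuse_fill_wb_data is
 * currently on-stack in fuse_writepages(), so it would have to be
 * heap-allocated for this (lifetime handling omitted here). */
static void fuse_writepages_copy_send(struct work_struct *work)
{
	struct fuse_fill_wb_data *data =
		container_of(work, struct fuse_fill_wb_data, work);
	struct fuse_args_pages *ap = &data->wpa->ia.ap;
	unsigned int i;

	for (i = 0; i < ap->num_pages; i++)
		copy_highpage(ap->pages[i], data->orig_pages[i]);

	fuse_writepages_send(data);
}
```

fuse_writepages_fill() would then queue this work (e.g. on
system_unbound_wq) instead of copying inline, which would also let the
copies for different batches spread across CPUs.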

Any suggestion?



This issue can be reproduced by:

1. ./libfuse/build/example/passthrough_ll -o cache=always -o writeback \
       -o source=/run/ /mnt
   ("/run/" is a tmpfs mount)

2. fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M \
       --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
   (at least two jobs are needed; fio reports ~1800MiB/s buffered write
   bandwidth)


[1]
https://lore.kernel.org/all/20231228123528.705-1-lege.wang@xxxxxxxxxxxxxxx/


--
Thanks,
Jingbo