Re: [HELP] FUSE writeback performance bottleneck
From: Bernd Schubert
Date: Mon Jun 03 2024 - 10:43:37 EST
On 6/3/24 08:17, Jingbo Xu wrote:
> Hi, Miklos,
>
> We spotted a performance bottleneck for FUSE writeback in which the
> writeback kworker has consumed nearly 100% CPU, among which 40% CPU is
> used for copy_page().
>
> fuse_writepages_fill
> alloc tmp_page
> copy_highpage
>
> This is because of FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), which newly allocates a temp page for
> each dirty page to be written back, copy content of dirty page to temp
> page, and then write back the temp page instead. This special design is
> intentional to avoid potential deadlocked due to buggy or even malicious
> fuse user daemon.
I also noticed that and I admin that I don't understand it yet. The commit says
<quote>
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
</quote>
Timing - NFS/cifs/etc have the same issue? Even a local file system has no guarantees
how fast storage is?
Buggy - hmm yeah, then it is splice related only? But I think splice feature was
not introduced yet when fuse got mmap and writeback in 2008?
Without splice the pages are just copied into a userspace buffer? So what can
userspace do wrong with its copy?
Failure - why can't it do what nfs_mapping_set_error() does?
I guess I miss something, but so far I don't understand what that is.
>
> There was a proposal of removing this constraint for virtiofs [1], which
> is reasonable as users of virtiofs and virtiofs daemon don't run on the
> same OS, and virtiofs daemon is usually offered by cloud vendors that
> shall not be malicious. While for the normal /dev/fuse interface, I
> don't think removing the constraint is acceptable.
>
>
> Come back to the writeback performance bottleneck. Another important
> factor is that, (IIUC) only one kworker at the same time is allowed for
> writeback for each filesystem instance (if cgroup writeback is not
> enabled). The kworker is scheduled upon sb->s_bdi->wb.dwork, and the
> workqueue infrastructure guarantees that at most one *running* worker is
> allowed for one specific work (sb->s_bdi->wb.dwork) at any time. Thus
> the writeback is constraint to one CPU for each filesystem instance.
>
> I'm not sure if offloading the page copying and then FUSE requests
> sending to another worker (if a bunch of dirty pages have been
> collected) is a good idea or not, e.g.
>
> ```
> fuse_writepages_fill
> if fuse_writepage_need_send:
> # schedule a work
>
> # the worker
> for each dirty page in ap->pages[]:
> copy_page
> fuse_writepages_send
> ```
>
> Any suggestion?
>
>
>
> This issue can be reproduced by:
>
> 1 ./libfuse/build/example/passthrough_ll -o cache=always -o writeback -o
> source=/run/ /mnt
> ("/run/" is a tmpfs mount)
>
> 2 fio --name=write_test --ioengine=psync --iodepth=1 --rw=write --bs=1M
> --direct=0 --size=1G --numjobs=2 --group_reporting --directory=/mnt
> (at least two threads are needed; fio shows ~1800MiB/s buffer write
> bandwidth)
That should quickly run out of tmpfs memory. I need to find time to improve
this a bit, but this should give you an easier test:
https://github.com/libfuse/libfuse/pull/807
>
>
> [1]
> https://lore.kernel.org/all/20231228123528.705-1-lege.wang@xxxxxxxxxxxxxxx/
>
>
Thanks,
Bernd