Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit

From: Antonio SJ Musumeci
Date: Mon Apr 08 2024 - 10:27:12 EST


On 4/8/24 01:32, Sweet Tea Dorminy wrote:
>
> On 2024-01-26 01:29, Jingbo Xu wrote:
>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>
>>>
>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>
>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>> single request is increased.
>>>>
>>>> The only worry is about where this memory is getting accounted to.
>>>> This needs to be thought through, since we are increasing the
>>>> possible memory that an unprivileged user is allowed to pin.
>>
>> Apart from the request size, the maximum number of background requests,
>> i.e. max_background (12 by default, and configurable by the fuse
>> daemon), also limits the size of the memory that an unprivileged user
>> can pin. But yes, it indeed increases the number proportionally by
>> increasing the maximum request size.
>>
>>
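
A rough sketch of the bound described above, assuming 4 KiB pages and
assuming that only background requests count toward the pinned memory
(foreground requests are ignored here). The function name and the
1024-page figure are illustrative only, not taken from the patch:

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages
MIB = 1 << 20


def pinned_bytes(max_background: int, max_pages: int,
                 page_size: int = PAGE_SIZE) -> int:
    """Rough upper bound on memory pinned by in-flight background requests."""
    return max_background * max_pages * page_size


# Default max_background of 12 with the current 256 pages per request.
current = pinned_bytes(12, 256)
# Same max_background with a hypothetical 1024-page limit (4 MiB requests).
raised = pinned_bytes(12, 1024)

print(f"current: {current // MIB} MiB, raised: {raised // MIB} MiB")
```

The bound scales linearly in both factors, so a server that also raises
max_background multiplies the effect of a larger request size.
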
>>>
>>>> It would be interesting to
>>>> see how the number of pages per request affects performance and
>>>> why.
>>>
>>> To be honest, I'm not sure of the root cause of the performance boost
>>> in bytedance's case.
>>>
>>> In our internal use scenario, the optimal IO size of the backend
>>> store at the fuse server side is, e.g. 4MB, and thus the maximum
>>> throughput cannot be achieved with the current 256 pages per request.
>>> IOW the backend store, e.g. a distributed parallel filesystem, gets
>>> optimal performance when the data is aligned on a 4MB boundary. I can
>>> ask the colleagues who implement the fuse server to give more
>>> background info and the exact performance statistics.
>>
>> Here are more details about our internal use case:
>>
>> We have a fuse server used in our internal cloud scenarios, where the
>> backend store is actually a distributed filesystem. That is, the fuse
>> server acts as a client of the remote distributed filesystem. The fuse
>> server forwards the fuse requests to the remote backing store over the
>> network, while the remote distributed filesystem handles the IO
>> requests, e.g. processes the data from/to the persistent store.
>>
>> Here are the details of how the remote distributed filesystem
>> processes the requested data with the persistent store.
>>
>> [1] The remote distributed filesystem uses, e.g. an 8+3 mode, EC
>> (ErasureCode), where each fixed-size chunk of user data is split and
>> stored as 8 data blocks plus 3 extra parity blocks. For example, with
>> a 512KB block size, each 4MB of user data is split and stored as 8
>> (512KB) data blocks with 3 (512KB) parity blocks.
>>
>> It also uses striping to boost performance. For example, there are 8
>> data disks and 3 parity disks in the above 8+3 mode example, in which
>> each stripe consists of 8 data blocks and 3 parity blocks.
>>
>> [2] To avoid data corruption on power off, the remote distributed
>> filesystem commits an O_SYNC write right away once a write (fuse)
>> request is received. Because of the EC scheme described above, when the
>> write fuse request is not aligned on a 4MB (the stripe size) boundary,
>> say it's 1MB in size, the other 3MB is read from the persistent store
>> first, then the extra 3 parity blocks are computed over the complete
>> 4MB stripe, and finally the 8 data blocks and 3 parity blocks are
>> written down.
>>
>>
>> Thus the write amplification is non-negligible and is the performance
>> bottleneck when the fuse request size is less than the stripe size.
>>
>> Here are some simple performance statistics with varying request size.
>> With 4MB stripe size, there's ~3x bandwidth improvement when the
>> maximum request size is increased from 256KB to 3.9MB, and another ~20%
>> improvement when the request size is increased to 4MB from 3.9MB.
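
A minimal sketch of the read-modify-write cost described in [2], under
stated assumptions: a 4 MiB stripe, 8 data plus 3 parity blocks, each
block one eighth of the stripe, and a write that falls entirely within
one stripe. The function name and the cost model are illustrative, not
the actual backend's accounting:

```python
MIB = 1 << 20


def stripe_io(request_bytes: int, stripe_bytes: int = 4 * MIB,
              data_blocks: int = 8, parity_blocks: int = 3) -> tuple[int, int]:
    """Return (bytes_read, bytes_written) for one write within one stripe."""
    if not 0 < request_bytes <= stripe_bytes:
        raise ValueError("request must fit inside a single stripe")
    block = stripe_bytes // data_blocks
    # The part of the stripe not covered by the request is read back first.
    bytes_read = stripe_bytes - request_bytes
    # The full stripe plus freshly computed parity is then written out.
    bytes_written = (data_blocks + parity_blocks) * block
    return bytes_read, bytes_written


for size in (1 * MIB, 4 * MIB):
    read, written = stripe_io(size)
    print(f"{size // MIB} MiB request: read {read / MIB} MiB, "
          f"write {written / MIB} MiB, "
          f"{(read + written) / size:.3f}x backend IO per byte")
```

In this model a 1 MiB request costs 8.5 MiB of backend IO, while a full
4 MiB request costs 5.5 MiB for four times the payload.
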
>
> To add my own performance statistics in a microbenchmark:
>
> Tested on both a small VM and large hardware, with suitably large
> FUSE_MAX_MAX_PAGES, using a simple fuse filesystem whose write handlers
> did basically nothing but read the data buffers (memcmp() each 8 bytes
> of data provided against a variable), I ran fio with 128M blocksize,
> end_fsync=1, psync IO engine, times each of 4 parallel jobs. Throughput
> in MB/s for each write_size was as follows.
>
> write_size   machine1 (MB/s)   machine2 (MB/s)
> 32M          1071              6425
> 16M          1002              6445
> 8M           890               6443
> 4M           713               6342
> 2M           557               6290
> 1M           404               6201
> 512K         268               6041
> 256K         156               5782
>
> Even on the fast machine, increasing the buffer size to 8M is worth 3.9%
> over keeping it at 1M, and is worth over 2x on the small VM. We are
> striving to improve ingestion speed in particular, as we have seen
> it be a limiting factor on some machines, and there's a clear plateau
> reached around 8M. While most fuse servers would likely not benefit from
> this, and others would benefit from fuse passthrough instead, it does
> seem like a performance win.
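
The relative gains quoted above can be recomputed from the table. A
small sketch; the dictionary layout and the helper name are mine, the
numbers are copied from the table:

```python
# write_size -> (machine1, machine2) throughput in MB/s, from the table above.
throughput = {
    "32M": (1071, 6425),
    "16M": (1002, 6445),
    "8M": (890, 6443),
    "4M": (713, 6342),
    "2M": (557, 6290),
    "1M": (404, 6201),
    "512K": (268, 6041),
    "256K": (156, 5782),
}


def gain(size: str, base: str, machine: int) -> float:
    """Throughput at `size` relative to `base` (machine 0 or 1)."""
    return throughput[size][machine] / throughput[base][machine]


small_vm = gain("8M", "1M", 0)
fast_box = gain("8M", "1M", 1)
print(f"machine1, 8M vs 1M: {small_vm:.2f}x")
print(f"machine2, 8M vs 1M: +{(fast_box - 1) * 100:.1f}%")
```
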
>
> Perhaps, in analogy to soft and hard limits on pipe size,
> FUSE_MAX_MAX_PAGES could be increased and treated as the maximum
> possible hard limit for max_write; and the default hard limit could stay
> at 1M, thereby allowing folks to opt into the new behavior if they care
> about the performance more than the memory?
>
> Sweet Tea

As I recall, the concern about increased message sizes is that it gives a
process the ability to allocate significant amounts of kernel
memory. Perhaps the limits could be expanded only if the server has
CAP_SYS_ADMIN.
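
To make the idea concrete, here is a rough sketch of such a policy in
Python. It is not kernel code: the constant names, the function name and
the 2048-page raised ceiling are hypothetical, and only the 256-page
default comes from this thread. 4 KiB pages are assumed.

```python
DEFAULT_HARD_LIMIT_PAGES = 256    # current limit: 1 MiB with 4 KiB pages
RAISED_HARD_LIMIT_PAGES = 2048    # hypothetical raised ceiling: 8 MiB


def effective_max_pages(requested: int, has_cap_sys_admin: bool) -> int:
    """Clamp a server's requested max_pages to the ceiling it is allowed."""
    if requested < 1:
        raise ValueError("max_pages must be at least 1")
    ceiling = (RAISED_HARD_LIMIT_PAGES if has_cap_sys_admin
               else DEFAULT_HARD_LIMIT_PAGES)
    return min(requested, ceiling)


# An unprivileged server asking for 8 MiB requests stays at the default.
print(effective_max_pages(2048, has_cap_sys_admin=False))
# A privileged server gets what it asked for, up to the raised ceiling.
print(effective_max_pages(2048, has_cap_sys_admin=True))
```

This keeps today's behavior for unprivileged mounts and makes the larger
request size an explicit opt-in for privileged servers.
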