Re: fuse: increase readdir() buffer size
From: Jaco Kroon
Date: Sat Mar 29 2025 - 05:00:23 EST
Hi David,
On 2025/03/28 21:40, David Laight wrote:
On Fri, 28 Mar 2025 12:15:47 +0200
Jaco Kroon <jaco@xxxxxxxxx> wrote:
Hi All,
I've not seen feedback on this, please may I get some direction on this?
The only thing I can think of is that the longer loop might affect
scheduling latencies.
I'm assuming you're referring to the fact that it's now possible to
iterate more times through the loop at the very last phase?
The first loop which sets up the size of the folio should only ever
execute once unless there is fairly huge memory pressure and allocations
fails.
The clamping code is unfortunate in my opinion, but it is possible that
we can either get an insane huge count (based on inspection of other
code, intended to be -1) or a zero value.
The upper limit here is arbitrary, but in the usual case I'm expecting a
128KiB buffer to be allocated, which should usually succeed on the first
round.
The call to fuse_read_args_fill may result in marginally longer running
code, along with the clamping code, but due to larger buffers of
dentries being returned this is more than compensated for in terms of
the drastic reduction in user-kernel space context switches by way of
fewer getdents() system calls having to be made to iterate a full readdir().
Other systems may be more resilient, but glusterfs without my dentry
cache patches has a tendency to "forget" dentries under pressure, and
then have to restart getdents() runs via client-server round-trip hoops,
often resulting in system call durations on getdents upwards of 30s at a
time. With this patch, the overall latency comes down drastically, and
the number of these (inferred) restart runs on the server side to find
the right place to continue reading from is also reduced, even without
increasing the dentry cache in the glusterfs code. These two patches in
combination basically in my experience makes glusterfs operate no worse
than NFS on average.
Given the feedback I'll discuss deploying updated kernels including
these patches to our production rather than test environments (one node
at a time, 21 in total) with the team. And then provide feedback from
there. Our rule remains not more than one node at a time for
potentially disruptive changes like these, and preferably no more than
one node a day per environment.
In our test environment this panned out on newest (at time of mailing
this in) kernel, and informal time trials (time find /mount/point, with
180m inodes) was quite positive, and orders of magnitude better than
without (We killed the without after it took about 3x longer than with,
still incomplete).
Kind regards,
Jaco
David
Kind regards,
Jaco
On 2025/03/15 00:16, Jaco Kroon wrote:
This is a follow up to the attempt made a while ago.
Whist the patch worked, newer kernels have moved from pages to folios,
which gave me the motivation to implement the mechanism based on the
userspace buffer size patch that Miklos supplied.
That patch works as is, but I note there are changes to components
(overlayfs and exportfs) that I've got very little experience with, and
have not tested specifically here. They do look logical. I've marked
Miklos as the Author: here, and added my own Signed-off-by - I hope this
is correct.
The second patch in the series implements the changes to fuse's readdir
in order to utilize the first to enable reading more than one page of
dent structures at a time from userspace, I've included a strace from
before and after this patch in the commit to illustrate the difference.
To get the relevant performance on glusterfs improved (which was
mentioned elsewhere in the thread) changes to glusterfs to increase the
number of cached dentries is also required (these are pushed to github
but not yet merged, because similar to this patch, got stalled before
getting to the "ready for merge" phase even though it's operational).
Please advise if these two patches looks good (I've only done relatively
basic testing now, and it's not running on production systems yet)
Kind regards,
Jaco