Re: fuse: increase readdir() buffer size

From: Jaco Kroon
Date: Sat Mar 29 2025 - 05:00:23 EST


Hi David,

On 2025/03/28 21:40, David Laight wrote:
> On Fri, 28 Mar 2025 12:15:47 +0200
> Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
>> Hi All,
>>
>> I've not seen feedback on this; please may I get some direction on this?
>
> The only thing I can think of is that the longer loop might affect
> scheduling latencies.

I'm assuming you're referring to the fact that it's now possible to iterate through the loop more times during the very last phase?

The first loop, which sets up the size of the folio, should only ever execute once unless there is fairly severe memory pressure and the allocation fails.
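
To make that concrete, the shape of that loop is roughly the following (illustrative sketch only, not the actual code from the patch; the order/fallback naming is mine):

unsigned int order = get_order(128 * 1024);   /* 128 KiB => order 5 on 4 KiB pages */
struct folio *folio;

/* Normally this executes exactly once; only when the large allocation
 * fails under memory pressure do we retry with smaller orders. */
do {
        folio = folio_alloc(GFP_KERNEL, order);
} while (!folio && order-- > 0);

if (!folio)
        return -ENOMEM;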

The clamping code is unfortunate in my opinion, but it is possible to receive either an insanely large count (based on inspection of other code, apparently intended to be -1) or a zero value.

The upper limit here is arbitrary, but in the usual case I expect a 128 KiB buffer to be allocated, which should normally succeed on the first attempt.
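
For reference, the clamp I have in mind looks something like this (sketch only; READDIR_BUF_MAX is a made-up name for the arbitrary upper bound, not what the patch calls it):

/* Force a bogus count (0 or (size_t)-1 from the caller) into a sane
 * window before sizing the folio. */
#define READDIR_BUF_MAX (128 * 1024)

size_t bufsize = clamp_t(size_t, count, PAGE_SIZE, READDIR_BUF_MAX);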

The call to fuse_read_args_fill, together with the clamping code, may result in marginally longer-running code, but because larger buffers of dentries are returned, this is more than compensated for by the drastic reduction in user-kernel context switches: far fewer getdents() system calls are needed to iterate a full readdir().
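
As a rough userspace illustration of why this matters (demo program only, not part of the patches), counting getdents64() calls for a full directory scan shows how a larger per-call buffer translates directly into fewer kernel round trips:

/* Demo only: scan a directory and report how many getdents64() calls
 * the kernel needed to return all entries.  A FUSE filesystem that can
 * fill a 128 KiB buffer per call needs dramatically fewer calls than
 * one limited to ~4 KiB per call. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        static char buf[128 * 1024];
        int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
        long n, calls = 0, entries = 0;

        if (fd < 0)
                return 1;

        while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
                long off = 0;

                calls++;
                while (off < n) {
                        struct dirent64 *d = (struct dirent64 *)(buf + off);

                        entries++;
                        off += d->d_reclen;
                }
        }
        printf("%ld entries in %ld getdents64() calls\n", entries, calls);
        close(fd);
        return 0;
}

Running this against a large directory on the FUSE mount before and after the patch should mirror the before/after strace difference mentioned in the second patch's commit message.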

Other systems may be more resilient, but glusterfs without my dentry cache patches has a tendency to "forget" dentries under pressure, and then has to restart getdents() runs via client-server round trips, often resulting in individual getdents() calls taking upwards of 30s. With this patch the overall latency comes down drastically, and the number of these (inferred) restart runs on the server side, needed to find the right place to continue reading from, is also reduced, even without increasing the dentry cache in the glusterfs code. In my experience these two patches in combination make glusterfs perform no worse than NFS on average.

Given the feedback, I'll discuss with the team deploying updated kernels including these patches to our production rather than test environments (one node at a time, 21 in total), and will then provide feedback from there. Our rule remains no more than one node at a time for potentially disruptive changes like these, and preferably no more than one node per day per environment.

In our test environment this panned out on the newest kernel (at the time of mailing this in), and informal time trials (time find /mount/point, with 180m inodes) were quite positive and orders of magnitude better than without the patches (we killed the unpatched run after it had taken about 3x as long as the patched one and was still incomplete).

Kind regards,
Jaco



> David

>> Kind regards,
>> Jaco
>>
>> On 2025/03/15 00:16, Jaco Kroon wrote:
>>> This is a follow-up to the attempt made a while ago.
>>>
>>> Whilst the patch worked, newer kernels have moved from pages to
>>> folios, which gave me the motivation to implement the mechanism based
>>> on the userspace buffer size patch that Miklos supplied.
>>>
>>> That patch works as is, but I note there are changes to components
>>> (overlayfs and exportfs) that I've got very little experience with,
>>> and have not tested specifically here. They do look logical. I've
>>> marked Miklos as the Author: here, and added my own Signed-off-by - I
>>> hope this is correct.
>>>
>>> The second patch in the series implements the changes to fuse's
>>> readdir in order to utilize the first, enabling reading more than one
>>> page of dent structures at a time from userspace. I've included a
>>> strace from before and after this patch in the commit to illustrate
>>> the difference.
>>>
>>> To get the relevant glusterfs performance improvement (which was
>>> mentioned elsewhere in the thread), changes to glusterfs to increase
>>> the number of cached dentries are also required (these are pushed to
>>> github but not yet merged because, similar to this patch, they got
>>> stalled before reaching the "ready for merge" phase even though
>>> they're operational).
>>>
>>> Please advise whether these two patches look good (I've only done
>>> relatively basic testing so far, and they're not running on
>>> production systems yet).
>>>
>>> Kind regards,
>>> Jaco