Re: [PATCH 2/2] fuse: Adjust readdir() buffer to requesting buffer size.

From: Jaco Kroon
Date: Mon Mar 31 2025 - 16:44:04 EST


Hi,

On 2025/03/31 18:41, Joanne Koong wrote:
On Fri, Mar 14, 2025 at 3:39 PM Jaco Kroon <jaco@xxxxxxxxx> wrote:
Clamp to min 1 page (4KB) and max 128 pages (512KB).

Glusterfs trial using strace ls -l.

Before:

getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 616
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 624
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 624
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 600
getdents64(3, 0x7f2d7d7a7040 /* 25 entries */, 131072) = 608
getdents64(3, 0x7f2d7d7a7040 /* 1 entries */, 131072) = 24
getdents64(3, 0x7f2d7d7a7040 /* 0 entries */, 131072) = 0

After:

getdents64(3, 0x7ffae8eed040 /* 276 entries */, 131072) = 6696
getdents64(3, 0x7ffae8eed040 /* 0 entries */, 131072) = 0

Signed-off-by: Jaco Kroon <jaco@xxxxxxxxx>
---
fs/fuse/readdir.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index 17ce9636a2b1..a0ccbc84b000 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -337,11 +337,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
struct fuse_mount *fm = get_fuse_mount(inode);
struct fuse_io_args ia = {};
struct fuse_args_pages *ap = &ia.ap;
- struct fuse_folio_desc desc = { .length = PAGE_SIZE };
+ struct fuse_folio_desc desc = { .length = ctx->count };
u64 attr_version = 0, evict_ctr = 0;
bool locked;
+ int order;

- folio = folio_alloc(GFP_KERNEL, 0);
+ if (desc.length < PAGE_SIZE)
+ desc.length = PAGE_SIZE;
+ else if (desc.length > (PAGE_SIZE << 7)) /* 128 pages, typically 512KB */
+ desc.length = PAGE_SIZE << 7;
+
Just wondering, how did 128 pages get decided as the upper bound? It
seems to me to make more sense if the upper bound is fc->max_pages.

Best answer ... random/guess at something which may be sensible.

Also btw, I think you can just use the clamp() helper from
<linux/minmax.h> to do the clamping

Thanks.  I'm not a regular kernel contributor; it's not often that I've got an itch that needs scratching here :).

So something like this then:

	desc.length = clamp(desc.length, PAGE_SIZE, fm->fc->max_pages << CONFIG_PAGE_SHIFT);
	order = get_count_order(desc.length >> CONFIG_PAGE_SHIFT);

Note:  Can't use ctx->count directly in clamp() here due to it being signed, whereas desc.length is unsigned (clamp() wants consistent types).

I'm *assuming* get_count_order() rounds up, so if max_pages is 7 (if a non-power-of-two is even possible) we will really get 8 pages here?

Compile tested only.  Will perform basic run-time test before re-submit.

+ order = get_count_order(desc.length >> CONFIG_PAGE_SHIFT);
+
+ do {
+ folio = folio_alloc(GFP_KERNEL, order);
Folios can now be larger than one page size for readdir requests with
your change but I don't believe the current page copying code in fuse
supports this yet. For example, I think the kmapping will be
insufficient in fuse_copy_page() where in the current code we kmap
only the first page in the folio. I sent a patch for supporting large
folios page copying [1] and am trying to get this merged in but
haven't heard back about this patchset yet. In your local tests that
used multiple pages for the readdir request, did you run into any
issues or it worked fine?

My tests boiled down to running strace as per above, plus some basic time trials using find /path/to/mount/point, with and without the patch, over a fairly large structure containing about 170 million inodes.  No problems observed.  That said, I've done similar before and introduced a major memory leak that under load consumed 100GB of RAM in minutes, which is why I'm looking for a few extra eyeballs on this before going to production (what we have works, it's just on an older kernel).

If further improvements are possible, that would be great, but based on testing this is already at least a 10x improvement in readdir() performance.

[1] https://lore.kernel.org/linux-fsdevel/20250123012448.2479372-2-joannelkoong@xxxxxxxxx/

Took a quick look; I wish I could provide you some feedback, but that's beyond my kernel skill set to just eyeball.

Looks like you're primarily getting rid of the code that references the pages inside the folios and just operating on the folios directly?  A side effect of which (your goal) is to enable larger copies rather than many small ones?

Thank you,
Jaco