[performance] fuse: No Significant Performance Improvement with Passthrough Enabled?

From: abushwang
Date: Wed Nov 27 2024 - 23:04:51 EST


I recently learned that FUSE has gained passthrough support, which is reported to improve performance significantly, as discussed in this LWN article: [LWN.net](https://lwn.net/Articles/832430/).

I plan to build some upper-layer applications on top of this feature. However, in my testing, read performance with passthrough enabled seems to be worse than without it for small files. My test program is here:
https://github.com/wswsmao/fuse-performance/blob/main/file_access_test.c
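
For convenience, the core of the test is roughly the measurement loop below (a simplified sketch of the linked file_access_test.c; names and details here are illustrative, not a verbatim copy):

```
/* Simplified sketch of the measurement loop in file_access_test.c
 * (illustrative; see the linked repository for the real code). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Read the whole file with the given buffer size. In random mode, each
 * read targets a random buffer-aligned offset instead. Note that the
 * page cache should be dropped between runs for meaningful numbers. */
static void run_test(const char *path, size_t bufsize, int random_mode)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    long nreads = lseek(fd, 0, SEEK_END) / bufsize;
    char *buf = malloc(bufsize);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < nreads; i++) {
        off_t off = (random_mode ? rand() % nreads : i) * (off_t)bufsize;
        pread(fd, buf, bufsize, off);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%-10s buf=%-6zu time=%.2fms reads=%ld\n",
           random_mode ? "random" : "sequential",
           bufsize, elapsed_ms(t0, t1), nreads);
    free(buf);
    close(fd);
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    size_t sizes[] = { 4096, 131072, 262144 };
    for (int i = 0; i < 3; i++) {
        run_test(argv[1], sizes[i], 0);   /* sequential */
        run_test(argv[1], sizes[i], 1);   /* random */
    }
    return 0;
}
```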

I generated the 1M, 500M, and 1000M files read by the test program with the following script:
https://github.com/wswsmao/fuse-performance/blob/main/generate_large_files.sh

### Test Environment Information:

```
$ uname -r
6.11.5-200.fc40.x86_64
```

```
$ mount
/dev/vda1 on / type ext4 (rw,noatime)
...
```

### Testing Steps:

I cloned the latest libfuse code from upstream and built the passthrough_hp example.

The current passthrough_hp enables passthrough by default (a sketch of my understanding of how this works follows the listing below). So, for the passthrough case, I ran:

```
ls -lh source_dir/
total 1.5G
-rw-r--r-- 1 root root  1.0M Nov 28 02:45 sequential_file_1
-rw-r--r-- 1 root root  500M Nov 28 02:45 sequential_file_2
-rw-r--r-- 1 root root 1000M Nov 28 02:45 sequential_file_3

./latest_passthrough_hp source_dir/ mount_point/
```
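
For context, my understanding of how passthrough_hp enables passthrough (paraphrased from example/passthrough_hp.cc and simplified to C, so details may differ from upstream; `inode_path()` is a hypothetical stand-in for the example's inode-to-path bookkeeping) is that the open handler registers the backing fd with the kernel and returns a backing id in the open reply, after which reads and writes bypass the daemon entirely:

```
/* Sketch of passthrough_hp's open handler (paraphrased, not verbatim).
 * inode_path() stands in for the example's inode-to-path bookkeeping. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <fuse_lowlevel.h>

static void sfs_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
{
    int fd = open(inode_path(ino), fi->flags & ~O_NOFOLLOW);
    if (fd < 0) {
        fuse_reply_err(req, errno);
        return;
    }
    fi->fh = fd;

    /* Register the backing file with the kernel (this wraps the
     * FUSE_DEV_IOC_BACKING_OPEN ioctl); the backing id goes out in the
     * open reply, and subsequent read/write on this file is served
     * directly from the backing inode without waking the daemon. */
    fi->backing_id = fuse_passthrough_open(req, fd);
    if (fi->backing_id <= 0)
        fprintf(stderr, "passthrough_open failed, falling back\n");

    fuse_reply_open(req, fi);
}
```

With --nopassthrough this registration is skipped, and every FUSE_READ request goes through the daemon's read handler as usual.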

For testing without passthrough, I used the following command:

```
./latest_passthrough_hp source_dir/ mount_point/ --nopassthrough
```

Then I ran the test program against the files under mount_point/.


While debugging the 1M-file case with a 4K read buffer, I added print statements to the FUSE daemon's read handler. In non-passthrough mode I observed 11 read requests arriving at the daemon, with a maximum request size of 131072 bytes, and ftrace captured 11 fuse_readahead operations for the same run; the kernel is clearly batching the 4K userspace reads into larger requests. In passthrough mode, however, even after enlarging the read-ahead window of the underlying ext4 device with `blockdev --setra $num /dev/vda1`, the performance improvement was not significant.
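
For reference, the instrumentation was just a counter plus a print in the daemon's read handler, along the lines of the sketch below (modelled on passthrough_hp's sfs_read; the counter variables are mine, not upstream code):

```
/* Count FUSE_READ requests reaching the daemon and track the largest
 * request size. In non-passthrough mode this fires for every request
 * (11 times for the 1M file, max size 131072, i.e. the kernel batches
 * the 4K userspace reads); in passthrough mode it should never fire,
 * since reads are served directly from the backing file. */
#include <stdio.h>
#include <fuse_lowlevel.h>

static unsigned long read_calls;
static size_t max_read_size;

static void sfs_read(fuse_req_t req, fuse_ino_t ino, size_t size,
                     off_t off, struct fuse_file_info *fi)
{
    (void)ino;
    read_calls++;
    if (size > max_read_size)
        max_read_size = size;
    fprintf(stderr, "read #%lu: size=%zu off=%lld (max %zu)\n",
            read_calls, size, (long long)off, max_read_size);

    /* Reply by splicing from the backing fd, as passthrough_hp does. */
    struct fuse_bufvec buf = FUSE_BUFVEC_INIT(size);
    buf.buf[0].flags = FUSE_BUF_IS_FD | FUSE_BUF_FD_SEEK;
    buf.buf[0].fd = fi->fh;
    buf.buf[0].pos = off;
    fuse_reply_data(req, &buf, FUSE_BUF_SPLICE_MOVE);
}
```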

I would like to understand why, in this case, passthrough seems to perform worse than the non-passthrough path.

Thank you for your assistance.

Best regards,

Abushwang

Attached is my test report for your reference.

## Without Passthrough

### Size = 1.0M

| Mode       | Buffer Size (bytes) | Time (ms) | Read Calls |
| ---------- | ------------------- | --------- | ---------- |
| sequential | 4096        | 7.99      | 256        |
| sequential | 131072      | 6.46      | 8          |
| sequential | 262144      | 7.52      | 4          |
| random     | 4096        | 51.40     | 256        |
| random     | 131072      | 10.62     | 8          |
| random     | 262144      | 8.69      | 4          |


### Size = 500M

| Mode       | Buffer Size (bytes) | Time (ms) | Read Calls |
| ---------- | ------------------- | --------- | ---------- |
| sequential | 4096        | 3662.68   | 128000     |
| sequential | 131072      | 3399.55   | 4000       |
| sequential | 262144      | 3565.99   | 2000       |
| random     | 4096        | 28444.48  | 128000     |
| random     | 131072      | 5012.85   | 4000       |
| random     | 262144      | 3636.87   | 2000       |

### Size = 1000M

| Mode       | Buffer Size (bytes) | Time (ms) | Read Calls |
| ---------- | ------------------- | --------- | ---------- |
| sequential | 4096        | 8164.34   | 256000     |
| sequential | 131072      | 7704.75   | 8000       |
| sequential | 262144      | 7970.08   | 4000       |
| random     | 4096        | 57275.82  | 256000     |
| random     | 131072      | 10311.90  | 8000       |
| random     | 262144      | 7839.20   | 4000       |


## With Passthrough

### Size = 1.0M

| Mode       | Buffer Size (bytes) | Time (ms) | Read Calls |
| ---------- | ------------------- | --------- | ---------- |
| sequential | 4096        | 8.50      | 256        |
| sequential | 131072      | 7.54      | 8          |
| sequential | 262144      | 8.71      | 4          |
| random     | 4096        | 52.16     | 256        |
| random     | 131072      | 9.10      | 8          |
| random     | 262144      | 9.54      | 4          |


### Size = 500M

| Mode       | Buffer Size (bytes) | Time (ms) | Read Calls |
| ---------- | ------------------- | --------- | ---------- |
| sequential | 4096        | 3320.70   | 128000     |
| sequential | 131072      | 3234.08   | 4000       |
| sequential | 262144      | 2881.98   | 2000       |
| random     | 4096        | 28457.52  | 128000     |
| random     | 131072      | 4558.78   | 4000       |
| random     | 262144      | 3476.05   | 2000       |


### Size = 1000M

| Mode       | Buffer Size (bytes) | Time (ms) | Read Calls |
| ---------- | ------------------- | --------- | ---------- |
| sequential | 4096        | 6842.04   | 256000     |
| sequential | 131072      | 6677.01   | 8000       |
| sequential | 262144      | 6268.29   | 4000       |
| random     | 4096        | 58478.65  | 256000     |
| random     | 131072      | 9435.85   | 8000       |
| random     | 262144      | 7031.16   | 4000       |