Re: Expense of read_iter

From: Zhongwei Cai
Date: Fri Jan 15 2021 - 04:41:32 EST


On Thu, 14 Jan 2021, Mikulas wrote:

>> I'm working with Mingkai on optimizations for Ext4-dax.
>
> What specific patch are you working on? Please, post it somewhere.

Here is the work-in-progress patch: https://ipads.se.sjtu.edu.cn:1312/opensource/linux/-/tree/ext4-read
It only contains the "read" implementation for Ext4-dax for now; we
will add other optimizations to it later.

> What happens if you use this trick ( https://lkml.org/lkml/2021/1/11/1612 )
> - detect in the "read_iter" method that there is just one segment and
> treat it like a "read" method. I think that it should improve performance
> for your case.

Note that the original Ext4-dax does not implement the "read" method.
Instead, its "read_iter" method calls "dax_iomap_rw", which is provided
by VFS. So, as a baseline for comparison, we first rewrote the
"read_iter" method so that it iterates over struct iov_iter and calls
our "read" method.

Overall time of 2^26 4KB reads:
"read_iter" method with dax_iomap_rw (original) - 36.477s
"read_iter" method wrapping our "read" method - 28.950s
"read_iter" method testing for one entry, as proposed by Mikulas - 27.947s
"read" method - 26.899s

As we mentioned in the previous email (https://lkml.org/lkml/2021/1/12/710),
the overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating it (i.e., new_sync, _copy_mc_to_iter and
iov_iter_init). The second is the DAX I/O mechanism provided by VFS (i.e.,
dax_iomap_rw, iomap_apply and ext4_iomap_begin).

For Ext4-dax, the overhead of dax_iomap_rw is significant
compared to the overhead of struct iov_iter. Although the methods
proposed by Mikulas eliminate most of the iov_iter overhead,
they cannot be applied to Ext4-dax unless we implement an
internal "read" method in Ext4-dax.
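
To illustrate how the "read_iter" variants measured above relate to an
internal "read" method, here is a minimal user-space sketch. It is not
the code from our patch: struct demo_file, demo_read and demo_read_iter
are hypothetical names, and a plain struct iovec array stands in for the
kernel's struct iov_iter. The single-segment test mirrors the idea
Mikulas proposed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a file whose data is directly addressable. */
struct demo_file {
    const char *data;
    size_t size;
};

/* Plain "read": copy up to len bytes starting at *pos into buf. */
static ssize_t demo_read(const struct demo_file *f, char *buf,
                         size_t len, off_t *pos)
{
    assert(f && buf && pos);
    if (*pos < 0)
        return -1;
    if ((size_t)*pos >= f->size)
        return 0; /* at or past end of file */

    size_t avail = f->size - (size_t)*pos;
    if (len > avail)
        len = avail;
    memcpy(buf, f->data + *pos, len);
    *pos += (off_t)len;
    return (ssize_t)len;
}

/* "read_iter": wraps demo_read, with a fast path for one segment. */
static ssize_t demo_read_iter(const struct demo_file *f,
                              const struct iovec *iov, int iovcnt,
                              off_t *pos)
{
    assert(iov && iovcnt > 0);

    /* Single segment: behave exactly like the plain "read" method. */
    if (iovcnt == 1)
        return demo_read(f, iov[0].iov_base, iov[0].iov_len, pos);

    /* General case: call the plain "read" once per segment. */
    ssize_t total = 0;
    for (int i = 0; i < iovcnt; i++) {
        ssize_t n = demo_read(f, iov[i].iov_base, iov[i].iov_len, pos);
        if (n < 0)
            return total ? total : n;
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break; /* short read: end of file reached */
    }
    return total;
}
```

In this sketch the fast path only skips the loop; in the kernel, the
saving would come from not walking struct iov_iter at all.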

For Ext4-dax, there are two possible approaches to optimization:
1) implementing the internal "read" method without the complexity
of iterators and dax_iomap_rw; 2) optimizing how dax_iomap_rw works.
Since dax_iomap_rw requires ext4_iomap_begin, which further involves
the iomap structure and others (e.g., journaling status locks in Ext4),
we think implementing the internal "read" method would be easier.

As for whether the external .read interface in VFS should be retained:
since there is still a performance gap (3.9%) between the "read" method
and the optimized "read_iter" method, we think retaining it is better.
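
The 3.9% figure follows directly from the timings listed above. A small
sketch of the arithmetic, using hypothetical helper names (per_read_ns
and slowdown_pct are for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Average latency of one read in nanoseconds, given the total run time. */
static double per_read_ns(double total_seconds, uint64_t reads)
{
    assert(reads > 0);
    return total_seconds * 1e9 / (double)reads;
}

/* Relative slowdown of t versus baseline, in percent. */
static double slowdown_pct(double t, double baseline)
{
    assert(baseline > 0.0);
    return (t - baseline) / baseline * 100.0;
}
```

For example, slowdown_pct(27.947, 26.899) gives about 3.9, and
per_read_ns(26.899, (uint64_t)1 << 26) gives roughly 400 ns per 4KB read
for the "read" method.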

Thanks,
Zhongwei