[PATCH v4 00/13] iov_iter: Convert the iterator macros into inline funcs

From: David Howells
Date: Wed Sep 13 2023 - 12:57:49 EST


Hi Al, Linus,

Here's my latest go at converting the iov_iter iteration macros to be
inline functions. The first four functions are the main part of the
series:

(1) To the recently added iov_iter kunit tests add three benchmarking
tests that test copying 256MiB from one buffer to another using three
different methods (single-BVEC over the whole buffer, BVEC's allocated
dynamically for 256-page chunks and whole-buffer XARRAY). The results
are dumped to dmesg. No setting up is required with null blockdevs or
anything like that.

(2) Renumber the type enum so that the ITER_* constants match the order in
iterate_and_advance*().

(3) Since (2) puts UBUF and IOVEC at 0 and 1, change user_backed_iter() to
just use the type value and get rid of the extra flag.

(4) Converts the iov_iter iteration macros to always-inline functions to
make the code easier to follow. It uses function pointers, but they
get optimised away. The priv2 argument likewise gets optimised away
if unused.

The rest of the patches are some options for consideration:

(5) Move the iov_iter iteration macros to a header file so that bespoke
iterators can be created elsewhere. For instance, rbd has an
optimisation that requires it to scan to the buffer it is given to see
if it is all zeros. It would be nice if this could use
iterate_and_advance() - but that's currently buried inside
lib/iov_iter.c.

(6) On top of (5), provide a cut-down iteration function that can only
handle kernel-backed iterators (ie. BVEC, KVEC, XARRAY and DISCARD)
for situations where we know that we won't see UBUF/IOVEC.

(7-8) Make copy_to/from_iter() always catch MCE and return a short copy.
This doesn't particularly increase the code size as the handling works
via exception handling tables. That said, there may be code that
doesn't check to result of the copy that could be adversely affected.
If we go with this, it might be worth having an 'MCE happened' flag in
the iterator or something by which this can be checked for.

[?] Is it better to kill the thread than returning a short copy if an
MCE occurs?
[?] Is it better to manually select MCE handling?

(9) On top of (5), move the copy-and-csum code to net/ where it can be in
proximity with the code that uses it. This eliminates the code if
CONFIG_NET=n and allows for the slim possibility of it being inlined.

(10) On top of (9), fold memcpy_and_csum() in to its two users.

(11) On top of (9), move csum_and_copy_from_iter_full() out of line and
merge in csum_and_copy_from_iter() since the former is the only caller
of the latter.

(12) Move hash_and_copy_to_iter() to net/ where it can be with its only
caller.

(13) Add a testing misc device for testing/benchmarking ITER_UBUF and
ITER_IOVEC devices. It is driven by read/write/readv/writev and the
results dumped through a tracepoint.

Further changes I could make:

(1) Add an 'ITER_OTHER' type and an ops table pointer and have
iterate_and_advance2(), iov_iter_advance(), iov_iter_revert(),
etc. jump through it if it sees ITER_OTHER type. This would allow
types for, say, scatterlist, bio list, skbuff to be added without
further expanding the core.

(2) Move the ITER_XARRAY type to being an ITER_OTHER type. This would
shrink the core iterators quite a lot and reduce the stack usage as
the xarray walking stuff wouldn't be there.

Anyway, the changes in compiled function size either side of patch (4) on
x86_64 look like:

_copy_from_iter inc 0x360 -> 0x3d5 +0x75
_copy_from_iter_flushcache inc 0x34c -> 0x358 +0xc
_copy_from_iter_nocache dcr 0x354 -> 0x346 -0xe
_copy_mc_to_iter inc 0x396 -> 0x3cf +0x39
_copy_to_iter inc 0x33b -> 0x35d +0x22
copy_page_from_iter_atomic.part.0 inc 0x393 -> 0x408 +0x75
copy_page_to_iter_nofault.part.0 dcr 0x3de -> 0x3b2 -0x2c
copyin del 0x30
copyout del 0x2d
copyout_mc del 0x2b
csum_and_copy_from_iter inc 0x3db -> 0x3f4 +0x19
csum_and_copy_to_iter dcr 0x45d -> 0x45b -0x2
iov_iter_zero dcr 0x34a -> 0x342 -0x8
memcpy_from_iter.isra.0 del 0x1f
memcpy_from_iter_mc new 0x2b

Note that there's a noticeable expansion on some of the main functions
because a number of the helpers get inlined instead of being called.

In terms of benchmarking patch (4), three runs without it:

# iov_kunit_benchmark_bvec: avg 3175 uS
# iov_kunit_benchmark_bvec_split: avg 3404 uS
# iov_kunit_benchmark_xarray: avg 3611 uS
# iov_kunit_benchmark_bvec: avg 3175 uS
# iov_kunit_benchmark_bvec_split: avg 3403 uS
# iov_kunit_benchmark_xarray: avg 3611 uS
# iov_kunit_benchmark_bvec: avg 3172 uS
# iov_kunit_benchmark_bvec_split: avg 3401 uS
# iov_kunit_benchmark_xarray: avg 3614 uS

and three runs with it:

# iov_kunit_benchmark_bvec: avg 3141 uS
# iov_kunit_benchmark_bvec_split: avg 3405 uS
# iov_kunit_benchmark_xarray: avg 3546 uS
# iov_kunit_benchmark_bvec: avg 3140 uS
# iov_kunit_benchmark_bvec_split: avg 3405 uS
# iov_kunit_benchmark_xarray: avg 3546 uS
# iov_kunit_benchmark_bvec: avg 3138 uS
# iov_kunit_benchmark_bvec_split: avg 3401 uS
# iov_kunit_benchmark_xarray: avg 3542 uS

It looks like patch (4) makes things a little bit faster, probably due to
the extra inlining.

I've pushed the patches here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-cleanup

David

Changes
=======
ver #4)
- Fix iterate_bvec() and iterate_kvec() to update iter->bvec and
iter->kvec after subtracting it to calculate iter->nr_segs.
- Change iterate_xarray() to use start+progress rather than increasing
start to reduce code size.
- Added patches to move some iteration functions over to net/ as the files
there can #include the iterator framework.
- Added a patch to benchmark the iteration.

ver #3)
- Use min_t(size_t,) not min() to avoid a warning on Hexagon.
- Inline all the step functions.
- Added a patch to better handle copy_mc.

ver #2)
- Rebased on top of Willy's changes in linux-next.
- Change the checksum argument to the iteration functions to be a general
void* and use it to pass iter->copy_mc flag to memcpy_from_iter_mc() to
avoid using a function pointer.
- Arrange the end of the iterate_*() functions to look the same to give
the optimiser the best chance.
- Make iterate_and_advance() a wrapper around iterate_and_advance2().
- Adjust iterate_and_advance2() to use if-else-if-else-if-else rather than
switch(), to put ITER_BVEC before KVEC and to mark UBUF and IOVEC as
likely().
- Move "iter->count += progress" into iterate_and_advance2() from the
iterate functions.
- Mark a number of the iterator helpers with __always_inline.
- Fix _copy_from_iter_flushcache() to use memcpy_from_iter_flushcache()
not memcpy_from_iter().

Link: https://lore.kernel.org/r/3710261.1691764329@xxxxxxxxxxxxxxxxxxxxxx/ # v1
Link: https://lore.kernel.org/r/855.1692047347@xxxxxxxxxxxxxxxxxxxxxx/ # v2
Link: https://lore.kernel.org/r/20230816120741.534415-1-dhowells@xxxxxxxxxx/ # v3

David Howells (13):
iov_iter: Add a benchmarking kunit test
iov_iter: Renumber ITER_* constants
iov_iter: Derive user-backedness from the iterator type
iov_iter: Convert iterate*() to inline funcs
iov: Move iterator functions to a header file
iov_iter: Add a kernel-type iterator-only iteration function
iov_iter: Make copy_from_iter() always handle MCE
iov_iter: Remove the copy_mc flag and associated functions
iov_iter, net: Move csum_and_copy_to/from_iter() to net/
iov_iter, net: Fold in csum_and_memcpy()
iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
iov_iter, net: Move hash_and_copy_to_iter() to net/
iov_iter: Create a fake device to allow iov_iter testing/benchmarking

arch/x86/include/asm/mce.h | 23 ++
fs/coredump.c | 1 -
include/linux/iov_iter.h | 296 +++++++++++++++++++++++++
include/linux/skbuff.h | 3 +
include/linux/uio.h | 45 +---
lib/Kconfig.debug | 8 +
lib/Makefile | 1 +
lib/iov_iter.c | 429 +++++++++++--------------------------
lib/kunit_iov_iter.c | 181 ++++++++++++++++
lib/test_iov_iter.c | 134 ++++++++++++
lib/test_iov_iter_trace.h | 80 +++++++
net/core/datagram.c | 75 ++++++-
net/core/skbuff.c | 40 ++++
13 files changed, 966 insertions(+), 350 deletions(-)
create mode 100644 include/linux/iov_iter.h
create mode 100644 lib/test_iov_iter.c
create mode 100644 lib/test_iov_iter_trace.h