Re: [PATCH] block: optimise bvec_iter_advance()

From: Jens Axboe
Date: Sat Nov 30 2019 - 15:56:17 EST


On 11/30/19 12:11 PM, Pavel Begunkov wrote:
On 30/11/2019 21:57, Jens Axboe wrote:
On 11/30/19 10:56 AM, Arvind Sankar wrote:
On Sat, Nov 30, 2019 at 12:22:27PM +0300, Pavel Begunkov wrote:
On 30/11/2019 02:24, Arvind Sankar wrote:
On Sat, Nov 30, 2019 at 01:47:16AM +0300, Pavel Begunkov wrote:
On 30/11/2019 01:17, Arvind Sankar wrote:

The loop can be simplified a bit further, as done has to be 0 once we go
beyond the current bio_vec. See below for the simplified version.


Thanks for the suggestion! I thought about it, and decided to not
for several reasons. I prefer to not fine-tune and give compilers
more opportunity to do their job. And it's already fast enough with
modern architectures (MOVcc, complex addressing, etc).

Also need to consider code clarity and the fact, that this is inline,
so should be brief and register-friendly.


It should be more register-friendly, as it uses fewer variables, and I
think it's easier to see what the loop is doing, i.e. that we advance
one bio_vec per iteration: in the existing code, it takes a bit of
thinking to see that we won't spend more than one iteration within the
same bio_vec.

Yeah, may be. It's more the matter of preference then. I don't think
it's simpler, and performance is entirely depends on a compiler and
input. But, that's rather subjective and IMHO not worth of time.

Anyway, thanks for thinking this through!


You don't find listing 1 simpler than listing 2? It does save one
register, as it doesn't have to keep track of done independently from
bytes. This is always going to be the case unless the compiler can
eliminate done by transforming Listing 2 into Listing 1. Unfortunately,
even if it gets much smarter, it's unlikely to be able to do that,
because they're equivalent only if there is no overflow, so it would
need to know that bytes + iter->bi_bvec_done cannot overflow, and that
iter->bi_bvec_done must be smaller than cur->bv_len initially.

Listing 1:

bytes += iter->bi_bvec_done;
while (bytes) {
const struct bio_vec *cur = bv + idx;

if (bytes < cur->bv_len)
break;
bytes -= cur->bv_len;
idx++;
}

iter->bi_idx = idx;
iter->bi_bvec_done = bytes;

Listing 2:

while (bytes) {
const struct bio_vec *cur = bv + idx;
unsigned int len = min(bytes, cur->bv_len - done);

bytes -= len;
done += len;
if (done == cur->bv_len) {
idx++;
done = 0;
}
}

iter->bi_idx = idx;
iter->bi_bvec_done = done;

Have yet to take a closer look (and benchmark) and the patches and
the generated code, but fwiw I do agree that case #1 is easier to
read.

Ok, ok, I'm not keen on bike-shedding. I'll resend a simplified version

Sweet thanks. Make sure it's green.

--
Jens Axboe