Re: [PATCH 04/11] block: Add bio_for_each_thp_segment_all

From: Matthew Wilcox
Date: Tue Sep 01 2020 - 09:16:23 EST


On Tue, Sep 01, 2020 at 06:34:26AM +0100, Christoph Hellwig wrote:
> On Mon, Aug 31, 2020 at 08:48:37PM +0100, Matthew Wilcox wrote:
> > static void iomap_read_end_io(struct bio *bio)
> > {
> > int i, error = blk_status_to_errno(bio->bi_status);
> >
> > for (i = 0; i < bio->bi_vcnt; i++) {
> > struct bio_vec *bvec = &bio->bi_io_vec[i];
>
> This should probably use bio_for_each_bvec_all instead of directly
> poking into the bio. I'd also be tempted to move the loop body into
> a separate helper, but that's just a slight stylistic preference.

Ah, got it.

> > size_t offset = bvec->bv_offset;
> > size_t length = bvec->bv_len;
> > struct page *page = bvec->bv_page;
> >
> > while (length > 0) {
> > size_t count = thp_size(page) - offset;
> >
> > if (count > length)
> > count = length;
> > iomap_read_page_end_io(page, offset, count, error);
> > page += (offset + count) / PAGE_SIZE;
>
> Shouldn't the page_size here be thp_size?

No. Let's suppose we have a 20kB I/O which starts on a page boundary and
the first page is order-2. To get from the first head page to the second
page, we need to add 4, which is 16kB / 4kB, not 16kB / 16kB.

> > Maybe I'm missing something important here, but it's significantly
> > simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes
> > (256 bytes less!) iomap_read_page_end_io is inlined into it both before
> > and after.
>
> Yes, that's exactly why I think avoiding bio_for_each_segment_all is
> a good idea in general.

I took out all the attempts to cope with insane bv_offset to compare like
with like and I got the bio_for_each_thp_segment_all() version down to
656 bytes:

@@ -166,21 +166,15 @@ static inline void bvec_thp_advance(const struct bio_vec *
bvec,
struct bvec_iter_all *iter_all)
{
struct bio_vec *bv = &iter_all->bv;
- unsigned int page_size;

if (iter_all->done) {
bv->bv_page += thp_nr_pages(bv->bv_page);
- page_size = thp_size(bv->bv_page);
bv->bv_offset = 0;
} else {
- bv->bv_page = thp_head(bvec->bv_page +
- (bvec->bv_offset >> PAGE_SHIFT));
- page_size = thp_size(bv->bv_page);
- bv->bv_offset = bvec->bv_offset -
- (bv->bv_page - bvec->bv_page) * PAGE_SIZE;
- BUG_ON(bv->bv_offset >= page_size);
+ bv->bv_page = bvec->bv_page;
+ bv->bv_offset = bvec->bv_offset;
}
- bv->bv_len = min(page_size - bv->bv_offset,
+ bv->bv_len = min_t(unsigned int, thp_size(bv->bv_page) - bv->bv_offset,
bvec->bv_len - iter_all->done);
iter_all->done += bv->bv_len;

> And yes, eventually bv_page and bv_offset should be replaced with a
>
> phys_addr_t bv_phys;
>
> and life would become simpler in many places (and the bvec would
> shrink for most common setups as well).

I'd very much like to see that. It causes quite considerable pain for
our virtualisation people that we need a struct page. They'd like the
hypervisor to not have struct pages for the guest's memory, but if they
don't have them, they can't do I/O to them. Perhaps I'll try getting
one of them to work on this.

I'm not entirely sure the bvec would shrink. On 64-bit systems, it's
currently 8 bytes for the struct page, 4 bytes for the len and 4 bytes
for the offset. Sure, we can get rid of the offset, but the compiler
will just pad the struct from 12 bytes back to 16. On 32-bit systems
with 32-bit phys_addr_t, we go from 12 bytes down to 8, but most 32-bit
systems have a 64-bit phys_addr_t these days, don't they?

> For now I'd end up with something like:
>
> static void iomap_read_end_bvec(struct page *page, size_t offset,
> size_t length, int error)
> {
> while (length > 0) {
> size_t page_size = thp_size(page);
> size_t count = min(page_size - offset, length);
>
> iomap_read_page_end_io(page, offset, count, error);
>
> page += (offset + count) / page_size;
> length -= count;
> offset = 0;
> }
> }
>
> static void iomap_read_end_io(struct bio *bio)
> {
> int i, error = blk_status_to_errno(bio->bi_status);
> struct bio_vec *bvec;
>
> bio_for_each_bvec_all(bvec, bio, i)
> iomap_read_end_bvec(bvec->bv_page, bvec->bv_offset,
> bvec->bv_len, error;
> bio_put(bio);
> }
>
> and maybe even merge iomap_read_page_end_io into iomap_read_end_bvec.

The lines start to get a bit long. Here's what I currently have on
the write side:

@@ -1104,6 +1117,20 @@ iomap_finish_page_writeback(struct inode *inode, struct p
age *page,
end_page_writeback(page);
}

+static void iomap_finish_bvec_write(struct inode *inode, struct page *page,
+ size_t offset, size_t length, int error)
+{
+ while (length > 0) {
+ size_t count = min(thp_size(page) - offset, length);
+
+ iomap_finish_page_writeback(inode, page, error, count);
+
+ page += (offset + count) / PAGE_SIZE;
+ offset = 0;
+ length -= count;
+ }
+}
+
/*
* We're now finished for good with this ioend structure. Update the page
* state, release holds on bios, and finally free up memory. Do not use the
@@ -1121,7 +1148,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)

for (bio = &ioend->io_inline_bio; bio; bio = next) {
struct bio_vec *bv;
- struct bvec_iter_all iter_all;
+ int i;

/*
* For the last bio, bi_private points to the ioend, so we
@@ -1133,9 +1160,9 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
next = bio->bi_private;

/* walk each page on bio, ending page IO on them */
- bio_for_each_thp_segment_all(bv, bio, iter_all)
- iomap_finish_page_writeback(inode, bv->bv_page, error,
- bv->bv_len);
+ bio_for_each_bvec_all(bv, bio, i)
+ iomap_finish_bvec_writeback(inode, bv->bv_page,
+ bv->bv_offset, bv->bv_len, error);
bio_put(bio);
}
/* The ioend has been freed by bio_put() */

That's a bit more boilerplate than I'd like, but if bio_vec is going to
lose its bv_page then I don't see a better way. Unless we come up with
a different page/offset/length struct that bio_vecs are decomposed into.