Re: [PATCH] btrfs: also add stripe entries for NOCOW writes

From: Qu Wenruo
Date: Mon Sep 23 2024 - 04:54:39 EST




在 2024/9/23 17:45, Johannes Thumshirn 写道:
On 23.09.24 09:56, Qu Wenruo wrote:
在 2024/9/23 16:15, Johannes Thumshirn 写道:
From: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>

NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also allow NOCOW writes for zoned filesystems.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>

Sorry I'm going to repeat myself again, I still believe if we insert an
RST entry at falloc() time, it will be more consistent with the non-RST
code.

Yes, I known preallocated space will not need any read nor search RST
entry, and we just fill the page cache with zero at read time.

But the point of proper (not just dummy) RST entry for the whole
preallocated space is, we do not need to touch the RST entry anymore for
NOCOW/PREALLOCATED write at all.

This makes the RST NOCOW/PREALLOC writes behavior to align with the
non-RST code, which doesn't update any extent item, but only modify the
file extent for PREALLOC writes.

Please re-read the patch. This is not a dummy RST entry but a real RST
entry for NOCOW writes.

I know, but my point is, if the RST entry for preallocated range is
already a regular one, you won't even need to insert/update the RST tree
at all.

Just like we do not need to update the extent tree for
NOCOW/PREALLOCATED writes.

But as long as there is no data on disk there is no point of having a
logical to N-disk physical mapping. We haven't even called
btrfs_map_block() before, so where do we know which disks will get the
blocks at which address? Look at this example:

Fallocate [0, 1M]
virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test
[...]


[1] we preallocate the data for [0, 1M] @ 298844160
[2] we have the actual written data for [64k, 128k] @ 298844160

What should I do to insert the RST entry there as we get:

Do the needed write map and insert the RST entries, just as if we're
doing a write, but without any actual IO.


Currently the handling of RST is not consistent with non-RST, thus
that's the reason causing problems with scrub on preallocated extents.

I known preallocated range won't trigger any read thus it makes no sense
to do the proper RST setup.
But let's also take the example of non-RST scrub:

Do we spend our time checking if a data extent is preallocated or not?
No, we do not. We just read the content, and check against its csum.
For preallocated extents, it just has no csum.

You can argue that this is wasting IO, but I can also counter-argue that
we need to make sure the read on that device range is fine, even if we
know we will not really read the content (but that's only for now).


Furthermore, from the logical aspect, at least to me, non-RST case is
just a special case where logical address is 1:1 mapped already.

This also means, even for preallocated extents, they should have RST
entries.


Finally, I do not think it's a good idea to insert RST entries for NOCOW.
If a file is set NOCOW, it means we'll doing a lot of overwrite for it.
Then why waste our time updating the RST entries again and again?

Isn't such behavior going to cause more write amplification? Meanwhile
for non-RST cases, NOCOW should cause the least amount of write
amplification.



So again, YES, I know doing proper write map and inserting RST entries
for preallocated range makes no sense for READ.
But preallocation and NOCOW is utilized for contents we frequently
over-writes, and such RST entries save us more for those frequently
over-writes.

Thanks,
Qu

[3] the physical copies starting at 298909696 on devid 1 and 277938176
on devid 2