Re: [PATCH RFC] block: trace: add block alignment information

From: John Garry
Date: Fri Sep 13 2024 - 08:09:30 EST


On 13/09/2024 12:26, Daniel Gomez wrote:
On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
From: Daniel Gomez <da.gomez@xxxxxxxxxxx>

Report block alignment in terms of LBA and size during block tracing for
block_rq. Calculate alignment only for read/writes where the length is
greater than 0. Otherwise, report 0 to indicate no alignment calculated.

Suggested-by: Dave Chinner <dchinner@xxxxxxxxxx>
Signed-off-by: Daniel Gomez <da.gomez@xxxxxxxxxxx>
---
This patch introduces LBA and size alignment information for
the block_rq tracepoints (block_rq{insert, issue, merge} and
block_{io_start, io_done}).

eh, shouldn't this belong in the description of the patch?

Yes. I'll move this to the commit message.


And I still don't know what we mean by alignment in this context.

From looking at the code, it seems to be the max detected block size
granularity. For example, for a 64KB write at a 32KB offset, that would give
a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2

note: I meant "8KB is the largest power-of-2"

which is divisible into 24KB. Is this a correct understanding?

That is correct.

So maybe it's me, but I just find it odd to call this information "alignment". To me, what you are looking for is largest block size granularity.

> Do you think adding examples like yours can help to explain
> this better?
Below the same examples using fio with the trace output:


sudo fio -bs=64k -size=64k -offset=32k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync

sudo fio -bs=64k -size=64k -offset=64k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync

sudo fio -bs=8k -size=8k -offset=64k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync

sudo fio -bs=24k -size=24k -offset=0k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync

fio-789 [000] ..... 4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
fio-801 [000] ..... 4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
fio-813 [000] ..... 4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
fio-825 [000] ..... 4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]


Also, the motivation behind this is explained in the LBS RFC [1] and I should
have included it here for context. I hope [1] and my description below helps to
explain what alignment means and why is needed:

[1] Subject: [RFC 00/23] Enable block size > page size in XFS
https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@xxxxxxxxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$

Tracing alignment information is important for high-capacity and QLC SSDs with
Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
boundaries can imply in a read-modify-write (RMW).

Yes, I get that this might be important to know.


The graph below is a representation of the device IU vs an I/O block aligned/
unaligned.

|--- IU Boundaries ----| |-PS-|
a) [====][====][====][====][····][····][····]--
| |
b) [····][====][====][====][====][····][····]--
| |
c) [====][====][====][====][····][====][====]--

is there meant to be a gap at page index #4?

| |
d) [····][····][====][====][····][····][····]--
| |
LBA 0 4
Key:
[====] = I/O Block
[····] = Memory in Page Size (PS) chunks

a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
b) The size of the I/O matches the IU size but the I/O is not aligned to the
IU boundaries. I/O is unaligned.
c) I/O does not match in either size or LBA. I/O is unaligned.

what about d)? Not aligned to IU, I assume.




The idea of reporting alignment in a tracepoint was first suggested in
this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
tracing tool [2] was developed and used during LBS development, as
mentioned in the patch series [3] and in [1].

With this addition, users can check block alignment directly through the
block layer tracepoints without needing any additional tools.

In case we have a use case, this can be extended to other tracepoints,
such as complete and error.

Another potential enhancement could be the integration of this
information into blktrace. Would that be a feasible option to consider?

[1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@xxxxxxxxxxxxxxxxxxx/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
[2] blkalgn tool written in eBPF/bcc:
https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
[3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@xxxxxxxxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
---
block/blk-mq.c | 29 +++++++++++++++++++++++++++++
include/linux/blk-mq.h | 11 +++++++++++
include/linux/blkdev.h | 6 ++++++
include/trace/events/block.h | 7 +++++--
4 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 831c5cf5d874..714452bc236b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
}
EXPORT_SYMBOL_GPL(blk_rq_poll);
+u32 __blk_rq_lba_algn(struct request *req)

why use "algn", and not "align"?

"algn" is not a natural abbreviation of "alignment".

That's okay with me, changing the var name to a more natural abbreviation.


And why can't userspace figure this out? All the info is available already,
right?

We are interested in the block alignment (LBA and size) at block device driver
level, not userspace level. That is, everything that is going out from the block
layer. Using the block tracing points currently available makes it block-driver
generic.

I am just saying that the information already present in the block trace point can be used to get this "alignment" info, right? And userspace can do the work of reading those trace events to find this "alignment" info.

Thanks,
John