Re: [PATCH RFC] block: trace: add block alignment information

From: Daniel Gomez
Date: Fri Sep 13 2024 - 10:01:03 EST


On Fri, Sep 13, 2024 at 01:08:34PM +0100, John Garry wrote:
> On 13/09/2024 12:26, Daniel Gomez wrote:
> > On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
> > > On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
> > > > From: Daniel Gomez <da.gomez@xxxxxxxxxxx>
> > > >
> > > > Report block alignment in terms of LBA and size during block tracing for
> > > > block_rq. Calculate alignment only for read/writes where the length is
> > > > greater than 0. Otherwise, report 0 to indicate no alignment calculated.
> > > >
> > > > Suggested-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > Signed-off-by: Daniel Gomez <da.gomez@xxxxxxxxxxx>
> > > > ---
> > > > This patch introduces LBA and size alignment information for
> > > > the block_rq tracepoints (block_rq{insert, issue, merge} and
> > > > block_{io_start, io_done}).
> > >
> > > eh, shouldn't this belong in the description of the patch?
> >
> > Yes. I'll move this to the commit message.
> >
> > >
> > > And I still don't know what we mean by alignment in this context.
> > >
> > > From looking at the code, it seems to be the max detected block size
> > > granularity. For example, for a 64KB write at a 32KB offset, that would give
> > > a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
> > > alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
> > > 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2
>
> note: I meant "8KB is the largest power-of-2"

8KB will be the largest unit at what a device can operate at, for that
particular I/O.

>
> > > which is divisible into 24KB. Is this a correct understanding?
> >
> > That is correct.
>
> So maybe it's me, but I just find it odd to call this information
> "alignment". To me, what you are looking for is largest block size
> granularity.

More suggestions are welcome. What about just I/O granularity? Does the term
imply LBA and size?

>
> > Do you think adding examples like yours can help to explain
> > this better?
> > Below the same examples using fio with the trace output:
> >
> >
> > sudo fio -bs=64k -size=64k -offset=32k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > sudo fio -bs=64k -size=64k -offset=64k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > sudo fio -bs=8k -size=8k -offset=64k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > sudo fio -bs=24k -size=24k -offset=0k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > fio-789 [000] ..... 4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
> > fio-801 [000] ..... 4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
> > fio-813 [000] ..... 4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
> > fio-825 [000] ..... 4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]
> >
> >
> > Also, the motivation behind this is explained in the LBS RFC [1] and I should
> > have included it here for context. I hope [1] and my description below helps to
> > explain what alignment means and why is needed:
> >
> > [1] Subject: [RFC 00/23] Enable block size > page size in XFS
> > https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@xxxxxxxxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$
> >
> > Tracing alignment information is important for high-capacity and QLC SSDs with
> > Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
> > Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
> > boundaries can imply in a read-modify-write (RMW).
>
> Yes, I get that this might be important to know.
>
> >
> > The graph below is a representation of the device IU vs an I/O block aligned/
> > unaligned.
> >
> > |--- IU Boundaries ----| |-PS-|
> > a) [====][====][====][====][····][····][····]--
> > | |
> > b) [····][====][====][====][====][····][····]--
> > | |
> > c) [====][====][====][====][····][====][====]--
>
> is there meant to be a gap at page index #4?

Sorry, that's a copy+paste error. c) can be ignored.

>
> > | |
> > d) [····][····][====][====][····][····][····]--

d) is c)


> > | |
> > LBA 0 4
> > Key:
> > [====] = I/O Block
> > [····] = Memory in Page Size (PS) chunks
> >
> > a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
> > b) The size of the I/O matches the IU size but the I/O is not aligned to the
> > IU boundaries. I/O is unaligned.
> > c) I/O does not match in either size or LBA. I/O is unaligned.
>
> what about d)? Not aligned to IU, I assume.

Yes, c) description is meant for d).

So for clarity, the correct graph is:

|--- IU Boundaries ----| |-PS-|
a) [====][====][====][====][····][····][····]--
| |
b) [····][====][====][====][====][····][····]--
| |
c) [····][····][====][====][····][····][····]--
| |
LBA 0 4

a) I/O matches IU boundaries (LBA and block size). I/O is aligned to IU boundaries.
b) The size of the I/O matches the IU size but the I/O is not aligned to the
IU boundaries. I/O is unaligned.
c) I/O does not match in either size or LBA. I/O is unaligned.

Using I/O granularity term:
a) 16k I/O granularity
b) 4k I/O granularity
c) 8k I/O granularity

>
> >
> > >
> > > >
> > > > The idea of reporting alignment in a tracepoint was first suggested in
> > > > this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
> > > > tracing tool [2] was developed and used during LBS development, as
> > > > mentioned in the patch series [3] and in [1].
> > > >
> > > > With this addition, users can check block alignment directly through the
> > > > block layer tracepoints without needing any additional tools.
> > > >
> > > > In case we have a use case, this can be extended to other tracepoints,
> > > > such as complete and error.
> > > >
> > > > Another potential enhancement could be the integration of this
> > > > information into blktrace. Would that be a feasible option to consider?
> > > >
> > > > [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@xxxxxxxxxxxxxxxxxxx/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
> > > > [2] blkalgn tool written in eBPF/bcc:
> > > > https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
> > > > [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@xxxxxxxxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
> > > > ---
> > > > block/blk-mq.c | 29 +++++++++++++++++++++++++++++
> > > > include/linux/blk-mq.h | 11 +++++++++++
> > > > include/linux/blkdev.h | 6 ++++++
> > > > include/trace/events/block.h | 7 +++++--
> > > > 4 files changed, 51 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 831c5cf5d874..714452bc236b 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> > > > }
> > > > EXPORT_SYMBOL_GPL(blk_rq_poll);
> > > > +u32 __blk_rq_lba_algn(struct request *req)
> > >
> > > why use "algn", and not "align"?
> > >
> > > "algn" is not a natural abbreviation of "alignment".
> >
> > That's okay with me, changing the var name to a more natural abbreviation.
> >
> > >
> > > And why can't userspace figure this out? All the info is available already,
> > > right?
> >
> > We are interested in the block alignment (LBA and size) at block device driver
> > level, not userspace level. That is, everything that is going out from the block
> > layer. Using the block tracing points currently available makes it block-driver
> > generic.
>
> I am just saying that the information already present in the block trace
> point can be used to get this "alignment" info, right? And userspace can do
> the work of reading those trace events to find this "alignment" info.

So, maybe this is better to have integrated in blktrace tool?

>
> Thanks,
> John
>