Re: [PATCH v2] block: ublk: enable zoned storage support

From: Ming Lei
Date: Fri Mar 03 2023 - 06:48:46 EST


On Fri, Mar 03, 2023 at 09:27:58AM +0100, Andreas Hindborg wrote:
>
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>
> > On Thu, Mar 02, 2023 at 02:28:33PM +0100, Andreas Hindborg wrote:
> >>
> >> Ming Lei <ming.lei@xxxxxxxxxx> writes:
> >>
> >> > On Thu, Mar 02, 2023 at 11:07:15AM +0100, Andreas Hindborg wrote:
> >> >>
> >> >> Ming Lei <ming.lei@xxxxxxxxxx> writes:
> >> >>
> >> >> > On Thu, Mar 2, 2023 at 5:02 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> >> >> >>
> >> >> >> On Thu, Mar 02, 2023 at 04:32:21PM +0800, Ming Lei wrote:
> >> >> >> > On Thu, Mar 02, 2023 at 08:31:07AM +0100, Andreas Hindborg wrote:
> >> >> >> > >
> >> >> >>
> >> >> >> ...
> >> >> >>
> >> >> >> > >
> >> >> >> > > I agree about fetching more zones. However, it is no good to fetch up to
> >> >> >> > > a max, since the requested zone report may be less than max. I was
> >> >> >> >
> >> >> >> > Short reads should always be supported, so the interface may need to
> >> >> >> > return how many zones were reported in a single command, please refer to nvme_ns_report_zones().
> >> >> >>
> >> >> >> blk_zone is part of uapi, maybe the short read can be figured out by
> >> >> >> one all-zeroed 'blk_zone'? then no extra uapi data is needed for
> >> >> >> reporting zones.
> >> >> >
> >> >> > oops, we have blk_zone_report data for reporting zones to userspace already,
> >> >> > see blkdev_report_zones_ioctl(), then this way can be re-used for getting zone
> >> >> > report from ublk server too, right?
> >> >>
> >> >> Yes that would be nice. But I did the report_zone command like a read
> >> >> operation, so we are not currently copying any buffers to user space
> >> >> when issuing the command, we just rely on the iod.
> >> >
> >> > What I meant is to reuse the format of blk_zone_report for returning
> >> > multiple 'blk_zone' info in single command.
> >> >
> >> > The only change is that you need to allocate one bigger kernel buffer
> >> > to hold more 'blk_zone' in single report zone request.
> >> >
> >> >> I think it would be
> >> >> better to use the start_sectors and nr_sectors of the iod instead. Then
> >> >> we don't have to copy the blk_zone_report. What do you think?
> >> >
> >> > For IN parameter of report zone command, you still can reuse
> >> > blk_zone_report:
> >> >
> >> > struct blk_zone_report {
> >> > 	__u64 sector;
> >> > 	__u32 nr_zones;
> >> > 	__u32 flags;
> >> > };
> >> >
> >> > Just use the first two 64-bit words of the iod to hold 'blk_zone_report', and
> >> > leave the iod->addr field untouched.
> >>
> >> I see. Would you make the first part of `struct ublksrv_io_desc` a union
> >> for this, or would you just cast it at the use site?
> >
> > oops, you still need iod->op_flags for recognizing the io op, so just
> > start_sector and nr_sectors can be used.
>
> We do not actually need to pass the flags to user space, or back from
> user space to kernel for ublk zone report. They are currently used to
> tell user space if the zone report contains capacity field. We could
> exclude them from the ublk kabi since the zone report will always
> contain capacity? But it might be good to have a flags field for future
> things.
>
> > However, this way isn't good either, because UBLK_IO_OP_DRV_IN is just mapped
> > to the 'report zone' command in your implementation; what if a new pt request
> > is required in the future?
>
> We are currently mapping REQ_OP_* 1:1 to UBLK_IO_OP_*. If we relax
> this, we can have a UBLK_IO_OP_REPORT_ZONES.

The op field takes 8 bits, which is enough to cover all normal block layer OPs
and driver-specific OPs, so I'd suggest this way. ublk device-specific OPs
can start from 32, prefixed with

UBLK_IO_OP_DRV_IN //[32, 96)
or
UBLK_IO_OP_DRV_OUT //[96, 160)

For example, report zones can be defined as:

enum {
	__UBLK_IO_OP_DRV_IN_START = 32,
	UBLK_IO_OP_DRV_IN_REPORT_ZONES = __UBLK_IO_OP_DRV_IN_START,
	__UBLK_IO_OP_DRV_IN_END = 96,

	__UBLK_IO_OP_DRV_OUT_START = __UBLK_IO_OP_DRV_IN_END,
	__UBLK_IO_OP_DRV_OUT_END = 160,
};
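
To make the half-open ranges concrete, here is a minimal userspace sketch of
how an op could be classified. The helper names ublk_op(), ublk_op_is_drv_in()
and ublk_op_is_drv_out() are hypothetical and only for illustration; the sketch
assumes the op sits in the low 8 bits of op_flags and that each END value is
exclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Ranges as proposed in this mail; each END value is exclusive. */
enum {
	__UBLK_IO_OP_DRV_IN_START = 32,
	UBLK_IO_OP_DRV_IN_REPORT_ZONES = __UBLK_IO_OP_DRV_IN_START,
	__UBLK_IO_OP_DRV_IN_END = 96,

	__UBLK_IO_OP_DRV_OUT_START = __UBLK_IO_OP_DRV_IN_END,
	__UBLK_IO_OP_DRV_OUT_END = 160,
};

/* Hypothetical helper: extract the 8-bit op from op_flags. */
static inline uint8_t ublk_op(uint32_t op_flags)
{
	return op_flags & 0xff;
}

/* Hypothetical helper: true for ops in [32, 96). */
static inline bool ublk_op_is_drv_in(uint8_t op)
{
	return op >= __UBLK_IO_OP_DRV_IN_START &&
	       op < __UBLK_IO_OP_DRV_IN_END;
}

/* Hypothetical helper: true for ops in [96, 160). */
static inline bool ublk_op_is_drv_out(uint8_t op)
{
	return op >= __UBLK_IO_OP_DRV_OUT_START &&
	       op < __UBLK_IO_OP_DRV_OUT_END;
}

/* Both driver ranges must fit in the 8-bit op field. */
static_assert(__UBLK_IO_OP_DRV_OUT_END <= 256,
	      "driver ops must fit in 8 bits");
```

With this split the ublk server can tell the data direction of any driver op
from its number alone, without a per-op table.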

For any DRV OP, the iod header (not including ->addr) and the buffer format
can be redefined as a uapi structure.
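
As a rough sketch of what "redefining the iod header" might look like for
report zones: struct ublk_report_zones_desc below is hypothetical, not an
existing uapi structure, and the ublksrv_io_desc layout is reproduced here
only so the example is self-contained. The point is that the per-op view
keeps the same size and leaves op_flags and addr where they are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic descriptor, reproduced here for illustration only. */
struct ublksrv_io_desc {
	uint32_t op_flags;	/* op: bits 0-7, flags: bits 8-31 */
	uint32_t nr_sectors;
	uint64_t start_sector;
	uint64_t addr;		/* buffer address, not reinterpreted */
};

/* Hypothetical per-op view of the same 24 bytes for report zones. */
struct ublk_report_zones_desc {
	uint32_t op_flags;	/* unchanged, still identifies the op */
	uint32_t nr_zones;	/* reuses the nr_sectors slot */
	uint64_t start_sector;	/* first sector to report from */
	uint64_t addr;		/* unchanged */
};

/* The per-op view must overlay the generic descriptor exactly. */
static_assert(sizeof(struct ublk_report_zones_desc) ==
	      sizeof(struct ublksrv_io_desc), "size must match");
static_assert(offsetof(struct ublk_report_zones_desc, op_flags) ==
	      offsetof(struct ublksrv_io_desc, op_flags), "op_flags moved");
static_assert(offsetof(struct ublk_report_zones_desc, addr) ==
	      offsetof(struct ublksrv_io_desc, addr), "addr moved");
```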

What do you think of this way?

>
> >
> > We need to think about how to support ublk pt request in generic way.
>
> Another option is to allow REQ_OP_DRV_IN to pass a buffer to user space.
> Instead of being similar to a read operation, it could be a combination of
> a read and a write operation.

That might be more flexible, but it could add complexity to both the driver
and userspace, so I'd suggest avoiding bidirectional buffers where possible,
while still reserving support for them via UBLK_IO_OP_DRV_IN_OUT*.

Thanks,
Ming