Re: [dm-devel] [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

From: Stefan Hajnoczi
Date: Mon Sep 19 2022 - 12:37:08 EST


On Sat, Sep 17, 2022 at 12:46:33PM -0700, Sarthak Kukreti wrote:
> On Fri, Sep 16, 2022 at 8:03 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> >
> > On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> > > From: Sarthak Kukreti <sarthakkukreti@xxxxxxxxxxxx>
> > >
> > > Hi,
> > >
> > > This patch series is an RFC of a mechanism to pass through provision
> > > requests on stacked thinly provisioned storage devices/filesystems.
> >
> > [Reflowed text]
> >
> > > The Linux kernel provides several mechanisms to set up thinly
> > > provisioned block storage abstractions (e.g. dm-thin, loop devices over
> > > sparse files), either directly as block devices or backing storage for
> > > filesystems. Currently, short of writing data to either the device or
> > > filesystem, there is no way for users to pre-allocate space for use in
> > > such storage setups. Consider the following use-cases:
> > >
> > > 1) Suspend-to-disk and resume from a dm-thin device: In order to
> > > ensure that the underlying thinpool metadata is not modified during
> > > the suspend mechanism, the dm-thin device needs to be fully
> > > provisioned.
> > > 2) If a filesystem uses a loop device over a sparse file, fallocate()
> > > on the filesystem will allocate blocks for files but the underlying
> > > sparse file will remain intact.
> > > 3) Another example is a virtual machine using a sparse file/dm-thin as a
> > > storage device; by default, allocations within the VM boundaries will
> > > not affect the host.
> > > 4) Several storage standards support mechanisms for thin provisioning
> > > on real hardware devices. For example:
> > > a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin
> > > provisioning: "When the THINP bit in the NSFEAT field of the
> > > Identify Namespace data structure is set to ‘1’, the controller ...
> > > shall track the number of allocated blocks in the Namespace
> > > Utilization field"
> > > b. The SCSI Block Commands reference - 4 section references "Thin
> > > provisioned logical units",
> > > c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".
> > >
> > > In all of the above situations, the only way to currently
> > > pre-allocate space is to issue writes (or use
> > > WRITE_ZEROES/WRITE_SAME). However, that does not scale well with
> > > larger pre-allocation sizes.
> > >
> > > This patchset introduces primitives to support block-level
> > > provisioning (note: the term 'provisioning' is used to prevent
> > > overloading the term 'allocations/pre-allocations') requests across
> > > filesystems and block devices. This allows fallocate() and file
> > > creation requests to reserve space across stacked layers of block
> > > devices and filesystems. Currently, the patchset covers a prototype on
> > > the device-mapper targets, loop device and ext4, but the same
> > > mechanism can be extended to other filesystems/block devices as well
> > > as extended for use with devices in 4 a-c.
> >
> > If you call REQ_OP_PROVISION on an unmapped LBA range of a block device
> > and then try to read the provisioned blocks, what do you get? Zeroes?
> > Random stale disk contents?
> >
> > I think I saw elsewhere in the thread that any mapped LBAs within the
> > provisioning range are left alone (i.e. not zeroed) so I'll proceed on
> > that basis.
> >
> For block devices, I'd say it's definitely possible to get stale data, depending
> on the implementation of the allocation layer; for example, with dm-thinpool,
> the default setting when using LVM2 tools is to zero out blocks on allocation.
> But that's configurable and can be turned off to improve performance.
>
> Similarly, for actual devices that end up supporting thin provisioning, unless
> the specification absolutely mandates that an LBA contains zeroes post
> allocation, some implementations will definitely miss out on that (probably
> similar to the semantics of discard_zeroes_data today). I'm operating under
> the assumption that it's possible to get stale data from LBAs allocated using
> provision requests at the block layer and trying to see if we can create a
> safe default operating model from that.

Please explain the semantics of REQ_OP_PROVISION in the
code/documentation in the next revision.

Thanks,
Stefan
