Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

From: Sarthak Kukreti
Date: Fri Sep 16 2022 - 14:48:57 EST


On Thu, Sep 15, 2022 at 11:10 PM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
>
> On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> > From: Sarthak Kukreti <sarthakkukreti@xxxxxxxxxxxx>
> >
> > Hi,
> >
> > This patch series is an RFC of a mechanism to pass through provision requests on stacked thinly provisioned storage devices/filesystems.
> >
> > The Linux kernel provides several mechanisms to set up thinly provisioned block storage abstractions (e.g. dm-thin, loop devices over sparse files), either directly as block devices or as backing storage for filesystems. Currently, short of writing data to either the device or filesystem, there is no way for users to pre-allocate space for use in such storage setups. Consider the following use-cases:
> >
> > 1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that the underlying thinpool metadata is not modified during the suspend mechanism, the dm-thin device needs to be fully provisioned.
> > 2) If a filesystem uses a loop device over a sparse file, fallocate() on the filesystem will allocate blocks for files but the underlying sparse file will remain intact.
> > 3) Another example is a virtual machine using a sparse file/dm-thin as a storage device; by default, allocations within the VM boundaries will not affect the host.
> > 4) Several storage standards support mechanisms for thin provisioning on real hardware devices. For example:
> > a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning: "When the THINP bit in the NSFEAT field of the Identify Namespace data structure is set to ‘1’, the controller ... shall track the number of allocated blocks in the Namespace Utilization field"
> > b. The SCSI Block Commands (SBC-4) reference describes "Thin provisioned logical units",
> > c. The UFS 3.0 spec section 13.3.3 references "Thin provisioning".
>
> When REQ_OP_PROVISION is sent on an already-allocated range of blocks,
> are those blocks zeroed? NVMe Write Zeroes with Deallocate=0 works this
> way, for example. That behavior is counterintuitive since the operation
> name suggests it just affects the logical block's provisioning state,
> not the contents of the blocks.
>
No, the blocks are not zeroed. The current implementation (in the dm
patch) does indeed look at the provisioned state of the logical block
and provisions it only if it is unmapped. If the block is already
allocated, REQ_OP_PROVISION has no effect on the contents of the
block. Similarly, in the file semantics, sending an
FALLOC_FL_PROVISION request for extents that are already mapped should
not affect the contents of those extents.
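The intended semantics can be illustrated with a small toy model (this is an illustrative sketch, not the kernel code; the class and method names are hypothetical): provisioning allocates backing space for unmapped blocks, while already-mapped blocks keep their contents.

```python
# Toy model of REQ_OP_PROVISION semantics on a thinly provisioned device.
# Illustrative only; names are hypothetical, not from the patch series.

BLOCK_SIZE = 512

class ThinDevice:
    def __init__(self, nr_blocks):
        self.nr_blocks = nr_blocks
        self.mapped = {}  # block index -> contents; absent means unmapped

    def write(self, block, data):
        # A write allocates the block and stores its contents.
        self.mapped[block] = data

    def provision(self, start, count):
        # Allocate backing space for [start, start+count) without
        # touching the contents of blocks that are already mapped.
        for b in range(start, start + count):
            if b not in self.mapped:
                # Newly provisioned blocks read back as zeroes.
                self.mapped[b] = b'\0' * BLOCK_SIZE

dev = ThinDevice(8)
dev.write(2, b'data')
dev.provision(0, 4)
assert dev.mapped[2] == b'data'            # existing contents preserved
assert dev.mapped[0] == b'\0' * BLOCK_SIZE # unmapped block now allocated
assert 5 not in dev.mapped                 # blocks outside the range untouched
```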

> > In all of the above situations, currently the only way for pre-allocating space is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not scale well with larger pre-allocation sizes.
>
> What exactly is the issue with WRITE_ZEROES scalability? Are you
> referring to cases where the device doesn't support an efficient
> WRITE_ZEROES command and actually writes blocks filled with zeroes
> instead of updating internal allocation metadata cheaply?
>
Yes. On ChromiumOS, we regularly deal with storage devices that don't
support WRITE_ZEROES or that need to have it disabled via a quirk due
to a bug in the vendor's implementation. Using WRITE_ZEROES for
allocation makes the allocation path quite slow for such devices (not
to mention the effect on storage lifetime), so having a separate
provisioning construct is very appealing. Even for devices that do
support an efficient WRITE_ZEROES implementation but don't support
logical provisioning per se, I suppose that the allocation path might
be a bit faster (the device driver's request queue would report
'max_provision_sectors' = 0 and the request would be short-circuited
there), although I haven't benchmarked the difference.
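The short-circuit described above can be sketched as follows (an illustrative Python model, not the block-layer code; QueueLimits and submit_provision are hypothetical names loosely mirroring the queue-limit check):

```python
# Sketch of short-circuiting a provision request when the device
# reports no provisioning support. Names are hypothetical.

class QueueLimits:
    def __init__(self, max_provision_sectors=0):
        # 0 means the device (driver) does not support provisioning.
        self.max_provision_sectors = max_provision_sectors

def submit_provision(limits, nr_sectors):
    # Complete the request immediately if the device can't provision;
    # otherwise hand it down to the driver.
    if limits.max_provision_sectors == 0:
        return 'completed'  # no device support: finish without any I/O
    return 'issued'

assert submit_provision(QueueLimits(0), 8) == 'completed'
assert submit_provision(QueueLimits(1024), 8) == 'issued'
```

This is why a no-op on unsupported hardware can still be cheaper than a degraded WRITE_ZEROES path: the request never reaches the device.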

Sarthak

> Stefan