Re: Re: dm overlaybd: targets mapping OverlayBD image

From: Du Rui
Date: Fri May 26 2023 - 06:25:58 EST


Hi Mike:

> I appreciate that this work is being done with an eye toward
> containerd "community" and standardization

> it appears that this format of OCI image storage/use is only
> used by Alibaba?

> But you'd do well to explain why the userspace solution isn't
> acceptable.

Yes, overlaybd has its origins in the container community, but this work
(the kernel modules) does *NOT* actually target containers. On-demand lazy
loading of container images involves complex interactions with the image
registry over HTTP(S), and possibly with other transport services (HTTP
proxy, SOCKS5 proxy, P2P, cache, etc.). This is better implemented in
user space and then exported to the kernel as a virtual block device via
something like TCMU or ublk. The user-space implementation of overlaybd
has a very large install base at Alibaba, as well as at some other big
companies, including another major cloud provider. (We'd better not
reveal their names before we get their permission.) We are pleased with
the flexibility of user space, which allows for easy integration into
various systems and environments.

We implemented this kernel module and are trying to contribute it upstream
because we believe it is useful for the device mapper and LVM ecosystem:

(1) dm-overlaybd essentially implements generic, redistributable snapshots
of a block device. This may enable LVM to push/pull individual
snapshots to/from a globally distributed volume repo.

(2) dm-overlaybd is highly efficient. Its index performance doesn't degrade
as the number of snapshots increases. In contrast, qcow2 (dm-qcow2)
does not support efficient external snapshots: it has O(n) overhead in
this case, where n is the number of (backing-file) snapshots.

(3) dm-zfile is an efficient, generic compressed block device. This allows
LVM to support compressed snapshots, saving disk space without
compromising much performance, and possibly even improving performance
in some cases.
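
To make the difference in point (2) concrete, here is a minimal sketch in
Python. It is not the dm-overlaybd code: the extent model and all function
names are hypothetical, and it assumes that the per-layer indexes are
merged once into a single sorted extent list when the device is set up.
A backing chain probes every layer per lookup, while the merged index
needs one binary search regardless of how many layers there are.

```python
from bisect import bisect_right

# Hypothetical model: each layer is a list of (start, length) extents it
# has written. layers[0] is the bottom layer, layers[-1] the top one.

def chained_lookup(layers, block):
    """Backing-chain style: probe layers from top to bottom, O(n) layers."""
    for idx in range(len(layers) - 1, -1, -1):
        for start, length in layers[idx]:
            if start <= block < start + length:
                return idx
    return None  # hole: no layer has written this block

def merge_index(layers):
    """Flatten all layers once into sorted (start, end, layer) extents."""
    points = sorted({p for layer in layers
                     for start, length in layer
                     for p in (start, start + length)})
    merged = []
    for lo, hi in zip(points, points[1:]):
        owner = chained_lookup(layers, lo)  # top-most layer covering [lo, hi)
        if owner is None:
            continue
        if merged and merged[-1][1] == lo and merged[-1][2] == owner:
            merged[-1] = (merged[-1][0], hi, owner)  # coalesce neighbours
        else:
            merged.append((lo, hi, owner))
    return merged

def merged_lookup(merged, block):
    """One binary search over the merged index, independent of layer count."""
    i = bisect_right(merged, (block, float("inf"), float("inf"))) - 1
    if i >= 0 and block < merged[i][1]:
        return merged[i][2]
    return None

# Three layers: a base image and two snapshots on top of it.
layers = [[(0, 100)], [(10, 20)], [(15, 5), (200, 10)]]
merged = merge_index(layers)
print(merged_lookup(merged, 17), chained_lookup(layers, 17))
```

The merge cost is paid once; after that, adding more snapshots grows the
merged list but does not add a per-layer probe to each lookup.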


> I also have doubts that this solution is _actually_ more performant
> than a proper filesystem based solution

This proposal is not focused on performance; it is focused on bringing new
features to dm and LVM, as described above. Still, I advise you to run
benchmarks and see the results. After all, ext4, xfs and other mature file
systems are highly optimized as well.

> solution that allows page cache sharing

Page cache sharing can be realized with DAX support in the dm targets
(and in the inner file system), together with a virtual pmem device
backend.

> There is an active discussion about, and active development effort
> for, using overlayfs + erofs for container images. I'm reluctant to
> merge this DM based container image approach without wider consensus
> from other container stakeholders.

This proposal is intended to help the dm and LVM ecosystem, and is not
related to those file systems. It actually supports all kinds of file
systems with their full capabilities. It is of little use for containers,
as the user-space implementation is more practical there. And there is
nothing preventing the container stakeholders from continuing to discuss
and develop overlayfs, erofs, composefs, etc.