On Tue, May 23, 2023 at 7:29 PM Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
On Fri, May 19 2023 at 6:27P -0400,
Du Rui <durui@xxxxxxxxxxxxxxxxx> wrote:
OverlayBD is a novel layering block-level image format, which is designed
for containers and secure containers and is applicable to virtual machines,
published in USENIX ATC '20
https://www.usenix.org/system/files/atc20-li-huiba.pdf
OverlayBD already has a userspace implementation as a containerd non-core
sub-project, an accelerated container image service
https://github.com/containerd/accelerated-container-image
It could be much more efficient to do the decompression and mapping work
in the kernel, within the device-mapper framework, in many circumstances,
such as secure container runtimes, mobile devices, etc.
This patch contains a module, dm-overlaybd, which provides two kinds of
targets, dm-zfile and dm-lsmt, to expose a group of block devices containing
an OverlayBD image as an overlaid read-only block device.
Signed-off-by: Du Rui <durui@xxxxxxxxxxxxxxxxx>
<snip, original patch here: [1] >
A long long time ago I wrote a docker container image based on
dm-snapshot that is vaguely similar to this one. It is still
available, but nobody really uses it. It has several weaknesses. First
of all the container image is an actual filesystem, so you need to
pre-allocate a fixed max size for images at construction time.
Secondly, all the lvm volume changes and mounts during runtime caused
weird behaviour (especially at scale) that was painful to manage (just
search the docker issue tracker for devmapper backend). In the end
everyone moved to a filesystem based implementation (overlayfs based).
I appreciate that this work is being done with an eye toward
containerd "community" and standardization but based on my limited
research it appears that this format of OCI image storage/use is only
used by Alibaba? (but I could be wrong...)
But you'd do well to explain why the userspace solution isn't
acceptable. Are there security issues that moving the implementation
to kernel addresses?
I also have doubts that this solution is _actually_ more performant
than a proper filesystem based solution that allows page cache sharing
of container image data across multiple containers.
This solution doesn't even allow page cache sharing between shared
layers (like current containers do), much less between independent
layers.
There is an active discussion about, and active development effort
for, using overlayfs + erofs for container images. I'm reluctant to
merge this DM based container image approach without wider consensus
from other container stakeholders.
But short of reaching wider consensus on the need for these DM
targets: there is nothing preventing you from carrying these changes
in your Alibaba kernel.
EROFS already has some block-level support for container images (with
nydus), and composefs works with current in-kernel EROFS+overlayfs.
And this new approach doesn't help with what is IMHO our current weak
spot, which is unprivileged container images.
Also, while OCI artifacts can be used to store any kind of image
format (or any other kind of file), I think for an actual standardized
new image format it would be better to work with the OCI org to come
up with an OCI v2 standard image format.
But, I don't really speak for the block layer developers, so take my
opinions with a pinch of salt.