Re: [PATCH v2 2/6] composefs: Add on-disk layout

From: Alexander Larsson
Date: Mon Jan 16 2023 - 06:01:02 EST


On Mon, 2023-01-16 at 12:29 +1100, Dave Chinner wrote:
> On Fri, Jan 13, 2023 at 04:33:55PM +0100, Alexander Larsson wrote:
> > This commit adds the on-disk layout header file of composefs.
>
> This isn't really a useful commit message.
>
> Perhaps it should actually explain what the overall goals of the
> on-disk format are - space usage, complexity trade-offs, potential
> issues with validation of variable payload sections, etc.
>

I agree, will flesh it out. But, as for below discussions, one of the
overall goals is to keep the on-disk file size low.

> > Signed-off-by: Alexander Larsson <alexl@xxxxxxxxxx>
> > Co-developed-by: Giuseppe Scrivano <gscrivan@xxxxxxxxxx>
> > Signed-off-by: Giuseppe Scrivano <gscrivan@xxxxxxxxxx>
> > ---
> >  fs/composefs/cfs.h | 203
> > +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 203 insertions(+)
> >  create mode 100644 fs/composefs/cfs.h
> >
> > diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
> > new file mode 100644
> > index 000000000000..658df728e366
> > --- /dev/null
> > +++ b/fs/composefs/cfs.h
> > @@ -0,0 +1,203 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * composefs
> > + *
> > + * Copyright (C) 2021 Giuseppe Scrivano
> > + * Copyright (C) 2022 Alexander Larsson
> > + *
> > + * This file is released under the GPL.
> > + */
> > +
> > +#ifndef _CFS_H
> > +#define _CFS_H
> > +
> > +#include <asm/byteorder.h>
> > +#include <crypto/sha2.h>
> > +#include <linux/fs.h>
> > +#include <linux/stat.h>
> > +#include <linux/types.h>
> > +
> > +#define CFS_VERSION 1
>
> This should start with a description of the on-disk format for the
> version 1 format.

There are some format descriptions in the later document patch. What is
the general approach here, do we document in the header, or in separate
doc file? For example, I don't see much of format descriptions in the
xfs headers. I mean, I should probably add *some* info here for easier
reading of the stuff below, but I don't feel like headers are a great
place for docs.

> > +
> > +#define CFS_MAGIC 0xc078629aU
> > +
> > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> > +#define CFS_MAX_XATTRS_SIZE 4096
>
> How do we store 64kB xattrs in this format if the max attr size is
> 4096 bytes? Or is that the maximum total xattr storage?

This is a current limitation of the composefs file format. I am aware
that the kernel maximum size is 64k, but I'm not sure what use this
would have in a read-only filesystem image in practice. I could extend
this limit with some added complextity, but would it be worth the
increase in complexity?

> A comment telling us what these limits are would be nice.
>

Sure.

> > +
> > +static inline int cfs_digest_from_payload(const char *payload,
> > size_t payload_len,
> > +                                         u8
> > digest_out[SHA256_DIGEST_SIZE])
> > +{
> > +       const char *p, *end;
> > +       u8 last_digit = 0;
> > +       int digit = 0;
> > +       size_t n_nibbles = 0;
> > +
> > +       /* This handles payloads (i.e. path names) that are
> > "essentially" a
> > +        * digest as the digest (if the DIGEST_FROM_PAYLOAD flag is
> > set). The
> > +        * "essential" part means that we ignore hierarchical
> > structure as well
> > +        * as any extension. So, for example "ef/deadbeef.file"
> > would match the
> > +        * (too short) digest "efdeadbeef".
> > +        *
> > +        * This allows images to avoid storing both the digest and
> > the pathname,
> > +        * yet work with pre-existing object store formats of
> > various kinds.
> > +        */
> > +
> > +       end = payload + payload_len;
> > +       for (p = payload; p != end; p++) {
> > +               /* Skip subdir structure */
> > +               if (*p == '/')
> > +                       continue;
> > +
> > +               /* Break at (and ignore) extension */
> > +               if (*p == '.')
> > +                       break;
> > +
> > +               if (n_nibbles == SHA256_DIGEST_SIZE * 2)
> > +                       return -EINVAL; /* Too long */
> > +
> > +               digit = hex_to_bin(*p);
> > +               if (digit == -1)
> > +                       return -EINVAL; /* Not hex digit */
> > +
> > +               n_nibbles++;
> > +               if ((n_nibbles % 2) == 0)
> > +                       digest_out[n_nibbles / 2 - 1] = (last_digit
> > << 4) | digit;
> > +               last_digit = digit;
> > +       }
> > +
> > +       if (n_nibbles != SHA256_DIGEST_SIZE * 2)
> > +               return -EINVAL; /* Too short */
> > +
> > +       return 0;
> > +}
>
> Too big to be a inline function.
>

Yeah, I'm aware of this. I mainly put it in the header as the
implementation of it is sort of part of the on-disk format. But, I can
move it to a .c file instead.


> > +
> > +struct cfs_vdata_s {
>
> Drop the "_s" suffix to indicate the type is a structure - that's
> waht "struct" tells us.

Sure.

> > +       u64 off;
> > +       u32 len;
>
> If these are on-disk format structures, why aren't the defined as
> using the specific endian they are encoded in? i.e. __le64, __le32,
> etc? Otherwise a file built on a big endian machine won't be
> readable on a little endian machine (and vice versa).

On disk all fields are little endian. However, when we read them from
disk we convert them using e.g. le32_to_cpu(), and then we use the same
structure in memory, with native endian. So, it seems wrong to mark
them as little endian.

>
> > +} __packed;
> > +
> > +struct cfs_header_s {
> > +       u8 version;
> > +       u8 unused1;
> > +       u16 unused2;
>
> Why are you hyper-optimising these structures for minimal space
> usage? This is 2023 - we can use a __le32 for the version number,
> the magic number and then leave....
>
> > +
> > +       u32 magic;
> > +       u64 data_offset;
> > +       u64 root_inode;
> > +
> > +       u64 unused3[2];
>
> a whole heap of space to round it up to at least a CPU cacheline
> size using something like "__le64 unused[15]".
>
> That way we don't need packed structures nor do we care about having
> weird little holes in the structures to fill....

Sure.

> > +} __packed;
> > +
> > +enum cfs_inode_flags {
> > +       CFS_INODE_FLAGS_NONE = 0,
> > +       CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
> > +       CFS_INODE_FLAGS_MODE = 1 << 1,
> > +       CFS_INODE_FLAGS_NLINK = 1 << 2,
> > +       CFS_INODE_FLAGS_UIDGID = 1 << 3,
> > +       CFS_INODE_FLAGS_RDEV = 1 << 4,
> > +       CFS_INODE_FLAGS_TIMES = 1 << 5,
> > +       CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
> > +       CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of st_size
> > */
> > +       CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of
> > st_size */
>
> Why do we need to complicate things by splitting the inode size
> like this?
>

The goal is to minimize the image size for a typical rootfs or
container image. Almost zero files in any such images are > 4GB. 

Also, we don't just "not decode" the items with the flag not set, they
are not even stored on disk.

> > +       CFS_INODE_FLAGS_XATTRS = 1 << 9,
> > +       CFS_INODE_FLAGS_DIGEST = 1 << 10, /* fs-verity sha256
> > digest */
> > +       CFS_INODE_FLAGS_DIGEST_FROM_PAYLOAD = 1 << 11, /* Compute
> > digest from payload */
> > +};
> > +
> > +#define CFS_INODE_FLAG_CHECK(_flag,
> > _name)                                     \
> > +       (((_flag) & (CFS_INODE_FLAGS_##_name)) != 0)
>
> Check what about a flag? If this is a "check that a feature is set",
> then open coding it better, but if you must do it like this, then
> please use static inline functions like:
>
>         if (cfs_inode_has_xattrs(inode->flags)) {
>                 .....
>         }
>

The check is if the flag is set, so maybe CFS_INODE_FLAG_IS_SET is a
better name. This is used only when decoding the on-disk version of the
inode to the in memory one, which is a bunch of:

if (CFS_INODE_FLAG_CHECK(ino->flags, THE_FIELD))
ino->the_field = cfs_read_u32(&data);
else
ino->the_field = THE_FIELD_DEFUALT;

I can easily open-code these checks, although I'm not sure it makes a
great difference either way.

> > +#define CFS_INODE_FLAG_CHECK_SIZE(_flag, _name,
> > _size)                         \
> > +       (CFS_INODE_FLAG_CHECK(_flag, _name) ? (_size) : 0)
>
> This doesn't seem particularly useful, because you've still got to
> test is the return value is valid. i.e.
>
>         size = CFS_INODE_FLAG_CHECK_SIZE(inode->flags, XATTRS, 32);
>         if (size == 32) {
>                 /* got xattrs, decode! */
>         }
>
> vs
>         if (cfs_inode_has_xattrs(inode->flags)) {
>                 /* decode! */
>         }

This macro is only uses by the cfs_inode_encoded_size() function that
computes the size of the on-disk format of an inode, given its flags:

static inline u32 cfs_inode_encoded_size(u32 flags)
{
return sizeof(u32) /* flags */ +
CFS_INODE_FLAG_CHECK_SIZE(flags, PAYLOAD, sizeof(u32))
+
CFS_INODE_FLAG_CHECK_SIZE(flags, MODE, sizeof(u32)) +
CFS_INODE_FLAG_CHECK_SIZE(flags, NLINK, sizeof(u32)) +
...

It is only useful in the sense that it makes this function easy to
read/write. I should maybe move the definion of the macro to that
function.

>
> > +
> > +#define CFS_INODE_DEFAULT_MODE 0100644
> > +#define CFS_INODE_DEFAULT_NLINK 1
> > +#define CFS_INODE_DEFAULT_NLINK_DIR 2
> > +#define CFS_INODE_DEFAULT_UIDGID 0
> > +#define CFS_INODE_DEFAULT_RDEV 0
> > +#define CFS_INODE_DEFAULT_TIMES 0
>
> Where do these get used? Are they on disk defaults or something
> else? (comment, please!)

They are the defaults that are used when inode fields on disk are
missing. I'll add some comments.

> > +struct cfs_inode_s {
> > +       u32 flags;
> > +
> > +       /* Optional data: (selected by flags) */
>
> WHy would you make them optional given that all the fields are still
> defined in the structure?
>
> It's much simpler just to decode the entire structure into memory
> than to have to check each flag value to determine if a field needs
> to be decoded...
>

I guess I need to clarify these comments a bit, but they are optional
on-disk, and decoded and extended with the above defaults by
cfs_get_ino_index() when read into memory. So, they are not optional in
memory.


> > +       /* This is the size of the type specific data that comes
> > directly after
> > +        * the inode in the file. Of this type:
> > +        *
> > +        * directory: cfs_dir_s
> > +        * regular file: the backing filename
> > +        * symlink: the target link
> > +        *
> > +        * Canonically payload_length is 0 for empty
> > dir/file/symlink.
> > +        */
> > +       u32 payload_length;
>
> How do you have an empty symlink?

In terms of the file format, empty would mean a zero length target
string. But you're right that this isn't allowed. I'll change this
comment.

> > +       u32 st_mode; /* File type and mode.  */
> > +       u32 st_nlink; /* Number of hard links, only for regular
> > files.  */
> > +       u32 st_uid; /* User ID of owner.  */
> > +       u32 st_gid; /* Group ID of owner.  */
> > +       u32 st_rdev; /* Device ID (if special file).  */
> > +       u64 st_size; /* Size of file, only used for regular files
> > */
> > +
> > +       struct cfs_vdata_s xattrs; /* ref to variable data */
>
> This is in the payload that follows the inode?  Is it included in
> the payload_length above?
>
> If not, where is this stuff located, how do we validate it points to
> the correct place in the on-disk format file, the xattrs belong to
> this specific inode, etc? I think that's kinda important to
> describe, because xattrs often contain important security
> information...

No, all inodes are packed into the initial part of the file, each
containing a flags set, a variable size (from flags) chunk of fixed
size elements and an variable size payload. The payload is either the
target symlink for symlinks, or the path of the backing file for
regular files. Other data, such as xattrs and dirents are stored in a
separate part of the file and the offsets for those in the inode refer
to offsets into that area.

>
> > +
> > +       u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */
>
> Why would you have this in the on-disk structure, then also have
> "digest from payload" that allows the digest to be in the payload
> section of the inode data?

The payload is normally the path to the backing file, and then you need
to store the verity digest separately. This is what would be needed
when using this with ostree for instance, because we have an existing
backing file repo format we can't change. However, if your backing
store files are stored by their fs-verity digest already (which is the
default for mkcomposefs), then we can set this flag and avoid storing
the digest unnecessary.

> > +
> > +       struct timespec64 st_mtim; /* Time of last modification. 
> > */
> > +       struct timespec64 st_ctim; /* Time of last status change. 
> > */
> > +};
>
> This really feels like an in-memory format inode, not an on-disk
> format inode, because this:
>
> > +
> > +static inline u32 cfs_inode_encoded_size(u32 flags)
> > +{
> > +       return sizeof(u32) /* flags */ +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, PAYLOAD,
> > sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, MODE, sizeof(u32))
> > +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, NLINK, sizeof(u32))
> > +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, UIDGID, sizeof(u32)
> > + sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, RDEV, sizeof(u32))
> > +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES, sizeof(u64)
> > * 2) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES_NSEC,
> > sizeof(u32) * 2) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, LOW_SIZE,
> > sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, HIGH_SIZE,
> > sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, XATTRS, sizeof(u64)
> > + sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, DIGEST,
> > SHA256_DIGEST_SIZE);
> > +}
>
> looks like the on-disk format is an encoded format hyper-optimised
> for minimal storage space usage?

Yes.


> Without comments to explain it, I'm not exactly sure what is stored
> in the on-disk format inodes, nor the layout of the variable
> payload section or how payload sections are defined and verified.
>
> Seems overly complex to me - it's far simpler just to have a fixed
> inode structure and just decode it directly into the in-memory
> structure when it is read....

We have a not-fixed-size on disk inode structure (for size reasons)
which we decode directly into the above in-memory structure when read.
So I don't think we're that far from what you expect. However, yes,
this could easily be explained better.

>
> > +struct cfs_dentry_s {
> > +       /* Index of struct cfs_inode_s */
>
> Not a useful (or correct!) comment :/

Its not really incorrect, but I agree its not neccessary a great
comment. At this specific offset in the inode section we can decode the
cfs_inode_s that this inode refers to, and his offset is also the inode
number of the inode.

> Also, the typical term for this on disk structure in a filesystem is
> a "dirent", and this is also what readdir() returns to userspace.
> dentry is typically used internally in the kernel to refer to the
> VFS cache layer objects, not the filesystem dirents the VFS layers
> look up to populate it's dentry cache.
>

Yeah, i'll rename it.

> > +       u64 inode_index;
> > +       u8 d_type;
> > +       u8 name_len;
> > +       u16 name_offset;
>
> What's this name_offset refer to?

Dirents are stored in chunks, each chunk < 4k. These chunks are a list
of these dirents, followed by the strings for the names, the
name_offset is the offset from the start of the chunk to the name.

> > +} __packed;
> > +
> > +struct cfs_dir_chunk_s {
> > +       u16 n_dentries;
> > +       u16 chunk_size;
> > +       u64 chunk_offset;
>
> What's this chunk offset refer to?
>

This is the offset in the "variable data" section of the image. This
section follows the packed inode data section. Again, better comments
needed.

> > +} __packed;
> > +
> > +struct cfs_dir_s {
> > +       u32 n_chunks;
> > +       struct cfs_dir_chunk_s chunks[];
> > +} __packed;
>
> So directory data is packed in discrete chunks? Given that this is a
> static directory format, and the size of the directory is known at
> image creation time, why does the storage need to be chunked?

We chunk the data such that each chunk fits inside a single page in the
image file. I did this to make accessing image data directly from the
page cache easier. We can just kmap_page_local() each chunk and treat
it as a non-split continuous dirent array, then move on to the next
chunk in the next page. If we had dirent data spanning multiple pages
then we would either need to map the pages consecutively (which seems
hard/costly) or have complex in-kernel code to handle the case where a
dirent straddles two pages.

> > +
> > +#define
> > cfs_dir_size(_n_chunks)                                            
> >     \
> > +       (sizeof(struct cfs_dir_s) + (_n_chunks) * sizeof(struct
> > cfs_dir_chunk_s))
>
> static inline, at least.
>
> Also, this appears to be the size of the encoded directory
> header, not the size of the directory itself. cfs_dir_header_size(),
> perhaps, to match the cfs_xattr_header_size() function that does the
> same thing?

Yeah, that makes sense.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=
Alexander Larsson Red Hat,
Inc
alexl@xxxxxxxxxx alexander.larsson@xxxxxxxxx
He's a suicidal hunchbacked cyborg who knows the secret of the alien
invasion. She's a time-travelling out-of-work snake charmer with a song
in her heart and a spring in her step. They fight crime!