Re: file metadata via fs API

From: Lennart Poettering
Date: Fri Aug 14 2020 - 03:58:45 EST

On Wed, 12.08.20 12:50, Linus Torvalds (torvalds@xxxxxxxxxxxxxxxxxxxx) wrote:

> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
> >
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
> We haven't had that before, I don't see why it's suddenly such a big deal.
> The notification side I understand. Polling /proc files is not the answer.
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
> It really smells like "do it because we can, not because we must".

With my hat on as maintainer of systemd (and of other userspace stuff),
there are a couple of things I really want from the kernel because they
would fix real problems for us:

1. We want mount notifications that don't require us to rescan
/proc/self/mountinfo in full every time things change, over and
over again, simply because that doesn't scale. We have various
bugs open about this performance bottleneck that I could point you
to, but I figure it's easy to see why this currently doesn't scale...

2. We want an unpriv API to query (and maybe set) the fs UUID, like we
have nowadays for the fs label with FS_IOC_[GS]ETFSLABEL.

3. We want an API to query the timestamp granularity of a file
system. Otherwise it's very hard in userspace to reproducibly
re-generate directory trees. We need to know, for example, that some
fs only has 2s granularity (like fat).

4. Similarly, we want to know if an fs is case-sensitive for file
names. Or case-preserving. And which charset it accepts for filenames.

5. We want to know if a file system supports access modes, xattrs,
file ownership, device nodes, symlinks, hardlinks, fifos, atimes,
btimes, ACLs and so on. All these things currently can only be
figured out by changing things and reading back if it worked. Which
sucks hard of course.

6. We'd like to know the max file size on a file system.

7. Right now it's hard to figure out the mount options used for the fs
backing some file: you can now statx() the file, get the mnt_id
from that, and then search for it in /proc/self/mountinfo, but
it's slow, because again we need to scan the whole file until we
find the entry we need. And that can be huge IRL.

8. Similarly: we quite often want to know the submounts of a mount. It
would be great if for that kind of information (i.e. the list of
mnt_ids below some other mnt_id) we didn't have to scan the whole
of /proc/self/mountinfo again. In many cases in our code we operate
recursively, and want to know the mounts below some specific dir,
but currently pay a performance price for it if the number of file
systems on the host is huge. This doesn't sound like a biggie, but
it actually is one. In systemd we spend a lot of time scanning
/proc/self/mountinfo...

9. How are file locks implemented on this fs? Are they local only, and
orthogonal to remote locks? Are POSIX and BSD locks possibly merged
at the backend? Do they work at all?
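
To make the cost in items 7 and 8 concrete, here is a minimal sketch of
what userspace has to do today: parse every line of the mountinfo text,
then scan linearly for one mnt_id or walk parent IDs to collect
submounts. The Mount tuple, the function names and the SAMPLE text are
made up for illustration; this is not systemd's actual code, and octal
escapes such as \040 in mount points are left undecoded.

```python
from typing import Dict, List, NamedTuple, Optional


class Mount(NamedTuple):
    mnt_id: int
    parent_id: int
    mount_point: str
    options: str
    fs_type: str


# Made-up sample in the /proc/self/mountinfo layout; not from a real host.
SAMPLE = """\
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
23 22 0:21 / /proc rw,nosuid - proc proc rw
24 22 0:22 / /home rw,noatime shared:2 - btrfs /dev/sdb1 rw,space_cache
25 24 0:23 / /home/user/mnt rw - tmpfs tmpfs rw,size=1024k
"""


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse every line; the cost grows with the number of mounts."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        try:
            # Optional fields end at a lone "-", at index 6 or later.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # skip lines that do not match the layout
        if len(fields) <= sep + 1:
            continue
        mounts.append(Mount(int(fields[0]), int(fields[1]),
                            fields[4], fields[5], fields[sep + 1]))
    return mounts


def find_mount(mounts: List[Mount], mnt_id: int) -> Optional[Mount]:
    """Item 7: linear scan for a single mnt_id."""
    for m in mounts:
        if m.mnt_id == mnt_id:
            return m
    return None


def submounts(mounts: List[Mount], mnt_id: int) -> List[int]:
    """Item 8: collect all mnt_ids below mnt_id by following parent IDs."""
    children: Dict[int, List[int]] = {}
    for m in mounts:
        children.setdefault(m.parent_id, []).append(m.mnt_id)
    found, pending, seen = [], [mnt_id], {mnt_id}
    while pending:
        for child in children.get(pending.pop(), []):
            if child not in seen:  # guards against self-parented entries
                seen.add(child)
                found.append(child)
                pending.append(child)
    return found


if __name__ == "__main__":
    table = parse_mountinfo(SAMPLE)
    print(find_mount(table, 24))
    print(submounts(table, 22))
```

On a live system the text would come from reading /proc/self/mountinfo,
and the whole parse has to be repeated after every change, which is the
part that doesn't scale.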

I don't really care too much what an API for this looks like, but let
me just say that I am not a fan of APIs that require allocating an fd
for querying info about an fd. This 'feels' a bit too recursive: if
you expose information about some fd in some magic procfs subdir, or
even in some virtual pseudo-file below the file's path, then this means
we have to allocate a new fd to figure out things about the first fd,
and if we wanted to know the same info about that one, we'd
theoretically recurse down. Now of course, most likely IRL we wouldn't
actually recurse down, but it is still smelly. In particular if fd
limits are tight. I mean, I really don't care if you expose
non-file-system stuff via the fs, if that's what you want, but I think
exposing *fs* metainfo in the *fs* is just ugly.

I generally detest APIs that have no chance of ever returning multiple
bits of information atomically. Splitting up the querying of multiple
attributes into multiple system calls means they can't possibly be
determined in a congruent way. I much prefer APIs where we provide a
struct to fill in and do a single syscall, and at least for some
fields we'd know afterwards that the fields were filled in together
and are congruent with each other.
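
As a rough illustration of the difference, using plain per-file stat
data as a stand-in (no fs-level equivalent exists to show): snapshot()
below takes all fields from one os.stat() result, while piecemeal()
issues a separate call per field, so the file could be replaced between
calls and the values could describe different states. The function
names and the temp file are made up for this sketch.

```python
import os
import tempfile


def snapshot(path):
    """All three fields come from a single stat result."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "ino": st.st_ino}


def piecemeal(path):
    """One call per field; the file may change between the calls."""
    return {
        "size": os.path.getsize(path),
        "mtime_ns": os.stat(path).st_mtime_ns,
        "ino": os.stat(path).st_ino,
    }


def demo():
    """Create a 5-byte temp file and query it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example")
        with open(path, "wb") as f:
            f.write(b"hello")
        return snapshot(path), piecemeal(path)


if __name__ == "__main__":
    print(demo())
```

With nothing else touching the file both variants agree; only the first
one still hangs together when something modifies the file concurrently.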

I am a fan of the statx() system call, I must say. If we had something
like this for the file system itself I'd be quite happy; it could tick
off many of the requests I list above.

Hope this is useful,


Lennart Poettering, Berlin