Re: [PATCH v10 6/8] rust: file: add `FileDescriptorReservation`

From: Al Viro
Date: Sun Sep 15 2024 - 18:02:11 EST



> What happens if I call `get_unused_fd_flags`, and then never call
> `put_unused_fd`? Assume that I don't try to use the fd in the future,
> and that I just forgot to clean up after myself.

Moderate amount of bogosities while the thread exists. For one thing,
descriptor is leaked - for open() et.al. it will look like it's in use.
For close() it will look like it's _not_ open (i.e. trying to close it
from userland will quietly do nothing). Trying to overwrite it with
dup2() will keep failing with -EBUSY.

Kernel-side it definitely violates assertions, but currently nothing
will break. Might or might not remain true in the future. Doing that
again and again would lead to inflated descriptor table, but then
so would dup2(0, something_large).

IOW, it would be a bug, but it's probably not going to be high impact
security hole.

> > + execve() at the point of no return (in begin_new_exec()).
execve(2), sorry.
> > That's the only place where violation of that constraint on some later
> > changes is plausible. That one needs to be watched out for.

> Thanks for going through that.

I'm in the middle of writing documentation on the descriptor table and
struct file handling right now anyway...

> From a Rust perspective, it sounds
> easiest to just declare that execve() is an unsafe operation, and that
> one of the conditions for using it is that you don't hold any fd
> reservations. Trying to encode this in the fd reservation logic seems
> too onerous, and I'm guessing execve is not used particularly often
> anyway.

Sorry, bad editing on my part - I should've clearly marked execve() as a
syscall. It's not that it's an unsafe operation - it's only called from
userland anyway, so it's not going to happen in scope of any reserved
descriptor.

The problem is different:

* userland calls execve("/usr/bin/whatever", argv, envp)

* that proceeds through several wrappers to do_execveat_common().
There are several syscalls in that family and they all converge
to do_execveat_common() - wrappers just deal with difference
in calling conventions.

* do_execveat_common() set up exec context (struct linux_binprm).
That opens the binary to be executed, creates memory context
to be used, calculates argc, sets up argv and envp on what will
become the userland stack for the new program, etc. - basically,
all the work for marshalling the data from caller's memory.
Then it calls bprm_execve(), which is where the rest of the work
will be done.

* bprm_execve() eventually calls exec_binprm(). That calls
search_binary_handler(), which is where we finally get a look
at the binary we are trying to load. search_binary_handler()
goes through the known binary formats (ELF, script, etc.)
and tries to offer the exec context to ->load_binary() of
each.

* ->load_binary() instance looks at the binary (starting with the
first 256 bytes read for us by prepare_binprm() called in the
beginning of search_binary_handler()). If it doesn't have the
right magic values, ->load_binary() just returns -ENOEXEC,
so that the next format would be tried. If it *does* look like
something this format is going to deal with, more sanity checks
are done, things are set up, etc. - details depend upon the
binary format in question. See load_elf_binary() for some taste
of that. Eventually it decides to actually discard the old
memory and switch to new binary. Up to that point it can
return an error - -ENOEXEC for soft ones ("not mine, after all,
try other formats"), something like -EINVAL/-ENOMEM/-EPERM/-EIO/etc.
for hard ones ("fail execve(2) with that error"). _After_ that
point we have nothing to return to; old binary is not mapped
anymore, userland stack frame is gone, etc. Any errors past
that point are treated as "kill me".

At the point of no return we call begin_new_exec(). That
kills all other threads, unshares descriptor table in case it
had been shared wider than the thread group, switches memory
context, etc.

Once begin_new_exec() is done, we can safely map whatever we
want to map, handle relocations, etc. Among other things,
we modify the userland register values saved on syscall entry,
so that once we return from syscall we'll end up with the
right register state at the right userland entry point, etc.
If everything goes fine, ->load_binary() returns 0 into
search_binary_handler() and we are pretty much done - some
cleanups on the way out and off to the loaded binary.

Alternatively, we may decide to mangle the exec context -
that's what #! handling does (see load_script() - basically
it opens the interpreter and massages the arguments, so
that something like debian/rules build-arch turns into
/usr/bin/make -f debian/rules build-arch and tells the
caller to deal with that; this is what the loop in
search_binary_handler() is about).

There's not a lot of binary formats (5 of those currently -
all in fs/binmt_*.c), but there's nothing to prohibit more
of them. If somebody decides to add the infrastructure for
writing those in Rust, begin_new_exec() wrapper will need
to be documented as "never call that in scope of reserved
descriptor". Maybe by marking that wrapper unsafe and
telling the users about the restriction wrt descriptor
reservations, maybe by somehow telling the compiler to
watch out for that - or maybe the constraint will be gone
by that time.

In any case, the underlying constraint ("a thread with
reserved descriptors should not try to get a private
descriptor table until all those descriptors are disposed
of one way or another") needs to be documented.