Re: [PATCH v10 6/8] rust: file: add `FileDescriptorReservation`

From: Al Viro
Date: Sun Sep 15 2024 - 15:35:32 EST


On Sun, Sep 15, 2024 at 07:39:05PM +0100, Al Viro wrote:
> 2) calling thread MUST NOT unshare descriptor table while it has
> any reserved descriptors. I.e.
> fd = get_unused_fd();
> unshare_files();
> fd_install(fd, file);
> is a bug. Reservations are discarded by that. Getting rid of that
> constraint would require tracking the sets of reserved descriptors
> separately for each thread that happens to share the descriptor table.
> Conceptually they *are* per-thread - the same thread that has done
> reservation must either discard it or use it. However, it's easier to
> keep the "it's reserved by some thread" represented in descriptor table
> itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is
> NULL) than try and keep track of who's reserved what. The constraint is
> basically "all reservations can stay with the old copy", i.e. "caller has
> no reservations of its own to transfer into the new private copy it gets".

FWIW, I toyed with the idea of having reservations kept per-thread;
it is possible and it simplifies some things, but I hadn't been able to
find a way to do that without buggering syscall latency for open() et.al.

It would keep track of the set of reservations in task_struct (count,
two-element array for the first two + page pointer for spillovers,
for the rare threads that need more than two reserved simultaneously).

Representation in fdtable:
state open_fds bit value in ->fd[] array
free clear 0UL
reserved set 0UL
uncommitted set 1UL|(unsigned long)file
open set (unsigned long)file

with file lookup treating any odd value as 0 (i.e. as reserved)
fd_install() switching reserved to uncommitted *AND* separate
"commit" operation that does this:
if current->reservation_count == 0
return
if failure
for each descriptor in our reserved set
v = fdtable->fd[descriptor]
if (v) {
fdtable->fd[descriptor] = 0;
fput((struct file *)(v & ~1);
}
clear bit in fdtable->open_fds[]
else
for each descriptor in our reserved set
v = fdtable->fd[descriptor]
if (v)
fdtable->fd[descriptor] = v & ~1;
else
BUG
current->reservation_count = 0

That "commit" thing would be called on return from syscall
for userland threads and would be called explicitly in
kernel threads that have a descriptor table and work with
it.

The benefit would be that fd_install() would *NOT* have to be done
after the last possible failure exit - something that installs
a lot of files wouldn't have to play convoluted games on cleanup.
Simply returning an error would do the right thing.

Two stores into ->fd[] instead of one is not a big deal;
however, anything like task_work_add() to arrange the call
of "commit" ends up being bloody awful.

We could have it called from syscall glue directly, but
that means touching assembler for quite a few architectures,
and I hadn't gotten around to that. Can be done, but it's
not a pleasant work...