Re: [PATCH v6 3/8] rust: file: add Rust abstraction for `struct file`

From: Al Viro
Date: Fri May 24 2024 - 17:33:36 EST


On Fri, May 24, 2024 at 07:17:13PM +0000, Alice Ryhl wrote:

> > You obviously are aware of this but I'm just spelling it out. Iirc,
> > there will practically only ever be one light refcount per file.
> >
> > For a light refcount to be used we know that the file descriptor table
> > isn't shared with any other task. So there are no threads that could
> > concurrently access the file descriptor table. We also know that the
> > file descriptor table cannot become shared while we're in system call
> > context because the caller can't create new threads and they can't
> > unshare the file descriptor table.
> >
> > So there's only one fdget() caller (Yes, they could call fdget()
> > multiple times and then have to do fdput() multiple times but that's a
> > level of weirdness that we don't need to worry about.).
>
> Hmm. Is it not the case that different processes with different file
> descriptor tables could reference the same underlying `struct file` and
> both use light refcounts to do so, as long as each fd table is not
> shared? So there could be multiple light refcounts to the same `struct
> file` at the same time on different threads.

Relevant rules:

* Each file pointer in any descriptor table contributes to refcount
of file.

* All assignments to task->files are done by the task itself or,
during task creation, by its parent The latter happens before the task
runs for the first time. The former is done with task_lock(current)
held.

* current->files is always stable. The object it points to
is guaranteed to stay alive at least until you explicitly change
current->files.
* task->files is stable while you are holding task_lock(task).
The object it points to is guaranteed to stay alive until you release
task_lock(task).
* task->files MAY be fetched (racily) without either of the
above, but it should not be dereferenced - the memory may be freed
and reused right after you've fetched the pointer.

* descriptor tables are refcounted by table->count.
* descriptor table is created with ->count equal to 1 and
destroyed when its ->count reaches 0.
* each task with task->files == table contributes to table->count.
* before the task dies, its ->files becomes NULL (see exit_files()).
* when task is born (see copy_process() and copy_files())) the parent
is responsible for setting the value of task->files and making sure that
refcounts are correct; that's the only case where one is allowed to acquire
an extra reference to existing table (handling of clone(2) with COPY_FILES).

* the only descriptor table one may modify is that pointed to
by current->files. Any access to other threads' descriptor tables is
read-only.

* struct fd is fundamentally thread-local. It should never be
passed around, put into shared data structures, etc.

* if you have done fdget(N), the matching fdput() MUST be done
before the caller modifies the Nth slot of its descriptor table,
spawns children that would share the descriptor table.

* fdget() MAY borrow a reference from caller's descriptor table.
That can be done if current->files->count is equal to 1.
In that case we can be certain that the file reference we fetched from
our descriptor table will remain unchanged (and thus contributing to refcount
of file) until fdput(). Indeed,
+ at the time of fdget() no other thread has task->files pointing
to our table (otherwise ->count would be greater than 1).
+ our thread will remain the sole owner of descriptor table at
least until fdput(). Indeed, the first additional thread with task->files
pointing to our table would have to have been spawned by us and we are
forbidden to do that (rules for fdget() use)
+ no other thread could modify our descriptor table (they would
have to share it first).
+ we are allowed to modify our table, but we are forbidden to touch
the slot we'd copied from (rules for fdget() use).

In other words, if current->files->count is equal to 1 at fdget() time
we can skip incrementing refcount. Matching fdput() would need to
skip decrement, of course. Note that we must record that (borrowed
vs. cloned) in struct fd - the condition cannot be rechecked at fdput()
time, since the table that had been shared at fdget() time might no longer
be shared by the time of fdput().

> And this does *not* apply to `fdget_pos`, which checks the refcount of
> the `struct file` instead of the refcount of the fd table.

False. fdget_pos() is identical to fdget() as far as file refcount
handling goes. The part that is different is that grabbing ->f_pos_lock
is sensitive to file refcount in some cases. This is orthogonal to
"does this struct fd contribute to file refcount".


Again, "light" references are tied to thread; they can only be created
if we are guaranteed that descriptor table's slot they came from will
remain unchanged for as long as the reference is used.

And yes, there may be several light references to the same file - both
in different processes that do not share descriptor table *and* in the
same thread, if e.g. sendfile(in_fd, out_fd, ...) is called with
in_fd == out_fd.