Re: [PATCH v6 3/8] rust: file: add Rust abstraction for `struct file`

From: Alice Ryhl
Date: Mon May 27 2024 - 12:14:35 EST


Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:
> > > You obviously are aware of this but I'm just spelling it out. Iirc,
> > > there will practically only ever be one light refcount per file.
> > >
> > > For a light refcount to be used we know that the file descriptor table
> > > isn't shared with any other task. So there are no threads that could
> > > concurrently access the file descriptor table. We also know that the
> > > file descriptor table cannot become shared while we're in system call
> > > context because the caller can't create new threads and they can't
> > > unshare the file descriptor table.
> > >
> > > So there's only one fdget() caller (Yes, they could call fdget()
> > > multiple times and then have to do fdput() multiple times but that's a
> > > level of weirdness that we don't need to worry about.).
> >
> > Hmm. Is it not the case that different processes with different file
> > descriptor tables could reference the same underlying `struct file` and
> > both use light refcounts to do so, as long as each fd table is not
> > shared? So there could be multiple light refcounts to the same `struct
> > file` at the same time on different threads.
>
> Relevant rules:
>
> * Each file pointer in any descriptor table contributes to refcount
> of file.
>
> * All assignments to task->files are done by the task itself or,
> during task creation, by its parent The latter happens before the task
> runs for the first time. The former is done with task_lock(current)
> held.
>
> * current->files is always stable. The object it points to
> is guaranteed to stay alive at least until you explicitly change
> current->files.
> * task->files is stable while you are holding task_lock(task).
> The object it points to is guaranteed to stay alive until you release
> task_lock(task).
> * task->files MAY be fetched (racily) without either of the
> above, but it should not be dereferenced - the memory may be freed
> and reused right after you've fetched the pointer.
>
> * descriptor tables are refcounted by table->count.
> * descriptor table is created with ->count equal to 1 and
> destroyed when its ->count reaches 0.
> * each task with task->files == table contributes to table->count.
> * before the task dies, its ->files becomes NULL (see exit_files()).
> * when task is born (see copy_process() and copy_files())) the parent
> is responsible for setting the value of task->files and making sure that
> refcounts are correct; that's the only case where one is allowed to acquire
> an extra reference to existing table (handling of clone(2) with COPY_FILES).
>
> * the only descriptor table one may modify is that pointed to
> by current->files. Any access to other threads' descriptor tables is
> read-only.
>
> * struct fd is fundamentally thread-local. It should never be
> passed around, put into shared data structures, etc.
>
> * if you have done fdget(N), the matching fdput() MUST be done
> before the caller modifies the Nth slot of its descriptor table,
> spawns children that would share the descriptor table.
>
> * fdget() MAY borrow a reference from caller's descriptor table.
> That can be done if current->files->count is equal to 1.
> In that case we can be certain that the file reference we fetched from
> our descriptor table will remain unchanged (and thus contributing to refcount
> of file) until fdput(). Indeed,
> + at the time of fdget() no other thread has task->files pointing
> to our table (otherwise ->count would be greater than 1).
> + our thread will remain the sole owner of descriptor table at
> least until fdput(). Indeed, the first additional thread with task->files
> pointing to our table would have to have been spawned by us and we are
> forbidden to do that (rules for fdget() use)
> + no other thread could modify our descriptor table (they would
> have to share it first).
> + we are allowed to modify our table, but we are forbidden to touch
> the slot we'd copied from (rules for fdget() use).
>
> In other words, if current->files->count is equal to 1 at fdget() time
> we can skip incrementing refcount. Matching fdput() would need to
> skip decrement, of course. Note that we must record that (borrowed
> vs. cloned) in struct fd - the condition cannot be rechecked at fdput()
> time, since the table that had been shared at fdget() time might no longer
> be shared by the time of fdput().

This is great! It matches my understanding. I didn't know the details
about current->files and task->files.

You should copy this to the kernel documentation somewhere. :)

> > And this does *not* apply to `fdget_pos`, which checks the refcount of
> > the `struct file` instead of the refcount of the fd table.
>
> False. fdget_pos() is identical to fdget() as far as file refcount
> handling goes. The part that is different is that grabbing ->f_pos_lock
> is sensitive to file refcount in some cases. This is orthogonal to
> "does this struct fd contribute to file refcount".

Sorry, I see now that I didn't phrase that quite right. What I meant is
that there are ways of sharing a `struct file` reference during an fdget
scope that are not dangerous, but where it *would be* dangerous if it
was an fdget_pos scope instead. Specifically, the reason they are
dangerous is that they can lead to a data race on the file position if
the fdget_pos scope did not take the f_pos_lock mutex.

For example, during an `fdget(N)` scope, you can always do a `get_file`
and then send it to another process and `fd_install` it into that other
process. There's no way that this could result in the deletion of the
Nth entry of `current->files`.

However, during an `fdget_pos(N)` scope, then it is *not* the case that
it's always okay to send a `get_file` reference to another thread and
`fd_install` it. Because after the remote process returns from the
syscall in which we `fd_install`ed the file, the remote process could
proceed to call another syscall that in turn modifies the file position.
And if the original `fdget_pos(N)` scope modifies the file position
after sending the `get_file` reference, then that could be a data race
on f_pos.

> Again, "light" references are tied to thread; they can only be created
> if we are guaranteed that descriptor table's slot they came from will
> remain unchanged for as long as the reference is used.
>
> And yes, there may be several light references to the same file - both
> in different processes that do not share descriptor table *and* in the
> same thread, if e.g. sendfile(in_fd, out_fd, ...) is called with
> in_fd == out_fd.

Thanks for confirming this!





I hope this reply along with my reply to Christian Brauner also
addresses your other thread. Let me know if it doesn't.

Alice