Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

From: Pavel Begunkov
Date: Sun Aug 15 2021 - 06:49:39 EST


On 8/15/21 4:31 AM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>> interface, I think it makes sense to use a dedicated flag for this, like
>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>> including open, accept, and in the future many other operations such as
>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>> close it.)
>>>
>>> The only downside is that you'll need to reject that flag in all
>>> non-open operations. One way to unify that code might be to add a flag
>>> in io_op_def for open-like operations, and then check in common code for
>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>
>> io_uring is really thin, and so I absolutely don't want any extra
>> overhead in the generic path, IOW anything affecting
>> reads/writes/sends/recvs.
>
> There are already several checks for valid flags in io_init_req. For
> instance:

Yes, it's horrible and I don't want to make it any worse.

> if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
> !io_op_defs[req->opcode].buffer_select)
> return -EOPNOTSUPP;
> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
> bitwise op tells you if any invalid flags were passed. *Zero* additional
> overhead for other operations.

Good point

> Alternatively, since there are so few operations that open a file
> descriptor, you could just add a separate opcode for those few
> operations. That still seems preferable to overloading a 16-bit index
> field for this.

I don't think so

> With this new mechanism, I think we're going to want to support more
> than 65535 fixed-file entries. I can easily imagine wanting to handle
> hundreds of thousands of files or sockets this way.

May be. What I'm curious about is that the feature doesn't really
change anything in this regard, but seems I haven't heard people
asking for larger tables.

>> The other reason is that there are only 2 bits left in sqe->flags,
>> and we may use them for something better, considering that it's
>> only open/accept and not much as this.
>
> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
> memfd_secret.

We could argue for many of those whether they should be in io_uring,
and whether there are many benefits having them async and so. It would
have another story if all the ecosystem was io_uring centric, but
that's speculations.

> Of those, I personally would *love* to have at least pipe, socket,
> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
> dup-like operation for moving things between the fixed-file table and
> file descriptors.
>
> I think this is valuable and versatile enough to merit a flag. It would
> also be entirely reasonable to create separate operations for these. But
> either way, I don't think this should just be determined by whether a
> 16-bit index is non-zero.
>
>> I agree that it feels error-prone, but at least it can be wrapped
>> nicely enough in liburing, e.g.
>>
>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>> const char *path, int flags,
>> mode_t mode, int slot_idx);
>
> That wrapper wouldn't be able to handle more than a 16-bit slot index
> though.

It would. Note, the index is "int" there, so if doesn't fit
into u16, we can fail it. And do conversion if required.

>>> Also, rather than using a 16-bit index for the fixed file table and
>>> potentially requiring expansion into a different field in the future,
>>> what about overlapping it with the nofile field in the open and accept
>>> requests? If they're not opening a normal file descriptor, they don't
>>> need nofile. And in the original sqe, you can then overlap it with a
>>> 32-bit field like splice_fd_in.
>>
>> There is no nofile in SQEs, though
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>
> nofile isn't needed for opening into the fixed-file table, so it could
> be omitted in that case, and another field unioned with it.

There is no problem to place it internally. Moreover, it's at the
moment uniformly placed inside io_kiocb, but with nofile we'd need
to find the place on per-op basis.

Not like any matters, it's just bike shedding.

> allow passing a 32-bit fixed-file index into open and accept without
> growing the size of their structures. I think, with this new capability,
> we're going to want a large number of fixed files available.
>
> In the SQE, you could overlap it with the splice_fd_in field, which
> isn't needed by any calls other than splice.

But it doesn't mean it won't be used, as happened with pretty every
other field in SQE. So, it rather depends on what packing is wanted.
And reusing almost never used ->buf_index (and potentially ->ioprio),
sounds reasonable.

--
Pavel Begunkov