Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
From: Jens Axboe
Date: Sun Jan 26 2020 - 11:57:40 EST
On 1/25/20 10:54 PM, Andres Freund wrote:
> Hi,
>
> On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
>> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>>> From: Jens Axboe <axboe@xxxxxxxxx>
>>>
>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>
>>> If the credentials or the mm doesn't match, don't allow the task to
>>> submit anything on behalf of this ring. The task that owns the ring can
>>> pass the file descriptor to another task, but we don't want to allow
>>> that task to submit an SQE that then assumes the ring mm and creds if
>>> it needs to go async.
>>>
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> Suggested-by: Stefan Metzmacher <metze@xxxxxxxxx>
>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>>>
>>>
>>> ---
>>> fs/io_uring.c | 6 ++++++
>>> 1 file changed, 6 insertions(+)
>>>
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>> wake_up(&ctx->sqo_wait);
>>> submitted = to_submit;
>>> } else if (to_submit) {
>>> + if (current->mm != ctx->sqo_mm ||
>>> + current_cred() != ctx->creds) {
>>> + ret = -EPERM;
>>> + goto out;
>>> + }
>>> +
>>
>> I thought about this a bit more.
>>
>> I'm not sure if this is actually to restrictive,
>> because it means applications like Samba won't
>> be able to use io-uring at all.
>
> Yea, I think it is too restrictive. In fact, it broke my WIP branch to
> make postgres use io_uring.
>
>
> Postgres uses a forked process model, with all sub-processes forked off
> one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
> memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
> has postmaster create a number of io_urings that then the different
> processes can use (with locking if necessary).
>
> In plenty of the cases it's fairly important for performance to not
> require an additional context switch initiate IO, therefore we cannot
> delegate submitting to an io_uring to separate process. But it's not
> feasible to have one (or even two) urings for each process either: For
> one, that's just about guaranteed to bring us over the default
> RLIMIT_MEMLOCK limit, and requiring root only config changes is not an
> option for many (nor user friendly).
>
>
> Not sharing queues makes it basically impossible to rely on io_uring
> ordering properties when operation interlock is needed. E.g. to
> guarantee that the journal is flushed before some data buffer can be
> written back, being able to make use of links and drains is great - but
> there's one journal for all processes. To be able to guarantee anything,
> all the interlocked writes need to go through one io_uring. I've not yet
> implemented this, so I don't have numbers, but I expect pretty
> significant savings.
>
>
> Not being able to share urings also makes it harder to resolve
> deadlocks:
>
> As we call into both library and user defined code, we cannot guarantee
> that a specific backend process will promptly (or at all, when waiting
> for some locks) process cqes. There's also sections where we don't want
> to constantly check for ready events, for performance reasons. But
> operations initiated by a process might be blocking other connections:
>
> E.g. process #1 might have initiated transferring a number of blocks
> into postgres' buffer pool via io_uring , and now is busy processing the
> first block that completed. But now process #2 might need one of the
> buffers that had IO queued, but didn't complete in time for #1 to see
> the results. The way I have it set up right now, #2 simply can process
> pending cqes in the relevant queue. Which, in this example, would mark
> the pg buffer pool entry as valid, allowing #2 to continue.
>
> Now, completions can still be read by all processes, so I could continue
> to do the above: But that'd require all potentially needed io_urings to
> be set up in postmaster, before the first fork, and all processes to
> keep all those FDs open (commonly several hundred). Not an attractive
> option either, imo.
>
> Obviously we could solve this by having a sqe result processing thread
> running within each process - but that'd be a very significant new
> overhead. And it'd require making some non-threadsafe code threadsafe,
> which I do not relish tackling as a side-effect of io_uring adoption.
>
>
> It also turns out to be nice from a performance / context-switch rate
> angle to be able to process cqes for submissions by other
> processes. Saves an expensive context switch, and often enough it really
> doesn't matter which process processes the completion (especially for
> readahead). And in other cases it's cheap to just schedule the
> subsequent work from the cqe processor, e.g. initiating readahead of a
> few more blocks into the pg buffer pool. Similarly, there are a few
> cases where it's useful for several processes to submit IO into a uring
> primarily drained by one specific process, to offload the subsequent
> action, if that's expensive
>
>
> Now, I think there's a valid argument to be made that postgres should
> just use threads, and not be hampered by any of this. But a) that's not
> going to happen all that soon, it's a large change, b) it's far from
> free from problems either, especially scalability on larger machines,
> and robustness.
>
>
>> As even if current_cred() and ctx->creds describe the same
>> set of uid,gids the != won't ever match again and
>> makes the whole ring unuseable.
>
> Indeed. It also seems weird that a sqpoll now basically has different
> semantics, allowing the io_uring to be used by multiple processes - a
> task with a different mm can still wake the sqpoll thread up, even.
>
> Since the different processes attached still can still write to the
> io_uring mmaped memory, they can still queue sqes, they just can't
> initiate the processing. But the next time the creator of the uring
> submits, they will still be - and thus it seems that the kernel needs to
> handle this safely. So I really don't get what this actually achieves?
> Am I missing something here?
Thanks for your detailed reported, I'm going to revert this change for
5.5.
--
Jens Axboe