Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task

From: Andres Freund
Date: Sun Jan 26 2020 - 00:55:16 EST

Next message: Dexuan Cui: "[PATCH v4 4/4] hv_utils: Add the support of hibernation"
Previous message: Dexuan Cui: "[PATCH v4 1/4] Tools: hv: Reopen the devices if read() or write() returns errors"
In reply to: Jens Axboe: "Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task"
Next in thread: Jens Axboe: "Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
> > From: Jens Axboe <axboe@xxxxxxxxx>
> >
> > commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
> >
> > If the credentials or the mm doesn't match, don't allow the task to
> > submit anything on behalf of this ring. The task that owns the ring can
> > pass the file descriptor to another task, but we don't want to allow
> > that task to submit an SQE that then assumes the ring mm and creds if
> > it needs to go async.
> >
> > Cc: stable@xxxxxxxxxxxxxxx
> > Suggested-by: Stefan Metzmacher <metze@xxxxxxxxx>
> > Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> >
> >
> > ---
> > fs/io_uring.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
> > wake_up(&ctx->sqo_wait);
> > submitted = to_submit;
> > } else if (to_submit) {
> > + if (current->mm != ctx->sqo_mm ||
> > + current_cred() != ctx->creds) {
> > + ret = -EPERM;
> > + goto out;
> > + }
> > +
>
> I thought about this a bit more.
>
> I'm not sure if this is actually to restrictive,
> because it means applications like Samba won't
> be able to use io-uring at all.

Yea, I think it is too restrictive. In fact, it broke my WIP branch to
make postgres use io_uring.

Postgres uses a forked process model, with all sub-processes forked off
one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
has postmaster create a number of io_urings that then the different
processes can use (with locking if necessary).

In plenty of the cases it's fairly important for performance to not
require an additional context switch initiate IO, therefore we cannot
delegate submitting to an io_uring to separate process. But it's not
feasible to have one (or even two) urings for each process either: For
one, that's just about guaranteed to bring us over the default
RLIMIT_MEMLOCK limit, and requiring root only config changes is not an
option for many (nor user friendly).

Not sharing queues makes it basically impossible to rely on io_uring
ordering properties when operation interlock is needed. E.g. to
guarantee that the journal is flushed before some data buffer can be
written back, being able to make use of links and drains is great - but
there's one journal for all processes. To be able to guarantee anything,
all the interlocked writes need to go through one io_uring. I've not yet
implemented this, so I don't have numbers, but I expect pretty
significant savings.

Not being able to share urings also makes it harder to resolve
deadlocks:

As we call into both library and user defined code, we cannot guarantee
that a specific backend process will promptly (or at all, when waiting
for some locks) process cqes. There's also sections where we don't want
to constantly check for ready events, for performance reasons. But
operations initiated by a process might be blocking other connections:

E.g. process #1 might have initiated transferring a number of blocks
into postgres' buffer pool via io_uring , and now is busy processing the
first block that completed. But now process #2 might need one of the
buffers that had IO queued, but didn't complete in time for #1 to see
the results. The way I have it set up right now, #2 simply can process
pending cqes in the relevant queue. Which, in this example, would mark
the pg buffer pool entry as valid, allowing #2 to continue.

Now, completions can still be read by all processes, so I could continue
to do the above: But that'd require all potentially needed io_urings to
be set up in postmaster, before the first fork, and all processes to
keep all those FDs open (commonly several hundred). Not an attractive
option either, imo.

Obviously we could solve this by having a sqe result processing thread
running within each process - but that'd be a very significant new
overhead. And it'd require making some non-threadsafe code threadsafe,
which I do not relish tackling as a side-effect of io_uring adoption.

It also turns out to be nice from a performance / context-switch rate
angle to be able to process cqes for submissions by other
processes. Saves an expensive context switch, and often enough it really
doesn't matter which process processes the completion (especially for
readahead). And in other cases it's cheap to just schedule the
subsequent work from the cqe processor, e.g. initiating readahead of a
few more blocks into the pg buffer pool. Similarly, there are a few
cases where it's useful for several processes to submit IO into a uring
primarily drained by one specific process, to offload the subsequent
action, if that's expensive

Now, I think there's a valid argument to be made that postgres should
just use threads, and not be hampered by any of this. But a) that's not
going to happen all that soon, it's a large change, b) it's far from
free from problems either, especially scalability on larger machines,
and robustness.

> As even if current_cred() and ctx->creds describe the same
> set of uid,gids the != won't ever match again and
> makes the whole ring unuseable.

Indeed. It also seems weird that a sqpoll now basically has different
semantics, allowing the io_uring to be used by multiple processes - a
task with a different mm can still wake the sqpoll thread up, even.

Since the different processes attached still can still write to the
io_uring mmaped memory, they can still queue sqes, they just can't
initiate the processing. But the next time the creator of the uring
submits, they will still be - and thus it seems that the kernel needs to
handle this safely. So I really don't get what this actually achieves?
Am I missing something here?

Greetings,

Andres Freund

Next message: Dexuan Cui: "[PATCH v4 4/4] hv_utils: Add the support of hibernation"
Previous message: Dexuan Cui: "[PATCH v4 1/4] Tools: hv: Reopen the devices if read() or write() returns errors"
In reply to: Jens Axboe: "Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task"
Next in thread: Jens Axboe: "Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]