Re: [GIT PULL] io_uring changes for 5.9-rc1

From: Jens Axboe
Date: Mon Aug 03 2020 - 18:30:21 EST

On 8/3/20 2:48 PM, Linus Torvalds wrote:
> On Sun, Aug 2, 2020 at 2:41 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>> Lots of cleanups in here, hardening the code and/or making it easier to
>> read and fixing buts, but a core feature/change too adding support for
>> real async buffered reads. With the latter in place, we just need
>> buffered write async support and we're done relying on kthreads for the
>> fast path. In detail:
> That async buffered reads handling the the page locking flag is a
> mess, and I'm really happy I committed my page locking scalability
> change early, so that the conflicts there caught it.
> Re-using the page bit waitqueue types and exporting them?
> That part is fine, I guess, particularly since it came from the
> wait_bit_key thing and have a smell of being generic.
> Taking a random part of wake_page_function(), and calling it
> "wake_page_match()" even though that's not at all that it does?
> Not ok.

OK, I actually thought it was kind of nice and better than having that
code duplicated in two spots.

> Adding random kiocb helper functions to a core header file, when they
> are only used in one place, and when they only make sense in that one
> place?
> Not ok.

I'll move that into io_uring instead.

> When the function is called "wake_page_match()", you'd expect it
> matches the wake page information, wouldn't it?
> Yeah, it did that. And then it also checked whether the bit we're
> waiting had been set again, because everybody ostensibly wanted that.
> Except they don't any more, and that's not what the name really
> implied anyway.
> And kiocb_wait_page_queue_init() has absolutely zero business being in
> <linux/filemap.h>. There are absolutely no valid uses of that thing
> outside of the one place that calls it.
> I tried to fix up the things I could.

Thanks! As mentioned, I'll prep a cleanup patch that moves the
kiocb_wait_page_queue_init() out of there.

> That said, like a lot of io_uring code, this is some seriously opaque
> code. You say you've done a lot of cleanups, but I'm not convinced
> those cleanups are in any way offsetting adding yet another union (how
> many bugs did the last one add?) and a magic flag of "use this part of
> the union" now.

I had to think a bit about what you are referring to here, but I guess
it's the iocb part. And yes, it's not ideal, but until we support polled
IO with buffered IO, then there's no overlap in the use case. And I
don't see us ever doing that.

> And I don't know what loads you use for testing that thing, or what
> happens when the "lock_page_async()" case actually fails to lock, and
> just calls back the io_async_buf_func() wakeup function when the page
> has unlocked...
> That function doesn't actually lock the page either, but does the task
> work. I hope that work then knows to do the right thing, but it's
> really opaque and hard to follow.

The task work retries the whole thing, so it'll go through the normal
page cache read path again with all that that entails. We only really
ever use the callback to tell us when it's a good idea to retry again,
there's no other retained state there at all.

I didn't realize that part wasn't straight forward, so I'll add some
comments as well explaining how that code flow works.

It's seen a good amount of testing, both from myself and also from
others. The postgres IO rewrite has been putting it through its paces,
and outside of a few initial issues months ago, it's been rock solid.

> Anyway, I'm not entirely happy with doing these kinds of changes in
> the merge resolution, but the alternative was to not do the pull at
> all, and require you to do a lot of cleanups before I would pull it.
> Maybe I should have done that.
> So this is a slightly grumpy email about how io_uring is (a) still
> making me very nervous about a very lackadaisical approach to things,
> and having the codepaths so obscure that I'm not convinced it's not
> horribly buggy. And (b) I fixed things up without really being able to
> test them. I tested that the _normal_ paths still seem to work fine,
> but..

I need to do a better job at commenting these parts, obviously. And
while nothing is perfect, and we're definitely perfect yet, the general
trend is definitely strongly towards getting rid of odd states through
flags and unifying more of the code, and tons of fixes/cleanups that
make things easier to read and verify...

> I really think that whole thing needs a lot of comments, particularly
> around the whole io_rw_should_retry() area.
> A bit and legible comment about how it will be caught by the
> generic_file_buffered_read() page locking code, how the two cases
> differ (it might get caught by the "I'm just waiting for it to be
> unlocked", but it could *also* get caught by the "lock page now"
> case), and how it continues the request.

Noted, I'll write that up.

> As it is, it bounces between the generic code and very io_uring
> specific code in strange and not easy to follow ways.
> I've pushed out my merge of this thing, but you might also want to
> take a look at commit 2a9127fcf229 ("mm: rewrite
> wait_on_page_bit_common() logic"). I particular, the comment about how
> there's no point in even testing the page bit any more when you get
> woken up.
> I left that
> if (test_bit(key->bit_nr, &key->page->flags))
> return -1;
> logic in io_async_buf_func() (but it's not in "wake_page_match()" any
> more), but I suspect it's bogus and pointless, for the same reason
> that it isn't done for normal page waits now. Maybe it's better to
> just queue the actual work regardless, it will then be caught in the
> _real_ lock_page() or whatever it ends up doing - and if it only
> really wants to see the "uptodate" bit being set, and was just waiting
> for IO to finish, then it never really cared about the page lock bit
> at all, it just wanted to be notified about IO being done.

I just did notice your rewrite commit, and I'll adjust accordingly and
test it with that too.

> So this was a really long email to tell you - again - that I'm not
> happy with how fragile io_uring is, and how the code seems to be
> almost intentionally written to *be* fragile. Complex and hard to
> understand, and as a result it has had a fairly high rate of fairly
> nasty bugs.
> I'm hoping this isn't going to be yet another case of "nasty bugs
> because of complexity and a total disregard for explaining what is
> going on".

Outside of the review from Johannes, lots of other people did look over
the async buffered bits, and Andrew as well said it look good to him. So
while the task_work retry apparently isn't as obvious as I had hoped,
it's definitely not fragile or intentionally trying to be obtuse.

I'll make a few adjustments based on your feedback, and add a patch with
some comments as well. Hopefully that'll make the end result easier to

Jens Axboe