Re: Bug#1093243: Upgrade to 6.1.123 kernel causes mariadb hangs
From: Jens Axboe
Date: Fri Jan 24 2025 - 15:51:45 EST
On 1/24/25 1:33 PM, Salvatore Bonaccorso wrote:
> Hi Pavel,
>
> On Fri, Jan 24, 2025 at 06:40:51PM +0000, Pavel Begunkov wrote:
>> On 1/24/25 16:30, Xan Charbonnet wrote:
>>> On 1/24/25 04:33, Pavel Begunkov wrote:
>>>> Thanks for narrowing it down. Xan, can you try this change please?
>>>> Waiters can miss wake ups without it, seems to match the description.
>>>>
>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>> index 9b58ba4616d40..e5a8ee944ef59 100644
>>>> --- a/io_uring/io_uring.c
>>>> +++ b/io_uring/io_uring.c
>>>> @@ -592,8 +592,10 @@ static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
>>>> io_commit_cqring(ctx);
>>>> spin_unlock(&ctx->completion_lock);
>>>> io_commit_cqring_flush(ctx);
>>>> - if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
>>>> + if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
>>>> + smp_mb();
>>>> __io_cqring_wake(ctx);
>>>> + }
>>>> }
>>>> void io_cq_unlock_post(struct io_ring_ctx *ctx)
>>>>
>>>
>>>
>>> Thanks Pavel! Early results look very good for this change. I'm now running 6.1.120 with your added smp_mb() call. The backup process which had been quickly triggering the issue has been running longer than it ever did when it would ultimately fail. So that's great!
>>>
>>> One sour note: overnight, replication hung on this machine, which is another failure that started happening with the jump from 6.1.119 to 6.1.123. The machine was running 6.1.124 with the __io_cq_unlock_post_flush function removed completely. That's the kernel we had celebrated yesterday for running the backup process successfully.
>>>
>>> So, we might have two separate issues to deal with, unfortunately.
>>
>> Possible, but it could also be a side effect of reverting the patch.
>> As usual, in most cases patches are ported either because they're
>> fixing sth or other fixes depend on it, and it's not yet apparent
>> to me what happened with this one.
>
> I researched bit the lists, and there was the inclusion request on the
> stable list itself. Looking into the io-uring list I found
> https://lore.kernel.org/io-uring/CADZouDRFJ9jtXHqkX-PTKeT=GxSwdMC42zEsAKR34psuG9tUMQ@xxxxxxxxxxxxxx/
> which I think was the trigger to later on include in fact the commit
> in 6.1.120.
Yep indeed, was just looking for the backstory and that is why it got
backported. Just missed the fact that it should've been an
io_cqring_wake() rather than __io_cqring_wake()...
--
Jens Axboe