Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

From: Dmitry Osipenko
Date: Fri Mar 02 2018 - 11:39:39 EST


On 01.03.2018 23:20, Andreas Dilger wrote:
>
> On Mar 1, 2018, at 9:04 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>> This doesn't seem to make sense; the PC is where we are currently
>> executing, and LR is the "Link Register" where the flow of control
>> will be returning after the current function returns, right? Well,
>> dx_probe should *not* be returning to __wait_on_bit(). So this just
>> seems.... weird.
>>
>> Ignoring the LR register, this stack trace looks sane... I can't see
>> which pointer could be NULL and getting dereferenced, though. How
>> easily can you reproduce the problem? Can you either (a) translate
>> the PC into a line number, or better yet, if you can reproduce, add a
>> series of BUG_ON's so we can see what's going on?

Ted, thank you for the suggestion. I don't have a bug-reproducer, it happens
only under some IO load and quite randomly. I've applied the BUG_ON()'s, but it
may take some time to catch the bug again.

>> + BUG_ON(frame);
>
> I think you mean:
> BUG_ON(frame == NULL);
> or
> BUG_ON(!frame);
>
>
>> memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
>> frame->bh = ext4_read_dirblock(dir, 0, INDEX);
>> if (IS_ERR(frame->bh))
>> return (struct dx_frame *) frame->bh;
>>
>> + BUG_ON(frame->bh);
>> + BUG_ON(frame->bh->b_data);
>
> Same here.
>
> BUG_ON(frame->bh == NULL);
> BUG_ON(frame->bh->b_data == NULL);
>
> This is why I don't like implicit "is NULL" or "is non-zero" usage. Lustre
> used to require "== NULL" or "!= NULL" to avoid bugs like this, but had to
> abandon that because of upstream code style.

Well spotted, thanks Andreas.

>> root = (struct dx_root *) frame->bh->b_data;
>> if (root->info.hash_version != DX_HASH_TEA &&
>> root->info.hash_version != DX_HASH_HALF_MD4 &&
>> root->info.hash_version != DX_HASH_LEGACY) {
>>
>> These are "could never" happen scenarios from looking at the code, but
>> that will help explain what is going on.
>>
>> If this is reliably only happening with mq, the only way I could see
>> that if is something is returning an error when it previously wasn't.
>> This isn't a problem we're seeing with any of our testing, though.