Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)
From: Theodore Ts'o
Date: Thu Mar 01 2018 - 11:04:48 EST
On Thu, Mar 01, 2018 at 10:55:37AM +0200, Adrian Hunter wrote:
> On 27/02/18 11:28, Adrian Hunter wrote:
> > On 26/02/18 23:48, Dmitry Osipenko wrote:
> >> But still something is wrong... I've been getting occasional EXT4 Ooops's, like
> >> the one below, and __wait_on_bit() is always figuring in the stacktrace. It
> >> never happened with blk-mq disabled, though it could be a coincidence and
> >> actually unrelated to blk-mq patches.
> >
> >> [ 6625.992337] Unable to handle kernel NULL pointer dereference at virtual
> >> address 0000001c
> >> [ 6625.993004] pgd = 00b30c03
> >> [ 6625.993257] [0000001c] *pgd=00000000
> >> [ 6625.993594] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> >> [ 6625.994022] Modules linked in:
> >> [ 6625.994326] CPU: 1 PID: 19355 Comm: dpkg Not tainted
> >> 4.16.0-rc2-next-20180220-00095-ge9c9f5689a84-dirty #2090
> >> [ 6625.995078] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> >> [ 6625.995595] PC is at dx_probe+0x68/0x684
> >> [ 6625.995947] LR is at __wait_on_bit+0xac/0xc8
This doesn't seem to make sense; the PC is where we are currently
executing, and LR is the "Link Register" where the flow of control
will be returning after the current function returns, right? Well,
dx_probe should *not* be returning to __wait_on_bit(). So this just
seems.... weird.
Ignoring the LR register, this stack trace looks sane... I can't see
which pointer could be NULL and getting dereferenced, though. How
easily can you reproduce the problem? Can you either (a) translate
the PC into a line number, or better yet, (b) if you can reproduce,
add a series of BUG_ON's so we can see what's going on?
+	BUG_ON(!frame);
	memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
	frame->bh = ext4_read_dirblock(dir, 0, INDEX);
	if (IS_ERR(frame->bh))
		return (struct dx_frame *) frame->bh;
+	BUG_ON(!frame->bh);
+	BUG_ON(!frame->bh->b_data);
	root = (struct dx_root *) frame->bh->b_data;
	if (root->info.hash_version != DX_HASH_TEA &&
	    root->info.hash_version != DX_HASH_HALF_MD4 &&
	    root->info.hash_version != DX_HASH_LEGACY) {
These are "could never happen" scenarios from looking at the code, but
that will help explain what is going on.
If this is reliably only happening with mq, the only way I could see
that is if something is returning an error when it previously wasn't.
This isn't a problem we're seeing with any of our testing, though.
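For reference, here is a minimal userspace sketch of the error-pointer
convention that the IS_ERR() check on frame->bh relies on. The helpers
are simplified stand-ins modelled on include/linux/err.h, not the
kernel's own definitions, and fake_read_dirblock() is a hypothetical
function that exists only for this illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno value that can be encoded in a pointer (as in the kernel). */
#define MAX_ERRNO 4095

/* Simplified userspace stand-ins modelled on include/linux/err.h. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* True only for the top MAX_ERRNO addresses; NULL is NOT an error pointer. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical helper for illustration only: returns buf on success,
 * or an encoded -EIO when the caller asks for a failure.
 */
static inline void *fake_read_dirblock(void *buf, int fail)
{
	assert(buf != NULL);
	return fail ? ERR_PTR(-EIO) : buf;
}
```

With these definitions, IS_ERR(fake_read_dirblock(buf, 1)) is true and
PTR_ERR() on that value gives -EIO, while IS_ERR(NULL) is false. So an
encoded error is caught by the check, but a plain NULL would pass it and
only fault on the next dereference.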
Cheers,
- Ted