Re: [PATCH 0/13] Per-bdi writeback flusher threads #3

From: Jan Kara
Date: Tue Apr 21 2009 - 09:25:38 EST


On Fri 17-04-09 15:07:03, Jens Axboe wrote:
> On Mon, Apr 13 2009, Zhang, Yanmin wrote:
> > On Fri, 2009-04-10 at 09:21 +0200, Jens Axboe wrote:
> > > On Fri, Apr 10 2009, Zhang, Yanmin wrote:
> > > > On Wed, 2009-04-08 at 14:00 +0200, Jens Axboe wrote:
> > > > > Hi,
> > > > >
> > > > > This is the third version of the patchset. Changes since v2:
> > > > Jens,
> > > >
> > > > The patchset forgets kernel/sysctl.c part, so the compilation link fails.
> > >
> > > Yep, noticed that as well. Not really great timing with adding more
> > > pdflush tunables...
> > When I run a fio testing with a partition on one of my machines, I hit below bug
> > checking. It's a ??? fio_mmap_randwrite_4k_preread testing. I'm not sure if it happens
> > after the testing and before the next testing, or just happens when the testing is
> > ongoing. When I see it, next testing stops at the beginning when mounting the partition.
> >
> > ----------------------sys log--------------------
> >
> > kernel BUG at fs/jbd/journal.c:1166!
> > invalid opcode: 0000 [#1] SMP
> > last sysfs file: /sys/block/sdb/stat
> > CPU 0
> > Modules linked in: igb
> > Pid: 18329, comm: umount Not tainted 2.6.30-rc1-bdiflusherv3 #1 X8DTN
> > RIP: 0010:[<ffffffff80344747>] [<ffffffff80344747>] journal_destroy+0x108/0x1b4
> > RSP: 0018:ffff8800bd95fe08 EFLAGS: 00010286
> > RAX: 000000000000c7c7 RBX: ffff8800bc93dc00 RCX: 0000000000000000
> > RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff8800bc93dd64
> > RBP: ffff8800bc93dd64 R08: ffff8800bc93dc88 R09: 00000000bd95fcc0
> > R10: 0000000000000246 R11: 00000000fffffffb R12: ffff8800bd95fe08
> > R13: ffff8800bd95fe20 R14: ffff8800bc93dc98 R15: ffff8800bc93dcc8
> > FS: 00007f3f26c717d0(0000) GS:ffffc20000000000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00007f3f263fe000 CR3: 00000000bc18e000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process umount (pid: 18329, threadinfo ffff8800bd95e000, task ffff8800bcdcdca0)
> > Stack:
> > 0000000000000000 ffff8800bcdcdca0 ffffffff8024ddaa ffff8800bd95fe20
> > ffff8800bd95fe20 ffffffff802e53ac ffff8800bd95fe38 ffff8800bd0e3a00
> > ffff8800a1ff2400 ffff8800bc93b000 ffff8800bb8cb400 0000000000000000
> > Call Trace:
> > [<ffffffff8024ddaa>] ? autoremove_wake_function+0x0/0x2e
> > [<ffffffff802e53ac>] ? mb_cache_shrink+0x22/0x9e
> > [<ffffffff8032fbcc>] ? ext3_put_super+0x29/0x204
> > [<ffffffff802ac7ab>] ? generic_shutdown_super+0x70/0xde
> > [<ffffffff802ac83b>] ? kill_block_super+0x22/0x3a
> > [<ffffffff802ac8fe>] ? deactivate_super+0x5f/0x77
> > [<ffffffff802bf20a>] ? sys_umount+0x2da/0x309
> > [<ffffffff8020ba2b>] ? system_call_fastpath+0x16/0x1b
> > Code: ff 48 8d ab 64 01 00 00 eb 0b fe 45 00 48 89 df e8 cc e8 ff ff 48 89 ef e8 2f e8 3d 00 48 83 7b 60 00 75 e6 48 83 7b 50 00 74 04 <0f> 0b eb fe 48 83 7b 58 00 74 04 0f 0b eb fe fe 83 64 01 00 00
> > RIP [<ffffffff80344747>] journal_destroy+0x108/0x1b4
> > RSP <ffff8800bd95fe08>
> > ---[ end trace 14f5ccbafee46eb4 ]---
>
> This is very strange, I have about zero idea why this would trigger. It
> seems to be this:
>
> J_ASSERT(journal->j_running_transaction == NULL);
>
> in journal_destroy(), presumably complaining that we are trying to kill
> a journal that has a running transaction (yes, me read code good).
> Jan Kara is on the CC and he appears to have touched that file the most
> recently. Jan, any ideas on what might make this trigger?
I've looked into this. What I think is happening:
we umount the filesystem
fsync_super(sb)
__fsync_super(sb)
sync_inode_sb(sb, 0) -> submits some inodes
...
ext3_sync_fs() -> wait for transaction commit
...
sync_inodes_sb(sb, 1) -> submits further IO which did not happen before
-> transaction started
ext3_put_super()
journal_destroy() -> BUG

So obviously it has changed how sync_inodes_sb() work. Now the fact that
ext3 relies on all the IO being submitted during sync_inode_sb() is fragile
to say the least (although during umount not that much) - actually a
similar problem is in sys_sync() path... Attached patch should fix it -
it's not final since the sys_sync() path needs some more tweaking but it
should work at least for common filesystems ;).

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR