Re: khugepaged: inconsistent lock {RECLAIM_FS-ON-W} ->{IN-RECLAIM_FS-W} usage

From: Dave Chinner
Date: Tue Aug 23 2011 - 19:58:03 EST


On Tue, Aug 23, 2011 at 03:10:13PM +0300, Sergey Senozhatsky wrote:
> Hello,
>
> [12027.382589] =================================
> [12027.382594] [ INFO: inconsistent lock state ]
> [12027.382600] 3.1.0-rc3-dbg-00548-gba7b8dc #692
> [12027.382603] ---------------------------------
> [12027.382607] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [12027.382614] khugepaged/552 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [12027.382619] (&sb->s_type->i_mutex_key#9){+.+.?.}, at: [<ffffffff811c11f7>] ext4_evict_inode+0xb0/0x51d
> [12027.382640] {RECLAIM_FS-ON-W} state was registered at:
> [12027.382644] [<ffffffff81071f8b>] mark_held_locks+0xc3/0xef
> [12027.382655] [<ffffffff810725a8>] lockdep_trace_alloc+0x9b/0xbd
> [12027.382663] [<ffffffff810ff580>] kmem_cache_alloc+0x2a/0x1a4
> [12027.382671] [<ffffffff8111d972>] __d_alloc+0x22/0x164
> [12027.382679] [<ffffffff8111dd5d>] d_alloc+0x19/0x7a
> [12027.382686] [<ffffffff811132f9>] d_alloc_and_lookup+0x27/0x66
> [12027.382695] [<ffffffff81113dc8>] do_lookup+0x1df/0x2e9
> [12027.382702] [<ffffffff81114566>] link_path_walk+0x1b3/0x738
> [12027.382709] [<ffffffff81115499>] path_lookupat+0x57/0x5ee
> [12027.382717] [<ffffffff81115a53>] do_path_lookup+0x23/0x9f
> [12027.382724] [<ffffffff81116f75>] user_path_at+0x54/0x91
> [12027.382731] [<ffffffff8110dc46>] vfs_fstatat+0x3f/0x69
> [12027.382738] [<ffffffff8110dca1>] vfs_stat+0x16/0x18
> [12027.382745] [<ffffffff8110dd87>] sys_newstat+0x15/0x2e
> [12027.382751] [<ffffffff814965d2>] system_call_fastpath+0x16/0x1b
> [12027.382761] irq event stamp: 9401171
> [12027.382765] hardirqs last enabled at (9401171): [<ffffffff810cee07>] free_hot_cold_page+0x168/0x181
> [12027.382776] hardirqs last disabled at (9401170): [<ffffffff810cecfd>] free_hot_cold_page+0x5e/0x181
> [12027.382785] softirqs last enabled at (9399656): [<ffffffff81045555>] __do_softirq+0x261/0x2ff
> [12027.382795] softirqs last disabled at (9399635): [<ffffffff81497abc>] call_softirq+0x1c/0x30
> [12027.382805]
> [12027.382805] other info that might help us debug this:
> [12027.382809] Possible unsafe locking scenario:
> [12027.382811]
> [12027.382814] CPU0
> [12027.382817] ----
> [12027.382819] lock(&sb->s_type->i_mutex_key);
> [12027.382826] <Interrupt>
> [12027.382829] lock(&sb->s_type->i_mutex_key);
> [12027.382835]
> [12027.382836] *** DEADLOCK ***
> [12027.382838]
> [12027.382842] 2 locks held by khugepaged/552:
> [12027.382846] #0: (shrinker_rwsem){++++..}, at: [<ffffffff810d6465>] shrink_slab+0x37/0x3d0
> [12027.382862] #1: (&type->s_umount_key#16){+++++.}, at: [<ffffffff8110c512>] grab_super_passive+0x52/0x76
> [12027.382880]
> [12027.382881] stack backtrace:
> [12027.382887] Pid: 552, comm: khugepaged Not tainted 3.1.0-rc3-dbg-00548-gba7b8dc #692
> [12027.382891] Call Trace:
> [12027.382902] [<ffffffff81486180>] print_usage_bug+0x28f/0x2a0
> [12027.382913] [<ffffffff8100cb2a>] ? save_stack_trace+0x27/0x44
> [12027.382922] [<ffffffff8106ee4e>] ? print_irq_inversion_bug+0x1cd/0x1cd
> [12027.382929] [<ffffffff8106fa1e>] mark_lock+0x2eb/0x53a
> [12027.382937] [<ffffffff81070321>] __lock_acquire+0x6b4/0x164b
> [12027.382945] [<ffffffff8106d318>] ? __bfs+0x23/0x1c7
> [12027.382952] [<ffffffff8106f62c>] ? check_irq_usage+0x99/0xab
> [12027.382960] [<ffffffff810711df>] ? __lock_acquire+0x1572/0x164b
> [12027.382968] [<ffffffff811c11f7>] ? ext4_evict_inode+0xb0/0x51d
> [12027.382975] [<ffffffff8107186e>] lock_acquire+0x138/0x1ac
> [12027.382982] [<ffffffff811c11f7>] ? ext4_evict_inode+0xb0/0x51d
> [12027.382990] [<ffffffff811c11f7>] ? ext4_evict_inode+0xb0/0x51d
> [12027.382998] [<ffffffff8148e953>] mutex_lock_nested+0x5e/0x325
> [12027.383005] [<ffffffff811c11f7>] ? ext4_evict_inode+0xb0/0x51d
> [12027.383013] [<ffffffff81120013>] ? evict+0x64/0x15c
> [12027.383022] [<ffffffff81256185>] ? do_raw_spin_lock+0x6b/0x122
> [12027.383030] [<ffffffff811c11f7>] ext4_evict_inode+0xb0/0x51d
> [12027.383038] [<ffffffff811c1147>] ? ext4_da_writepages+0x6df/0x6df
> [12027.383045] [<ffffffff81120050>] evict+0xa1/0x15c
> [12027.383051] [<ffffffff81120137>] dispose_list+0x2c/0x38
> [12027.383059] [<ffffffff811211c6>] prune_icache_sb+0x28c/0x29b
> [12027.383067] [<ffffffff8110c60b>] prune_super+0xd5/0x140
> [12027.383074] [<ffffffff810d6624>] shrink_slab+0x1f6/0x3d0
> [12027.383083] [<ffffffff810d9448>] do_try_to_free_pages+0x1ae/0x330
> [12027.383091] [<ffffffff810d979b>] try_to_free_pages+0x110/0x241
> [12027.383099] [<ffffffff810ce50a>] __alloc_pages_nodemask+0x4d2/0x801
> [12027.383108] [<ffffffff814902ea>] ? _raw_spin_unlock_irqrestore+0x56/0x74
> [12027.383116] [<ffffffff81102510>] khugepaged_alloc_hugepage+0x50/0xdd
> [12027.383127] [<ffffffff8105dc32>] ? __init_waitqueue_head+0x46/0x46
> [12027.383134] [<ffffffff81102aa1>] khugepaged+0x82/0xff5
> [12027.383141] [<ffffffff8148cadd>] ? schedule+0x353/0xa7e
> [12027.383150] [<ffffffff8105dc32>] ? __init_waitqueue_head+0x46/0x46
> [12027.383157] [<ffffffff81102a1f>] ? khugepaged_defrag_store+0x57/0x57
> [12027.383164] [<ffffffff8105d3e6>] kthread+0x9a/0xa2
> [12027.383173] [<ffffffff814979c4>] kernel_thread_helper+0x4/0x10
> [12027.383181] [<ffffffff8102d6d6>] ? finish_task_switch+0x76/0xf0
> [12027.383188] [<ffffffff814907b8>] ? retint_restore_args+0x13/0x13
> [12027.383196] [<ffffffff8105d34c>] ? __init_kthread_worker+0x53/0x53
> [12027.383203] [<ffffffff814979c0>] ? gs_change+0x13/0x13

Looks like ext4 is still treading in the footsteps of XFS without
understanding where they lead.... :/

ext4 is taking the i_mutex in .evict, which means that it can be
called from memory reclaim context. The ext4 developers need to
determine if this is indeed safe and deadlock free - it probably is
safe if a reference is held on the inode everywhere else the i_mutex
is taken. If it is safe, ext4 needs to add lockdep re-initialisation
to the i_mutex in .evict before it gets taken in the .evict path.

In XFS these reports are all false positives, so we need lockdep
annotations to prevent them. When we first allocate the inode, we
initialise the iolock with a specific class:

mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
lockdep_set_class_and_name(&ip->i_iolock.mr_lock,
&xfs_iolock_active, "xfs_iolock_active");

XFS then does this for .evict:

STATIC void
xfs_fs_evict_inode(
struct inode *inode)
{
xfs_inode_t *ip = XFS_I(inode);

trace_xfs_evict_inode(ip);

truncate_inode_pages(&inode->i_data, 0);
end_writeback(inode);
XFS_STATS_INC(vn_rele);
XFS_STATS_INC(vn_remove);
XFS_STATS_DEC(vn_active);

/*
* The iolock is used by the file system to coordinate reads,
* writes, and block truncates. Up to this point the lock
* protected concurrent accesses by users of the inode. But
* from here forward we're doing some final processing of the
* inode because we're done with it, and although we reuse the
* iolock for protection it is really a distinct lock class
* (in the lockdep sense) from before. To keep lockdep happy
* (and basically indicate what we are doing), we explicitly
* re-init the iolock here.
*/
ASSERT(!rwsem_is_locked(&ip->i_iolock.mr_lock));
mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
lockdep_set_class_and_name(&ip->i_iolock.mr_lock,
&xfs_iolock_reclaimable, "xfs_iolock_reclaimable");

xfs_inactive(ip);
}

And so lockdep treats active inodes and reclaimable inodes as
completely separate lock classes, and hence does not throw false
positive warnings. We also gave them different names so that we know
just from looking at the lockdep warnings which state the inode was
in that triggered the warning (i.e. active or reclaimable).

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/