On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
Problem description
===================
The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes(See
function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list().
Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may
do inode lookup in the inode evicting callback function, if the inode
lookup is operated under the inode lru traversing context, deadlock
problems may happen.
Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
if ea_inode feature is enabled, the lookup process will be stuck under
the evicting context like this:
1. File A has inode i_reg and an ea inode i_ea
2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
3. Then, following three processes running like this:
PA PB
echo 2 > /proc/sys/vm/drop_caches
shrink_slab
prune_dcache_sb
// i_reg is added into lru, lru->i_ea->i_reg
prune_icache_sb
list_lru_walk_one
inode_lru_isolate
i_ea->i_state |= I_FREEING // set inode state
i_ea->i_state |= I_FREEING // set inode state
Um, I don't see how this can happen. If the ea_inode is in use,
i_count will be greater than zero, and hence the inode will never be
go down the rest of the path in inode_lru_inode():
if (atomic_read(&inode->i_count) ||
...) {
list_lru_isolate(lru, &inode->i_lru);
spin_unlock(&inode->i_lock);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
}
Do you have an actual reproduer which triggers this? Or would this
happen be any chance something that was dreamed up with DEPT?
No, it looks like a real problem and I agree with the analysis. We don't
hold ea_inode reference (i.e., ea_inode->i_count) from a normal inode. The
normal inode just owns that that special on-disk xattr reference. Standard
inode references are acquired and dropped as needed.
And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
evict() needs to lookup the ea_inode and iget() it. So if we are processing
a list of inodes to dispose, all inodes have I_FREEING bit already set and
if ea_inode and its parent normal inode are both in the list, then the
evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.
Normally we don't hit this path because LRU list walk is not handling
inodes with 0 link count. But a race with unlink can make that happen with
iput() from inode_lru_isolate().
I'm pondering about the best way to fix this. Maybe we could handle the
need for inode pinning in inode_lru_isolate() in a similar way as in
writeback code so that last iput() cannot happen from inode_lru_isolate().
In writeback we use I_SYNC flag to pin the inode and evict() waits for this
flag to clear. I'll probably sleep to it and if I won't find it too
disgusting to live tomorrow, I can code it.