Re: [PATCH v7 3/4] ext4: introduce ext4_put_ea_inode() for safe deferred iput
From: Jan Kara
Date: Wed Jun 17 2026 - 14:47:15 EST
On Tue 16-06-26 23:15:57, Yun Zhou wrote:
> Calling iput() on EA inodes while holding xattr_sem or a jbd2 handle
> can trigger write_inode_now() -> ext4_writepages() -> s_writepages_rwsem,
> creating a lock ordering issue during mount (!SB_ACTIVE).
>
> Add ext4_put_ea_inode() which safely releases EA inode references:
> when SB_ACTIVE, it calls iput() directly (write_inode_now cannot be
> triggered); during mount (!SB_ACTIVE), it queues the inode on a per-sb
> lock-free llist and schedules a worker to call iput() in a clean
> context without holding any ext4 locks.
>
> Convert the iput in ext4_xattr_block_set()'s "Drop the previous xattr
> block" path to use ext4_xattr_inode_array_free_deferred(), which
> releases EA inodes via ext4_put_ea_inode(). This path previously called
> ext4_xattr_inode_array_free() (synchronous iput) while holding xattr_sem
> and a jbd2 handle.
>
> The worker is flushed in ext4_put_super() before journal destruction to
> ensure all pending EA inode cleanup completes while the journal is still
> available.
>
> Signed-off-by: Yun Zhou <yun.zhou@xxxxxxxxxxxxx>
Yes, this goes in the direction I intended. But I have couple of
suggestions:
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..690202303269 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1706,6 +1706,11 @@ struct ext4_sb_info {
> struct ext4_es_stats s_es_stats;
> struct mb_cache *s_ea_block_cache;
> struct mb_cache *s_ea_inode_cache;
> +
> + /* Deferred iput for EA inodes to avoid lock ordering issues */
> + struct llist_head s_ea_inode_to_free;
> + struct work_struct s_ea_inode_work;
> +
I'd probably use delayed work and schedule it with a delay of one jiffie so
that some inodes can accumulate before we process them which should reduce
the amount of task switching to workqueues.
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..b777bb0a81ea 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1308,6 +1308,9 @@ static void ext4_put_super(struct super_block *sb)
> destroy_workqueue(sbi->rsv_conversion_wq);
> ext4_release_orphan_info(sb);
>
> + /* Flush deferred EA inode iputs before destroying journal */
> + flush_work(&sbi->s_ea_inode_work);
> +
This should happen earlier in ext4_put_super(). At this place quotas were
already turned off and so quota accounting would go wrong.
> @@ -3025,6 +3028,74 @@ void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *ea_inode_array)
> kfree(ea_inode_array);
> }
>
> +static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
> + struct ext4_xattr_inode_array *array)
> +{
> + int idx;
> +
> + if (array == NULL)
> + return;
> +
> + for (idx = 0; idx < array->count; ++idx)
> + ext4_put_ea_inode(sb, array->inodes[idx]);
> + kfree(array);
> +}
The array of EA inodes used in xattr handling is just another mechanism
used for delaying iput() of EA inodes. It doesn't make sense to stack these
to one on top of another. Just completely replace the array mechanism with
always deferring iput of EA inode into the workqueue.
> +struct ext4_ea_iput_entry {
> + struct llist_node node;
> + struct inode *inode;
> +};
> +
> +/*
> + * Worker function for deferred EA inode iput. Processes all inodes queued
> + * on s_ea_inode_to_free in a context free of xattr_sem/jbd2 handle locks.
> + */
> +void ext4_ea_inode_work(struct work_struct *work)
> +{
> + struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
> + s_ea_inode_work);
> + struct llist_node *node = llist_del_all(&sbi->s_ea_inode_to_free);
> + struct llist_node *next;
> +
> + while (node) {
> + struct ext4_ea_iput_entry *entry = container_of(node,
> + struct ext4_ea_iput_entry, node);
> + next = node->next;
> + iput(entry->inode);
> + kfree(entry);
> + node = next;
> + }
> +}
Allocating ext4_ea_iput_entry for dropping each inode is somewhat wasteful.
I want to suggest another scheme (somewhat more involved but more efficient
scheme):
1) Create a VFS helper bool iput_if_not_last(struct inode *inode) which
drops inode reference if it is not the last one (and returns true in that
case). Basically:
bool iput_if_not_last(struct inode *inode)
{
return atomic_add_unless(&inode->i_count, -1, 1);
}
This needs to be a separate patch as it should get vetting from VFS
maintainers.
2) Use iput_if_not_last() in ext4_put_ea_inode(). If it returns true, we
are done. Otherwise we know we were at least for a moment holders of the
last inode reference, so we link the inode to the list of inodes to drop
through llist_node embedded in ext4_inode_info. We cannot race with anybody
else trying to link the same inode into the list because we hold one inode
ref and so nobody else can hit this "I was holding the last ref" path.
I'd union this llist_node say with xattr_sem which is unused for EA inodes
to avoid growing ext4_inode_info.
This way we avoid offloading unless really necessary and we don't have to
do allocations just to drop EA inode ref.
Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR