Re: [PATCH] ocfs2: fix orphan inode disk leak in ocfs2_dio_end_io() on I/O error

From: Marco Elver

Date: Fri Jun 12 2026 - 09:03:18 EST


On Fri, Jun 12, 2026 at 09:27AM +0800, Heming Zhao wrote:
> On Thu, Jun 11, 2026 at 05:01:50PM +0200, Marco Elver wrote:
> > When an extending direct I/O write or a direct I/O write racing with an
> > unlink is initiated, ocfs2_direct_IO() places the user inode into the
> > system orphan directory and sets the OCFS2_DIO_ORPHANED_FL flag to
> > ensure defined behavior and crash consistency.
> >
> > However, if the direct I/O request encounters an error or gets
> > asynchronous cancellation (bytes <= 0), the VFS completion hook
> > ocfs2_dio_end_io() bypasses ocfs2_dio_end_io_write() entirely and
> > executes ocfs2_dio_free_write_ctx(). This completely omits the teardown
> > of the orphan entry, leaking the user inode in the orphan directory and
> > leaving the OCFS2_DIO_ORPHANED_FL disk flag set.
> >
> > Because the OCFS2_DIO_ORPHANED_FL flag remains active, subsequent VFS
> > final inode eviction (ocfs2_delete_inode) observes the flag, assumes a
> > direct I/O write is actively in progress, and refuses to wipe the inode.
> > This results in an irrecoverable disk storage and resource leak that can
> > only be reclaimed if the cluster unmounts or crashes.
> >
> > Fix this by ensuring that ocfs2_dio_end_io() inspects dw_orphaned even
> > when an I/O error occurs, and executes ocfs2_del_inode_from_orphan() to
> > liberate the inode before destroying the in-memory write context.
> >
> > Fixes: 5040f8df56fb ("ocfs2: free up write context when direct IO failed")
> > Assisted-by: Antigravity:Gemini
> > Signed-off-by: Marco Elver <elver@xxxxxxxxxx>
> > ---
> > fs/ocfs2/aops.c | 17 +++++++++++++++--
> > 1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> > index 4acdbb70882c..ad3f2057e26e 100644
> > --- a/fs/ocfs2/aops.c
> > +++ b/fs/ocfs2/aops.c
> > @@ -2419,11 +2419,24 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
> > mlog_ratelimited(ML_ERROR, "Direct IO failed, bytes = %lld",
> > (long long)bytes);
> > if (private) {
> > - if (bytes > 0)
> > + if (bytes > 0) {
> > ret = ocfs2_dio_end_io_write(inode, private, offset,
> > bytes);
> > - else
> > + } else {
> > + struct ocfs2_dio_write_ctxt *dwc = private;
> > +
> > + if (dwc->dw_orphaned) {
> > + struct buffer_head *di_bh = NULL;
> > +
> > + if (ocfs2_inode_lock(inode, &di_bh, 1) == 0) {
> > + ocfs2_del_inode_from_orphan(OCFS2_SB(inode->i_sb),
> > + inode, di_bh, 0, 0);
> > + ocfs2_inode_unlock(inode, 1);
> > + brelse(di_bh);
> > + }
>
> Calling only ocfs2_del_inode_from_orphan() without ocfs2_truncate_file() will
> leave stale blocks beyond the EOF.

Right.

> I think the existing OCFS2 code already handles error/crash cases for orphaned
> inodes, and this "leaking" behavior is by design.
> please refer to ocfs2_recover_orphans() and ocfs2_add_inode_to_orphan().

Periodic scans skip direct I/O entries to avoid racing with active
direct I/O on live nodes:

In fs/ocfs2/journal.c:ocfs2_orphan_filldir():

	/* do not include dio entry in case of orphan scan */
	if ((p->orphan_reco_type == ORPHAN_NO_NEED_TRUNCATE) &&
	    (!strncmp(name, OCFS2_DIO_ORPHAN_PREFIX,
		      OCFS2_DIO_ORPHAN_PREFIX_LEN)))
		return true;
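
To make the effect of that check concrete, here is a minimal userspace
sketch of the predicate (not kernel code). The "dio-" prefix and its
length of 4 are my assumption about what OCFS2_DIO_ORPHAN_PREFIX and
OCFS2_DIO_ORPHAN_PREFIX_LEN expand to; the enum and the function name
scan_skips_entry() are hypothetical stand-ins for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed values; check fs/ocfs2/ocfs2_fs.h for the real definitions. */
#define DIO_ORPHAN_PREFIX     "dio-"
#define DIO_ORPHAN_PREFIX_LEN 4

/* Hypothetical stand-in for the kernel's orphan recovery type. */
enum reco_type { RECO_NEED_TRUNCATE, RECO_NO_NEED_TRUNCATE };

/*
 * Models only the quoted condition: returns true when an orphan
 * directory entry would be skipped, i.e. the scan is of the
 * "no need truncate" kind and the entry name carries the dio prefix.
 */
static bool scan_skips_entry(enum reco_type type, const char *name)
{
	assert(name != NULL);
	return type == RECO_NO_NEED_TRUNCATE &&
	       !strncmp(name, DIO_ORPHAN_PREFIX, DIO_ORPHAN_PREFIX_LEN);
}
```

With this model, scan_skips_entry(RECO_NO_NEED_TRUNCATE, "dio-0000000000001234")
is true, while the same name with RECO_NEED_TRUNCATE, or an unprefixed
name with either type, is not skipped. Which callers pass which type is
not modelled here.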

Is something else recovering those dio entries on a live node?