Re: [PATCH v2 1/1] ocfs2: split transactions in dio completion to avoid credit exhaustion

From: Heming Zhao

Date: Tue Mar 24 2026 - 02:00:17 EST


ping...

On Wed, Mar 18, 2026 at 02:40:45PM +0800, Heming Zhao wrote:
> On Fri, Mar 13, 2026 at 12:53:32PM +0800, Joseph Qi wrote:
> >
> >
> > On 3/13/26 11:17 AM, Heming Zhao wrote:
> > > On Fri, Mar 13, 2026 at 10:04:11AM +0800, Joseph Qi wrote:
> > >> Almost looks fine, minor updates below.
> > >>
> > >> On 3/13/26 12:27 AM, Heming Zhao wrote:
> > >>> During ocfs2 dio operations, JBD2 may report warnings via following call trace:
> > >>> ocfs2_dio_end_io_write
> > >>> ocfs2_mark_extent_written
> > >>> ocfs2_change_extent_flag
> > >>> ocfs2_split_extent
> > >>> ocfs2_try_to_merge_extent
> > >>> ocfs2_extend_rotate_transaction
> > >>> ocfs2_extend_trans
> > >>> jbd2__journal_restart
> > >>> start_this_handle
> > >>> output: JBD2: kworker/6:2 wants too many credits credits:5450 rsv_credits:0 max:5449
> > >>>
> > >>> To prevent exceeding the credits limit, modify ocfs2_dio_end_io_write() to
> > >>> handle each extent in a separate transaction.
> > >>>
> > >>> Additionally, relocate ocfs2_del_inode_from_orphan(). The orphan inode should
> > >>> only be removed from the orphan list after the extent tree update is complete.
> > >>> this ensures that if a crash occurs in the middle of extent tree updates, we
> > >>> won't leave stale blocks beyond EOF.
> > >>>
> > >>> This patch also removes the only call to ocfs2_assure_trans_credits(), which
> > >>> was introduced by commit be346c1a6eeb ("ocfs2: fix DIO failure due to
> > >>> insufficient transaction credits").
> > >>>
> > >>> Finally, thanks to Jans for providing the bug fix prototype and suggestions.
> > >>>
> > >>> Suggested-by: Jan Kara <jack@xxxxxxx>
> > >>> Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
> > >>> ---
> > >>> fs/ocfs2/aops.c | 58 ++++++++++++++++++++++++-------------------------
> > >>> 1 file changed, 29 insertions(+), 29 deletions(-)
> > >>>
> > >>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> > >>> index 09146b43d1f0..91997b330d39 100644
> > >>> --- a/fs/ocfs2/aops.c
> > >>> +++ b/fs/ocfs2/aops.c
> > >>> @@ -2294,18 +2294,6 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> > >>> goto out;
> > >>> }
> > >>>
> > >>> - /* Delete orphan before acquire i_rwsem. */
> > >>> - if (dwc->dw_orphaned) {
> > >>> - BUG_ON(dwc->dw_writer_pid != task_pid_nr(current));
> > >>> -
> > >>> - end = end > i_size_read(inode) ? end : 0;
> > >>> -
> > >>> - ret = ocfs2_del_inode_from_orphan(osb, inode, di_bh,
> > >>> - !!end, end);
> > >>> - if (ret < 0)
> > >>> - mlog_errno(ret);
> > >>> - }
> > >>> -
> > >>> down_write(&oi->ip_alloc_sem);
> > >>> di = (struct ocfs2_dinode *)di_bh->b_data;
> > >>>
> > >>> @@ -2326,44 +2314,56 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> > >>>
> > >>> credits = ocfs2_calc_extend_credits(inode->i_sb, &di->id2.i_list);
> > >>>
> > >>> - handle = ocfs2_start_trans(osb, credits);
> > >>> - if (IS_ERR(handle)) {
> > >>> - ret = PTR_ERR(handle);
> > >>> - mlog_errno(ret);
> > >>> - goto unlock;
> > >>> - }
> > >>> - ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
> > >>> - OCFS2_JOURNAL_ACCESS_WRITE);
> > >>> - if (ret) {
> > >>> - mlog_errno(ret);
> > >>> - goto commit;
> > >>> - }
> > >>> -
> > >>> list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
> > >>> - ret = ocfs2_assure_trans_credits(handle, credits);
> > >>> - if (ret < 0) {
> > >>> + handle = ocfs2_start_trans(osb, credits);
> > >>> + if (IS_ERR(handle)) {
> > >>> + ret = PTR_ERR(handle);
> > >>> mlog_errno(ret);
> > >>> break;
> > >>
> > >> I'd rather goto unlock directly without update i_size in case error.
> > >
> > > agree
> > >>
> > >>> }
> > >>> + ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
> > >>> + OCFS2_JOURNAL_ACCESS_WRITE);
> > >>> + if (ret) {
> > >>> + mlog_errno(ret);
> > >>> + ocfs2_commit_trans(osb, handle);
> > >>> + break;
> > >>
> > >> Ditto.
> > >
> > > agree
> > >>
> > >>> + }
> > >>> ret = ocfs2_mark_extent_written(inode, &et, handle,
> > >>> ue->ue_cpos, 1,
> > >>> ue->ue_phys,
> > >>> meta_ac, &dealloc);
> > >>> if (ret < 0) {
> > >>> mlog_errno(ret);
> > >>> + ocfs2_commit_trans(osb, handle);
> > >>> break;
> > >>
> > >> Ditto.
> > >
> > > The existing code still updates i_size even if ocfs2_mark_extent_written()
> > > returns an error. I am not certain whether updating i_size in this case is
> > > correct, but I prefer to maintain the original logic for now.
> > > Does that seem reasonable to you?
> > >
> >
> > Since it returns 0 for unwritten extents, it looks fine.
> >
> > >> BTW, we can move ocfs2_commit_trans() rightly after ocfs2_mark_extent_written()
> > >> to save the extra call in case error.
> > >
> > > agree
> > >
> > > I ran a DIO test using 64MB chunk size (e.g.: fio --bs=64M), the dwc->dw_zero_count
> > > is about 12107 ~ 12457.
> > > Regarding the transaction splitting in the unwritten handling loop: this patch
> > > introduces a minor performance regression. I would like to merge some of the
> > > transaction calls. e.g.: by starting a new transaction every 100 or 200 iterations.
> > > What do you think of this approach?
> > >
> >
> > Seems we can limit max credit for a single transaction here?
> > If exceeds, restart a new one.
> >
> > Joseph,
> > Thanks
>
> sorry for the late reply, I spend some time running the tests.
> I have created two different types of patches (see below) to address this issue.
>
> time consumption results:
>
> ocfs2-dynamic-... patch
> real 2m9.900s
> user 0m0.333s
> sys 0m22.687s
>
> %200 of ocfs2-split-trans-... patch
> real 1m48.901s
> user 0m0.306s
> sys 0m23.019s
>
> %500 of ocfs2-split-trans-... patch
> real 1m49.350s
> user 0m0.299s
> sys 0m22.718s
>
> As shown above, the fixed "500% mode" is faster than the dynamic style.
> Which approach do you prefer?
>
> ---
> the test script:
> ```
> #!/bin/bash
> # thanks Wolfgang
>
> DISK="vdc"
> MOUNTPOINT="/mnt/ocfs2"
> TEST_FILE="$MOUNTPOINT/test.bin"
>
> if lsblk | grep "$DISK.*$MOUNTPOINT"; then
> umount $MOUNTPOINT
> fi
> echo "------mkfs & mount----------------"
> mkfs.ocfs2 -b 4K -C 4K --cluster-stack=pcmk -N 2 /dev/$DISK --cluster-name=hacluster
> mount.ocfs2 -t ocfs2 /dev/$DISK $MOUNTPOINT
>
> sleep 1
> echo "------fallocate----------------"
> fallocate -l 2G "$TEST_FILE"
>
> sleep 1
> echo "------setup file----------------"
> fio --name=setup --filename="$TEST_FILE" --rw=randwrite --bs=4k \
> --ioengine=libaio --iodepth=64 --size=2G --io_size=512M \
> --direct=0 --end_fsync=1 --minimal
>
> sync
> sleep 3
>
> echo "-----start test-----------------"
> time fio --name=tst --filename="$TEST_FILE" --rw=write --bs=64M \
> --ioengine=libaio --iodepth=16 --size=2G --direct=1
> ```
>
> ---
> patch: 0001-ocfs2-dynamic-extending-jbd2-credits-during-dio-end-.patch
> Subject: [PATCH] ocfs2: dynamic extending jbd2 credits during dio end path
>
> ---
> fs/ocfs2/aops.c | 31 +++++++++++++++++--------------
> fs/ocfs2/journal.c | 34 ++++++++++++++++++++++++++++++----
> fs/ocfs2/journal.h | 5 +++--
> 3 files changed, 50 insertions(+), 20 deletions(-)
>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 7a65d5a36a3e..d9be765662dc 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2279,6 +2279,10 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> handle_t *handle = NULL;
> loff_t end = offset + bytes;
> int ret = 0, credits = 0;
> + /* do the same job of jbd2_max_user_trans_buffers() */
> + journal_t *journal = osb->journal->j_journal;
> + int jbd2_max = journal->j_max_transaction_buffers -
> + journal->j_transaction_overhead_buffers;
>
> ocfs2_init_dealloc_ctxt(&dealloc);
>
> @@ -2295,18 +2299,6 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> goto out;
> }
>
> - /* Delete orphan before acquire i_rwsem. */
> - if (dwc->dw_orphaned) {
> - BUG_ON(dwc->dw_writer_pid != task_pid_nr(current));
> -
> - end = end > i_size_read(inode) ? end : 0;
> -
> - ret = ocfs2_del_inode_from_orphan(osb, inode, di_bh,
> - !!end, end);
> - if (ret < 0)
> - mlog_errno(ret);
> - }
> -
> down_write(&oi->ip_alloc_sem);
> di = (struct ocfs2_dinode *)di_bh->b_data;
>
> @@ -2341,8 +2333,10 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> }
>
> list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
> - ret = ocfs2_assure_trans_credits(handle, credits);
> - if (ret < 0) {
> + ret = ocfs2_assure_trans_credits(inode, &handle, di_bh, credits, jbd2_max);
> + if (ret == 1) {
> + goto unlock;
> + } else if (ret < 0) {
> mlog_errno(ret);
> break;
> }
> @@ -2365,6 +2359,15 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> ocfs2_commit_trans(osb, handle);
> unlock:
> up_write(&oi->ip_alloc_sem);
> +
> + if (dwc->dw_orphaned) {
> + BUG_ON(dwc->dw_writer_pid != task_pid_nr(current));
> +
> + ret = ocfs2_del_inode_from_orphan(osb, inode, di_bh, 0, 0);
> + if (ret < 0)
> + mlog_errno(ret);
> + }
> +
> ocfs2_inode_unlock(inode, 1);
> brelse(di_bh);
> out:
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index e5f58ff2175f..57ad69d19494 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -474,14 +474,40 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks)
> * credits. Similar notes regarding data consistency and locking implications
> * as for ocfs2_extend_trans() apply here.
> */
> -int ocfs2_assure_trans_credits(handle_t *handle, int nblocks)
> +int ocfs2_assure_trans_credits(struct inode *inode, handle_t **i_handle,
> + struct buffer_head *bh, int nblocks, int jbd2_max)
> {
> + handle_t *handle = *i_handle;
> int old_nblks = jbd2_handle_buffer_credits(handle);
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + int ret, wanted;
> + int plan = nblocks << 1; /* Double the credits to prevent boundary issues */
>
> trace_ocfs2_assure_trans_credits(old_nblks);
> - if (old_nblks >= nblocks)
> - return 0;
> - return ocfs2_extend_trans(handle, nblocks - old_nblks);
> +
> + wanted = old_nblks + plan;
> +
> + if (wanted < jbd2_max) {
> + if (old_nblks < nblocks)
> + return ocfs2_extend_trans(handle, nblocks - old_nblks);
> + else
> + return 0;
> + }
> +
> + ocfs2_commit_trans(osb, handle);
> + handle = ocfs2_start_trans(osb, plan);
> + if (IS_ERR(handle)) {
> + mlog_errno(PTR_ERR(handle));
> + return 1; /* caller must not commit trans for this error */
> + }
> +
> + ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), bh,
> + OCFS2_JOURNAL_ACCESS_WRITE);
> + if (ret)
> + mlog_errno(ret);
> +
> + *i_handle = handle;
> + return ret;
> }
>
> /*
> diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
> index 6397170f302f..08a76ffb870a 100644
> --- a/fs/ocfs2/journal.h
> +++ b/fs/ocfs2/journal.h
> @@ -244,8 +244,9 @@ handle_t *ocfs2_start_trans(struct ocfs2_super *osb,
> int ocfs2_commit_trans(struct ocfs2_super *osb,
> handle_t *handle);
> int ocfs2_extend_trans(handle_t *handle, int nblocks);
> -int ocfs2_assure_trans_credits(handle_t *handle,
> - int nblocks);
> +int ocfs2_assure_trans_credits(struct inode *inode,
> + handle_t **i_handle, struct buffer_head *bh,
> + int nblocks, int jbd2_max);
> int ocfs2_allocate_extend_trans(handle_t *handle,
> int thresh);
>
> --
> 2.43.0
>
> ----------
> patch: 0001-ocfs2-split-trans-in-end-dio-to-avoid-credit-exhaust.patch
> Subject: [PATCH] ocfs2: split trans in end dio to avoid credit exhaustion
>
> ---
> fs/ocfs2/aops.c | 77 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 45 insertions(+), 32 deletions(-)
>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 7a65d5a36a3e..9053f1ee587a 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2279,6 +2279,7 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> handle_t *handle = NULL;
> loff_t end = offset + bytes;
> int ret = 0, credits = 0;
> + int cnt = 0, trans_start = 0, mod = 500;
>
> ocfs2_init_dealloc_ctxt(&dealloc);
>
> @@ -2295,18 +2296,6 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
> goto out;
> }
>
> - /* Delete orphan before acquire i_rwsem. */
> - if (dwc->dw_orphaned) {
> - BUG_ON(dwc->dw_writer_pid != task_pid_nr(current));
> -
> - end = end > i_size_read(inode) ? end : 0;
> -
> - ret = ocfs2_del_inode_from_orphan(osb, inode, di_bh,
> - !!end, end);
> - if (ret < 0)
> - mlog_errno(ret);
> - }
> -
> down_write(&oi->ip_alloc_sem);
> di = (struct ocfs2_dinode *)di_bh->b_data;
>
> @@ -2327,44 +2316,68 @@ static int ocfs2_dio_end_io_write(struct inode *inode,
>
> credits = ocfs2_calc_extend_credits(inode->i_sb, &di->id2.i_list);
>
> - handle = ocfs2_start_trans(osb, credits);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - mlog_errno(ret);
> - goto unlock;
> - }
> - ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
> - OCFS2_JOURNAL_ACCESS_WRITE);
> - if (ret) {
> - mlog_errno(ret);
> - goto commit;
> - }
> -
> list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
> - ret = ocfs2_assure_trans_credits(handle, credits);
> - if (ret < 0) {
> - mlog_errno(ret);
> - break;
> + if ((cnt++ % mod) == 0) {
> + trans_start = 1;
> + handle = ocfs2_start_trans(osb, credits);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + mlog_errno(ret);
> + goto unlock;
> + }
> + ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
> + OCFS2_JOURNAL_ACCESS_WRITE);
> + if (ret) {
> + mlog_errno(ret);
> + goto unlock;
> + }
> }
> ret = ocfs2_mark_extent_written(inode, &et, handle,
> ue->ue_cpos, 1,
> ue->ue_phys,
> meta_ac, &dealloc);
> + if ((cnt % mod) == 0) {
> + ocfs2_commit_trans(osb, handle);
> + if (ret < 0) {
> + mlog_errno(ret);
> + break;
> + }
> + trans_start = 0;
> + }
> + }
> +
> + //commit if above loop doesn't do
> + if (trans_start) {
> + ocfs2_commit_trans(osb, handle);
> if (ret < 0) {
> mlog_errno(ret);
> - break;
> }
> }
>
> if (end > i_size_read(inode)) {
> + handle = ocfs2_start_trans(osb, credits);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + mlog_errno(ret);
> + goto unlock;
> + }
> ret = ocfs2_set_inode_size(handle, inode, di_bh, end);
> if (ret < 0)
> mlog_errno(ret);
> + ocfs2_commit_trans(osb, handle);
> }
> -commit:
> - ocfs2_commit_trans(osb, handle);
> +
> unlock:
> up_write(&oi->ip_alloc_sem);
> +
> + /* everything looks good, let's start the cleanup */
> + if (dwc->dw_orphaned) {
> + BUG_ON(dwc->dw_writer_pid != task_pid_nr(current));
> +
> + ret = ocfs2_del_inode_from_orphan(osb, inode, di_bh, 0, 0);
> + if (ret < 0)
> + mlog_errno(ret);
> + }
> ocfs2_inode_unlock(inode, 1);
> brelse(di_bh);
> out:
> --
> 2.43.0
>
>
> Thanks,
> Heming