Re: [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)

From: Amir Goldstein

Date: Tue Mar 24 2026 - 03:54:19 EST


On Mon, Mar 23, 2026 at 11:03 PM Chenglong Tang
<chenglongtang@xxxxxxxxxx> wrote:
>
> Hi all,

Hi Chenglong,

>
> We are tracking a severe performance regression in Google's
> Container-Optimized OS (COS) that appeared when moving from the 6.6
> LTS kernel to the 6.12 LTS kernel.
>
> Under concurrent CI workloads (specifically, many containers doing
> Python package compilation / .pyc generation simultaneously), the 6.12
> kernel suffers from massive jbd2 journal contention. Processes hang
> for 20-30 seconds waiting for VFS locks and journal space. On 6.6, the
> exact same workload completes in ~4 seconds.
>
> # Environment:
> * Host FS: ext4 (backed by standard cloud block storage)
> * Container FS: OverlayFS (Docker)
> * Machine: n2d-highmem-96 (96 vCPU, high memory)
> * Good Kernel: 6.6.87
> * Bad Kernels: 6.12.55, 6.12.68
>
> # The Bottleneck
> During the 20+ second hang, `cat /proc/<pid>/stack` reveals three
> distinct groups of blocked processes thrashing on the jbd2 journal.
> The OverlayFS copy-up mechanism seems to be generating so many
> synchronous ext4 transactions that it exhausts the jbd2 transaction
> buffers.
>
> 1. Journal Space Exhaustion (Waiting to start transaction):
> [<0>] __jbd2_log_wait_for_space+0xa3/0x240
> [<0>] start_this_handle+0x42d/0x8a0
> [<0>] jbd2__journal_start+0x103/0x1e0
> [<0>] __ext4_journal_start_sb+0x129/0x1c0
> [<0>] __ext4_new_inode+0x7cd/0x1290
> [<0>] ext4_create+0xbc/0x1b0
> [<0>] vfs_create+0x192/0x250
> [<0>] ovl_create_real+0xd5/0x170
> [<0>] ovl_create_or_link+0x1d7/0x7f0
>
> 2. VFS Rename / Copy-up Contention (Blocked by the slow sync):
> [<0>] lock_rename+0x29/0x50
> [<0>] ovl_copy_up_flags+0x84c/0x12e0
> [<0>] ovl_create_object+0x4a/0x120
> [<0>] vfs_mkdir+0x1aa/0x260
> [<0>] do_mkdirat+0xb9/0x240
>
> 3. Synchronous Flush Blocking:
> [<0>] jbd2_log_wait_commit+0x107/0x150
> [<0>] jbd2_journal_force_commit+0x9c/0xc0
> [<0>] ext4_sync_file+0x278/0x310
> [<0>] ovl_sync_file+0x2f/0x50
> [<0>] ovl_copy_up_metadata+0x455/0x4b0
>
> # Minimal Reproducer
> The issue is easily reproducible by triggering 20 concurrent cold
> Python imports in Docker, which forces OverlayFS to copy-up the
> `__pycache__` directories and write the `.pyc` files.
>
> ```bash
> # 1. Build a clean image with no pre-compiled bytecode
> cat << 'EOF' > Dockerfile
> FROM python:3.10-slim
> RUN pip install --quiet google-cloud-compute
> RUN find /usr/local -type d -name "__pycache__" -exec rm -rf {} +
> EOF
> docker build -t clean-import-test .
>
> # 2. Fire 20 concurrent imports
> for i in {1..20}; do
> docker run --rm clean-import-test \
>   bash -c 'time python -c "import google.cloud.compute_v1"' > clean_test_cold_$i.log 2>&1 &
> done
> wait
> grep "real" clean_test_cold_*.log
> ```

I don't understand.

You write that Python imports in Docker force OverlayFS to copy up the
`__pycache__` directories, but the prep stage removes all the
`__pycache__` directories.

My guess would be that the rm -rf __pycache__ would generate a lot of
metadata copy-ups, but you write that the issue occurs during the
2nd stage. Maybe I misunderstood.

Please try to figure out how many copy-up objects this translates to,
and of what kind (directories vs. files)?
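For example, a rough tally can be taken from the container's upperdir,
where every copied-up or newly created object lands. The sketch below uses
a synthetic tree so it runs anywhere; for a live container, point UPPER at
the UpperDir reported by `docker inspect`:

```bash
# Sketch: count objects in an overlay upperdir, split by type.
# UPPER here is a synthetic stand-in; for a real container use
#   UPPER=$(docker inspect -f '{{.GraphDriver.Data.UpperDir}}' <container>)
UPPER=$(mktemp -d)
mkdir -p "$UPPER/usr/local/lib/python3.10/__pycache__"
touch "$UPPER/usr/local/lib/python3.10/__pycache__/mod.cpython-310.pyc"

# -mindepth 1 excludes the upperdir root itself from the directory count
dirs=$(find "$UPPER" -mindepth 1 -type d | wc -l)
files=$(find "$UPPER" -type f | wc -l)
echo "dirs=$dirs files=$files"
```

Taking that count after one run of the reproducer would tell us roughly how
many per-object fsyncs the copy-up path has to pay for.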

>
> On 6.6.87, all 20 containers finish in ~4.3s.
> On 6.12.x, they hang and finish between 17s and 27s. Bypassing the
> bytecode disk writes entirely (with PYTHONDONTWRITEBYTECODE=1) mitigates
> the regression on 6.12, confirming this is an ext4/overlayfs I/O
> contention issue rather than a CPU scheduling one.
>
> Because the regression spans from 6.6 to 6.12, bisection is quite
> heavy. Before we initiate a full kernel bisect, does this symptom ring
> a bell for any ext4 fast_commit, jbd2 locking, or OverlayFS
> metacopy/sync changes introduced during this window?
>
> Any pointers or patches you'd like us to test would be greatly appreciated.
>

Very high suspect:

7d6899fb69d25 ovl: fsync after metadata copy-up

As you can see from this discussion [1] this performance regression
was somewhat anticipated:

"Now we just need to hope that users won't come shouting about
performance regressions."

[1] https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@xxxxxxxxxxxxxx/

With metacopy disabled, this change introduced fsyncs on metadata-only
changes made by overlayfs, which could generate a lot of journal stress
and explain the regression.

But we had not anticipated that workloads could be affected with
metacopy disabled, because we expected the data fsync to be the
more significant bottleneck.

Do your containers have metacopy enabled?
If not, why not? Is it because metacopy conflicts with some other
overlayfs feature that you need, like userxattr?
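(For reference, when overlay is built as a module, metacopy can be
defaulted on without touching the Docker configuration, via the standard
modprobe.d convention; adjust the path to your image build:)

```
# /etc/modprobe.d/overlay.conf
options overlay metacopy=Y
```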

Thinking out loud, I wonder if the metadata copy-up code would benefit from
calling export_ops->commit_metadata() when supported by the upper fs,
instead of open+vfs_fsync(), but I doubt that would relieve the journal
stress in this case.

Anyway, please see if forcing metadata_fsync off (test patch below)
solves the regression, and if so I will stage the original patch from Fei
to make metadata_fsync opt-in.

Thanks,
Amir.

--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -1154,7 +1154,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
	 * that will hurt performance of workloads such as chown -R, so we
	 * only fsync on data copyup as legacy behavior.
	 */
-	ctx.metadata_fsync = !OVL_FS(dentry->d_sb)->config.metacopy &&
+	ctx.metadata_fsync = 0 && !OVL_FS(dentry->d_sb)->config.metacopy &&
			     (S_ISREG(ctx.stat.mode) || S_ISDIR(ctx.stat.mode));
	ctx.metacopy = ovl_need_meta_copy_up(dentry, ctx.stat.mode, flags);