Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache

From: Gregg Leventhal

Date: Tue Jun 30 2026 - 13:19:37 EST


Also, to be explicit and save some potentially wasted time: this does
not reproduce on XFS. Point the reproducer at btrfs or ext4 (other
filesystems may also be susceptible, but those are the only two that
Eric and I have personally confirmed).
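
If it helps avoid a wasted run, here is a minimal sketch (not part of the
attached reproducer) of a pre-flight check on the target directory using
statfs(2). The helper names fs_label(), is_confirmed_affected() and
check_target_dir() are made up for illustration, and the magic values are
the ones I believe <linux/magic.h> publishes, repeated locally so the
snippet stands alone. "Confirmed" only means "one of the two filesystems
named above"; ext2/ext3 share the ext4 magic, so this cannot tell them apart.

```c
#include <stdio.h>
#include <sys/vfs.h>

/* Assumed to match <linux/magic.h>; duplicated so the sketch is standalone. */
#define MAGIC_XFS   0x58465342UL
#define MAGIC_BTRFS 0x9123683EUL
#define MAGIC_EXT   0xEF53UL	/* shared by ext2, ext3 and ext4 */

/* Hypothetical helper: short label for a statfs f_type value. */
static const char *fs_label(unsigned long magic)
{
	switch (magic) {
	case MAGIC_XFS:
		return "xfs";
	case MAGIC_BTRFS:
		return "btrfs";
	case MAGIC_EXT:
		return "ext2/3/4";
	default:
		return "other";
	}
}

/* 1 if the filesystem is one we confirmed in this thread, else 0. */
static int is_confirmed_affected(unsigned long magic)
{
	return magic == MAGIC_BTRFS || magic == MAGIC_EXT;
}

/* Returns 1 or 0 as above, or -1 if statfs() fails. */
static int check_target_dir(const char *dir)
{
	struct statfs sfs;
	unsigned long magic;

	if (statfs(dir, &sfs) != 0) {
		perror("statfs");
		return -1;
	}

	/* f_type may be a signed type; keep only the low 32 bits. */
	magic = (unsigned long)sfs.f_type & 0xFFFFFFFFUL;
	printf("%s: %s (%s)\n", dir, fs_label(magic),
	       is_confirmed_affected(magic) ? "confirmed affected"
					    : "not confirmed");
	return is_confirmed_affected(magic);
}
```

A result of 0 does not mean the filesystem is safe, only that we have not
tested it.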


On Tue, Jun 30, 2026 at 1:01 PM Gregg Leventhal
<gleventhal@xxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> We (Gregg Leventhal <gleventhal@xxxxxxxxxxxxxx> and Eric Hagberg
> <ehagberg@xxxxxxxxxxxxxx>) have a reproducible data-loss issue involving file
> THPs and write-open, impacting filesystems that do not support
> writable large folios.
>
>
> Attached are:
>
> - thp_write_open_cancel_dirty_repro.c
> - thp-open-writeback-before-purge.patch
>
>
>
> Summary
> =======
>
> On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
> contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
> same file is later opened for write, do_dentry_open() sees a non-zero
> filemap_nr_thps() and drops the page cache:
>
>
>     /*
>      * XXX: Huge page cache doesn't support writing yet. Drop all page
>      * cache for this file before processing writes.
>      */
>     if (f->f_mode & FMODE_WRITE) {
>         if (filemap_nr_thps(inode->i_mapping)) {
>             struct address_space *mapping = inode->i_mapping;
>
>             filemap_invalidate_lock(inode->i_mapping);
>             unmap_mapping_range(mapping, 0, 0, 0);
>             truncate_inode_pages(mapping, 0);
>             filemap_invalidate_unlock(inode->i_mapping);
>         }
>     }
>
>
> This is unsafe if the mapping also contains dirty folios.
> truncate_inode_pages() is not just a clean cache-dropping primitive. It can
> call truncate_cleanup_folio(), which calls folio_cancel_dirty().
>
>
> In the attached reproducer, dirty appended data is discarded and later read(2)s
> return zeros.
>
>
> We observed this on btrfs and ext4, though most of the testing involved btrfs.
>
>
> The same issue should apply to any filesystem where file THPs can be created
> by READ_ONLY_THP_FOR_FS but writable large folios are not supported. The
> do_dentry_open() block above is also unchanged in current mainline, so this
> does not appear to be strictly 6.12-specific.
>
>
>
> Instrumentation
> ===============
>
> Tracing the failure shows the dirty folios being invalidated from the
> write-open path. INVALIDATE_DIRTY and CANCEL_DIRTY below are labels from our
> own probes:
>
>
>   do_dentry_open / vfs_open
>   truncate_inode_pages_range
>   truncate_cleanup_folio
>   btrfs_invalidate_folio
>   folio_cancel_dirty
>
>
> A representative stack from the failing path:
>
>
>   INVALIDATE_DIRTY ...
>     btrfs_invalidate_folio
>     truncate_cleanup_folio
>     truncate_inode_pages_range
>     vfs_open
>
>   CANCEL_DIRTY ...
>     truncate_cleanup_folio
>     truncate_inode_pages_range
>     vfs_open
>
>
> This confirms that the appended dirty page-cache contents are being discarded
> by the open-time THP cache purge rather than written back.
>
>
>
> Why this happens
> ================
>
> The do_dentry_open() code is trying to handle the fact that some filesystems
> do not support writing to file THPs. The problematic assumption is that
> dropping the page cache is a safe cache-management operation.
>
>
> It is not safe when dirty folios are present, because truncate_inode_pages()
> cancels their dirty state without writeback.
>
>
> Note that the read-only file THPs themselves are clean. The data that is lost
> lives in unrelated dirty folios elsewhere in the same mapping (here, the
> appended tail), which get caught in the blanket
> truncate_inode_pages(mapping, 0) of the entire mapping.
>
>
>
> Suggested fix direction
> =======================
>
> Before dropping THP-bearing page cache on write-open, write back and wait for
> any dirty folios. After writeback completes, the folios are clean, so the
> subsequent truncate_inode_pages() has no dirty state to cancel and the data is
> safe on disk. A later read() simply repopulates the cache from disk. If
> writeback fails, fail the open rather than silently discarding the data.
>
>
> The attached patch does this by adding filemap_write_and_wait(mapping) before
> the unmap_mapping_range() / truncate_inode_pages() sequence.
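
To make the ordering argument concrete, here is a toy user-space model. It
is neither kernel code nor the attached patch: every toy_* name is invented
for illustration, pages are 8 bytes, and "disk" is just an array. It only
shows why truncating with dirty state present loses data, and why flushing
first does not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NPAGES  4
#define TOY_PAGE_SZ 8	/* tiny pages keep the model readable */

struct toy_mapping {
	unsigned char disk[TOY_NPAGES][TOY_PAGE_SZ];	/* backing store */
	unsigned char cache[TOY_NPAGES][TOY_PAGE_SZ];	/* page cache */
	bool cached[TOY_NPAGES];
	bool dirty[TOY_NPAGES];
};

/* Buffered write: data lands in the cache only; the page becomes dirty. */
static void toy_write(struct toy_mapping *m, size_t page, unsigned char byte)
{
	memset(m->cache[page], byte, TOY_PAGE_SZ);
	m->cached[page] = true;
	m->dirty[page] = true;
}

/* Stand-in for filemap_write_and_wait(): flush dirty pages, mark them clean. */
static void toy_write_and_wait(struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		if (m->cached[i] && m->dirty[i]) {
			memcpy(m->disk[i], m->cache[i], TOY_PAGE_SZ);
			m->dirty[i] = false;
		}
	}
}

/* Stand-in for truncate_inode_pages(mapping, 0): drop every page and
 * cancel dirty state without writing anything back. */
static void toy_truncate(struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		m->cached[i] = false;
		m->dirty[i] = false;
	}
}

/* Read: serve from cache if present, otherwise repopulate from disk. */
static unsigned char toy_read(struct toy_mapping *m, size_t page)
{
	if (!m->cached[page]) {
		memcpy(m->cache[page], m->disk[page], TOY_PAGE_SZ);
		m->cached[page] = true;
	}
	return m->cache[page][0];
}

/* Dirty the tail page, run the write-open purge, and read the tail back. */
static unsigned char toy_purge_then_read(bool writeback_first)
{
	struct toy_mapping m;

	memset(&m, 0, sizeof(m));
	toy_write(&m, 3, 0x5c);		/* the dirty appended tail */
	if (writeback_first)
		toy_write_and_wait(&m);	/* proposed ordering */
	toy_truncate(&m);
	return toy_read(&m, 3);
}
```

toy_purge_then_read(false) reads back 0x00, matching the zeros we see from
read(2), while toy_purge_then_read(true) reads back the written byte.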
>
>
> Two caveats we are aware of with this approach:
>
> - filemap_write_and_wait() flushes the entire mapping, so any write-open of
>   a file with filemap_nr_thps() > 0 now forces synchronous writeback. This
>   path already did a full unmap + truncate, so the extra cost is probably
>   acceptable, but it is a behavior change.
>
> - The writeback happens before unmap_mapping_range(). That is sufficient for
>   the reproducer, where the dirty data comes from buffered write(2), so the
>   folios are already marked dirty. We would appreciate guidance on whether
>   unmap should precede the writeback in order to also cover data dirtied
>   only via a writable shared mapping.
>
>
> An alternative would be to replace truncate_inode_pages() with a
> clean-page-only invalidation primitive, but then dirty file THPs / dirty pages
> may remain in the mapping and need careful handling.
>
>
>
> Mitigation
> ==========
>
> As a temporary mitigation, setting khugepaged's scan interval very high
> appears to prevent the issue by effectively stopping background file THP
> collapse:
>
>
>   echo 4294967295 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>
>
> This is not a complete fix. It reduces or disables khugepaged background
> collapse, including file THP collapse, and may reduce THP-related performance
> benefits for workloads that rely on khugepaged promotion. Fault-time anonymous
> THP allocation is not disabled by this knob.
>
>
> Disabling CONFIG_READ_ONLY_THP_FOR_FS also seems to mitigate the issue, but
> both mitigations are suboptimal, performance-impacting trade-offs.
>
>
>
> Reproducer
> ==========
>
> The attached reproducer does the following:
>
>
> 1. Creates a regular file with non-zero data.
> 2. Maps part of the file read-only and uses MADV_COLLAPSE to force a file
>    THP.
> 3. Opens the file for writing and appends non-zero data, leaving it dirty in
>    page cache.
> 4. Closes the write fd.
> 5. Re-collapses a read-only file range so filemap_nr_thps(mapping) is
>    non-zero.
> 6. Opens the file for write again, triggering the do_dentry_open() THP purge.
> 7. Reads back the appended data.
>
>
> Whether any single iteration reproduces is a race against background
> writeback, so let the full iteration count run. A single clean pass does not
> by itself prove the kernel is unaffected.
>
>
> On an affected 6.12 host:
>
>
>   # ./thp_write_open_cancel_dirty_repro Maybe_corrupted_file
>   path=Maybe_corrupted_file base_size=67108864 append_size=16384 iters=200
>   REPRODUCED iter=0 bad_bytes=16384 first_bad=0 zero_count=16384 append_off=67108864
>   first 64 got: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...
>   first 64 want: 5c df 63 e6 6a ed 71 f4 78 fb 7f 03 86 0a 8d 11 ...
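
For readers without the attachment, the three counters in the REPRODUCED
line can be thought of along these lines. This is a sketch with assumed
semantics, not the reproducer's actual code: in particular, zero_count is
taken here to mean "zero bytes in the read-back buffer", and first_bad is
set to the buffer length when nothing mismatches. struct cmp_stats and
compare_readback() are names invented for this example.

```c
#include <stddef.h>

struct cmp_stats {
	size_t bad_bytes;	/* positions where got[i] != want[i] */
	size_t first_bad;	/* offset of first mismatch, or len if none */
	size_t zero_count;	/* zero bytes in got (assumed meaning) */
};

/* Compare the read-back buffer against the expected append pattern. */
static struct cmp_stats compare_readback(const unsigned char *got,
					 const unsigned char *want,
					 size_t len)
{
	struct cmp_stats s = { 0, len, 0 };

	for (size_t i = 0; i < len; i++) {
		if (got[i] == 0)
			s.zero_count++;
		if (got[i] != want[i]) {
			if (s.bad_bytes == 0)
				s.first_bad = i;
			s.bad_bytes++;
		}
	}
	return s;
}
```

With a non-zero expected pattern, bad_bytes == zero_count == append_size
and first_bad == 0 means the whole appended range came back as zeros,
which is what the output above shows.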
>
>
> The corrupted range is visible as a run of null bytes at the append offset:
>
>
>   # rg --text '\x00{64}\x00*' $PWD --only-matching \
>       --byte-offset --no-line-number \
>       | awk -F: '{print $1, $2, length($3)}' | head -n1
>   /root/Maybe_corrupted_file 67108864 16384
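
If rg is not available on a test box, the same check (first run of at least
64 null bytes, reported as offset and length) can be done on a buffer that
has been read from the file. A minimal sketch follows; first_zero_run() is
a name made up for this example, and reading the file into memory is left
to the caller.

```c
#include <stddef.h>

/*
 * Find the first run of at least min_run consecutive zero bytes in buf.
 * On success, stores the run's start offset and full length and returns 1.
 * Returns 0 if no such run exists.
 */
static int first_zero_run(const unsigned char *buf, size_t len,
			  size_t min_run, size_t *off, size_t *run_len)
{
	size_t i = 0;

	while (i < len) {
		size_t start;

		if (buf[i] != 0) {
			i++;
			continue;
		}

		/* Extend the run as far as the zeros go. */
		start = i;
		while (i < len && buf[i] == 0)
			i++;

		if (i - start >= min_run) {
			*off = start;
			*run_len = i - start;
			return 1;
		}
	}
	return 0;
}
```

On the file from the run above, this would be expected to report offset
67108864 and length 16384, provided the data preceding the append offset
does not itself end in zero bytes.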
>
>
>
> We are happy to test any preferred fix direction and can provide
> additional traces as needed.
>
>
> Thanks,
>
> Gregg Leventhal
> Eric Hagberg