Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
From: David Hildenbrand
Date: Fri Oct 24 2025 - 03:47:37 EST
On 24.10.25 08:50, Dave Chinner wrote:
On Thu, Oct 23, 2025 at 09:48:58AM -0600, Andreas Dilger wrote:
On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
In critical paths like truncate, correctness and safety come first.
Performance is only a secondary consideration. The overlap of
mmap() and truncate() is an area where we have had many, many bugs
and, at minimum, the current POSIX behaviour largely shields us from
serious stale data exposure events when those bugs (inevitably)
occur.
How do you prevent writes via GUP racing with truncate()?
Something like this:
        CPU0                            CPU1
fd = open("file")
p = mmap(fd)
whatever_syscall(p)
get_user_pages(p, &page)
                                        truncate("file");
<write to page>
put_page(page);
Forget about truncate, go look at the comment above
writable_file_mapping_allowed() about using GUP this way.
i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
spent the past 15+ years telling people that it is unfixably broken
and they will crash their kernel or corrupt their data if they do
this.
This is not supported functionality because real world production
use ends up exposing problems with sync and background writeback
races, truncate races, fallocate() races, writes into holes, writes
into preallocated regions, writes over shared extents that require
copy-on-write, etc, etc, ad nauseam.
If anyone is using file-backed mappings like this, then when it
breaks they get to keep all the broken pieces to themselves.
Should ftruncate("file") return ETXTBSY in this case, so that users
and applications know this doesn't work/isn't safe?
No, it is better to block waiting for the GUP to release the
reference (see below), but the general problem is that we cannot
reliably discriminate GUP references from other page cache based
references just by looking at the folio resident in the page cache.
Right. folio_maybe_dma_pinned() can have false positives for small
folios, but also temporarily for large folios (speculative pins from
GUP-fast).
In the future it might get more reliable at least for small folios when
we are able to have a dedicated pincount.
(there is still the issue that some mechanisms that should be using
pin_user_pages() are still using get_user_pages())
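
To illustrate the small-folio false positive mentioned above, here is a toy userspace model, not kernel code. It assumes the scheme where a pin on a small folio is encoded by adding a bias of 1024 (the value of GUP_PIN_COUNTING_BIAS) to the ordinary refcount, while large folios keep a separate pin counter. The toy_* names and the struct are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the kernel's GUP_PIN_COUNTING_BIAS; the rest is a toy model. */
#define TOY_PIN_BIAS (1u << 10)

/* Hypothetical stand-in for a folio: only the counters that matter here. */
struct toy_folio {
	bool large;            /* large folios carry a dedicated pin counter */
	unsigned int refcount; /* all references, pins included */
	unsigned int pincount; /* only meaningful when large is true */
};

/* Ordinary reference, e.g. what get_user_pages() takes. */
static void toy_get(struct toy_folio *f)
{
	f->refcount++;
}

/* Drop an ordinary reference. */
static void toy_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Pin, e.g. what pin_user_pages() takes. */
static void toy_pin(struct toy_folio *f)
{
	if (f->large) {
		f->refcount++;
		f->pincount++;
	} else {
		/* Small folio: the pin is folded into the refcount. */
		f->refcount += TOY_PIN_BIAS;
	}
}

/* "Maybe": for small folios this is only a guess based on the refcount. */
static bool toy_maybe_dma_pinned(const struct toy_folio *f)
{
	if (f->large)
		return f->pincount > 0;
	return f->refcount >= TOY_PIN_BIAS;
}
```

In this model, 1024 plain toy_get() references on a small folio are indistinguishable from a single toy_pin(), which is the false positive. The temporary false positive on large folios from speculative GUP-fast pins is not modelled here.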
However, when FSDAX is being used, truncate does, in fact, block
waiting for GUP references to be released. fsdax does not use page
references to track in use pages - the filesystem metadata tracks
allocated and free pages, not the mm/ subsystem. There are no
page cache references to the pages, because there is no page
cache. Hence we can use the difference between the map count and the
reference count to determine if there are any references we cannot
forcibly unmap (e.g. GUP) just by looking at the backing store folio
state.
We can do the same for other folios as well. See folio_expected_ref_count().
Unexpected references can be from GUP, lru caches or other temporary
ones from page migration etc.
As we document for folio_expected_ref_count() it's racy for mapped
folios, though: "Calling this function on a mapped folio will not result
in a stable result, because nothing stops additional page table mappings
from coming (e.g., fork()) or going (e.g., munmap())."
It only works reliably on unmapped folios when holding the folio lock.
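
As a rough sketch of the "expected vs. actual references" comparison, here is a toy userspace model for a file-backed folio. It assumes the expected count is one reference per page from the page cache, one for private data, and one per page table mapping; the real helper covers more cases (anon/swapcache, for example). The struct, the toy_* functions and the caller_refs parameter are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of a file-backed folio's state. */
struct toy_file_folio {
	unsigned int order; /* folio spans 1 << order pages */
	bool in_pagecache;  /* false for the fsdax-like case */
	bool has_private;   /* e.g. attached buffer heads */
	int mapcount;       /* number of page table mappings */
	int refcount;       /* total references observed */
};

/* References that can be explained by page cache, private data, mappings. */
static int toy_expected_refs(const struct toy_file_folio *f)
{
	int expected = 0;

	if (f->in_pagecache)
		expected += 1 << f->order; /* one per page */
	if (f->has_private)
		expected += 1;
	return expected + f->mapcount;     /* one per mapping */
}

/*
 * Anything left over is "unexpected": GUP, LRU caches, migration, ...
 * caller_refs is how many references the caller itself holds.
 * The snapshot is only meaningful if mapcount cannot change meanwhile,
 * i.e. the folio is unmapped and locked in the real kernel.
 */
static int toy_unexpected_refs(const struct toy_file_folio *f, int caller_refs)
{
	assert(caller_refs >= 0);
	return f->refcount - toy_expected_refs(f) - caller_refs;
}
```

For example, an order-2 page cache folio with three mappings, one caller reference and one GUP reference has refcount 9 and one unexpected reference. In the fsdax-like case without a page cache, a folio with two mappings and refcount 3 also has one unexpected reference. The model cannot tell a GUP reference from any other temporary one, which matches the limitation described above.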
--
Cheers
David / dhildenb