David Hildenbrand <david@xxxxxxxxxx> writes:
On 16.08.24 19:45, Ackerley Tng wrote:
<snip>
IIUC folio_lock() isn't a prerequisite for taking a refcount on the
folio.
Right, to do folio_lock() you only have to guarantee that the folio
cannot get freed concurrently. So you piggyback on another reference
(one that you hold indirectly).
Even if we are able to figure out a "safe" refcount, and check that the
current refcount == "safe" refcount before removing from direct map,
what's stopping some other part of the kernel from taking a refcount
just after the check happens and causing trouble with the folio's
removal from direct map?
Once the page was unmapped from user space, and there were no additional
references (e.g., GUP, whatever), any new references can only be
(should, unless BUG :) ) temporary speculative references that should
not try accessing page content, and that should back off if the folio is
not deemed interesting or cannot be locked. (e.g., page
migration/compaction/offlining).
I thought about it again - I think the vmsplice() cases are taken care
of once we check that the folios are not mapped into userspace, since
vmsplice() reads from a mapping.
splice() reads from the fd directly, but that's taken care of, since
guest_memfd doesn't have a .splice_read() handler.
Reading /proc/pid/mem also requires the pages to first be mapped, IIUC,
otherwise the pages won't show up, so checking that there are no more
mappings to userspace takes care of this.
Of course, there are some corner cases (kgdb, hibernation, /proc/kcore),
but most of these can be dealt with in one way or the other (make these
back off and not read/write page content, similar to how we handled it
for secretmem).
Does that really leave us with these corner cases? And so perhaps we
could get away with just taking the folio_lock() to keep away the
speculative references? So something like:
1. Check that the folio is not mapped and not pinned.
2. folio_lock() all the folios about to be removed from direct map
   -- With the lock, all other accesses should be speculative --
3. Check that the refcount == "safe" refcount
   3a. If it doesn't match, unlock and return to userspace with -EAGAIN
4. Remove from direct map
5. folio_unlock() all those folios
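To make the ordering concrete, here is a rough userspace model of the steps above. It is only a sketch: struct model_folio, model_unlock_first() and model_remove_from_direct_map() are hypothetical names, not kernel APIs, and the "safe" refcount is passed in as a parameter because its actual value is still the open question. The real folio_lock() would sleep on contention; the single-threaded model simply treats an already-held lock as busy.

```c
/* Userspace model only; nothing here is a kernel API. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_folio {
	int refcount;        /* all references currently held */
	int mapcount;        /* userspace mappings */
	int pincount;        /* pins, e.g. taken via GUP */
	bool locked;         /* stands in for the folio lock */
	bool in_direct_map;
};

/* Drop the model lock on the first n folios. */
static void model_unlock_first(struct model_folio *folios, size_t n)
{
	for (size_t i = 0; i < n; i++)
		folios[i].locked = false;
}

/* Returns 0 on success, -EAGAIN if any folio is busy. */
static int model_remove_from_direct_map(struct model_folio *folios, size_t n,
					int safe_refs)
{
	size_t i;

	assert(folios != NULL || n == 0);

	/* 1. Check that no folio is mapped or pinned. */
	for (i = 0; i < n; i++)
		if (folios[i].mapcount > 0 || folios[i].pincount > 0)
			return -EAGAIN;

	/* 2. Lock all folios; back out if someone else holds a lock. */
	for (i = 0; i < n; i++) {
		if (folios[i].locked) {
			model_unlock_first(folios, i);
			return -EAGAIN;
		}
		folios[i].locked = true;
	}

	/* 3. and 3a. Compare against the assumed "safe" refcount. */
	for (i = 0; i < n; i++)
		if (folios[i].refcount != safe_refs) {
			model_unlock_first(folios, n);
			return -EAGAIN;
		}

	/* 4. Remove from the (modelled) direct map. */
	for (i = 0; i < n; i++)
		folios[i].in_direct_map = false;

	/* 5. Unlock all folios. */
	model_unlock_first(folios, n);
	return 0;
}
```

The model only shows the check/lock/recheck ordering and the -EAGAIN bail-out paths. It says nothing about whether a reference taken after step 3 can still cause trouble, which is the part that depends on all remaining accessors being speculative and backing off.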
Perhaps a very naive question: can the "safe" refcount be statically
determined by walking through the code and counting where refcount is
expected to be incremented?