The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:
- It shouldn't be written to core dumps.
* Easy: VM_DONTDUMP.
- It should be zeroed on fork.
* Easy: VM_WIPEONFORK.
- It shouldn't be written to swap.
* Uh-oh: mlock is rlimited.
* Uh-oh: mlock isn't inherited by forks.
It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:
1) Due to being wiped during fork(), the vDSO code is already robust to
having the contents of the pages it reads zeroed out midway through
the function's execution.
2) In the absolute worst case of whatever contingency we're coding for,
we have the option to fallback to the getrandom() syscall, and
everything is fine.
3) The buffers the function uses are only ever useful for a maximum of
60 seconds -- a sort of cache, rather than a long term allocation.
These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:
a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
zero when read back again).
c) It is inherited by fork.
d) It doesn't count against the mlock budget, since nothing is locked.
This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.
This way, allocations used by vDSO getrandom() can use:
VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
And there will be no problem with using memory when not in use, not
wiping on fork(), coredumps, or writing out to swap.
In order to let vDSO getrandom() use this, expose these via mmap(2) as
well, giving MAP_WIPEONFORK, MAP_DONTDUMP, and MAP_DROPPABLE.
Finally, the provided self test ensures that this is working as desired.
Cc: linux-mm@xxxxxxxxx
Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
---
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8c6cd8825273..57b8dad9adcc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
may_expand_vm(mm, oldflags, nrpages))
return -ENOMEM;
if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
- VM_SHARED|VM_NORESERVE))) {
+ VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
charged = nrpages;
if (security_vm_enough_memory_mm(mm, charged))
return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..56d7535d5cf6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1397,7 +1397,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
VM_BUG_ON_VMA(address < vma->vm_start ||
address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
- __folio_set_swapbacked(folio);
+ /* VM_DROPPABLE mappings don't swap; instead they're just dropped when
+ * under memory pressure. */
+ if (!(vma->vm_flags & VM_DROPPABLE))
+ __folio_set_swapbacked(folio);
__folio_set_anon(folio, vma, address, true);
if (likely(!folio_test_large(folio))) {
@@ -1841,7 +1844,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* plus the rmap(s) (dropped by discard:).
*/
if (ref_count == 1 + map_count &&
- !folio_test_dirty(folio)) {
+ (!folio_test_dirty(folio) ||
+ /* Unlike MADV_FREE mappings, VM_DROPPABLE
+ * ones can be dropped even if they've
+ * been dirtied. */
+ (vma->vm_flags & VM_DROPPABLE))) {
dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}
@@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* discarded. Remap the page to page table.
*/
set_pte_at(mm, address, pvmw.pte, pteval);
- folio_set_swapbacked(folio);
+ /* Unlike MADV_FREE mappings, VM_DROPPABLE ones
+ * never get swap backed on failure to drop. */
+ if (!(vma->vm_flags & VM_DROPPABLE))
+ folio_set_swapbacked(folio);
ret = false;
page_vma_mapped_walk_done(&pvmw);
break;