Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
following semantics:
1. MADV_POPULATE_READ can be used to preallocate backend memory and
prefault page tables just like manually reading each individual page.
This will not break any COW mappings -- e.g., it will populate the
shared zeropage when applicable.
Please clarify what is meant by "backend memory". As far as I can tell
from looking at the code, MADV_POPULATE_READ on file mappings will
allocate zeroed memory in the page cache, and map it as readonly pages
into userspace, but any attempt to actually write to that memory will
trigger the filesystem's ->page_mkwrite handler; and e.g. ext4 will
only try to allocate disk blocks at that point, which may fail. So as
far as I can tell, for files on filesystems like ext4, the current
implementation of MADV_POPULATE_READ does not replace fallocate(). Am
I missing something?
If the desired semantics are that disk blocks should be preallocated,
I think you may have to look up the ->vm_file and then internally call
vfs_fallocate() to address this, kinda like in madvise_remove()?
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
(prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
prefault page tables just like manually writing (or
reading+writing) each individual page. This will break any COW
mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
(prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
mappings marked with VM_PFNMAP and VM_IO. Also, proper access
permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
might have been populated. In that case, madvise() fails with
-ENOMEM.
AFAICS that's not true (or misphrased). If MADV_POPULATE_*
successfully populates a bunch of pages, then fails because of an
error (e.g. EHWPOISON), it will return EHWPOISON, not ENOMEM, right?
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
cannot protect from the OOM (Out Of Memory) handler killing the
process.
While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), there are valid use
cases for MADV_POPULATE_READ:
1. Efficiently populate page tables with zero pages (i.e., shared
zeropage). This is necessary when using userfaultfd() WP (Write-Protect
to properly catch all modifications within a mapping: for
write-protection to be effective for a virtual address, there has to be
a page already mapped -- even if it's the shared zeropage.
This sounds like a hack to work around issues that would be better
addressed by improving userfaultfd?
2. Pre-read a whole mapping from backend storage without marking it
dirty, such that eviction won't have to write it back. If no backend
memory has been allocated yet, allocate the backend memory. Helpful
when preallocating/prefaulting a file stored on disk without having
to writeback each and every page on eviction.
This sounds reasonable to me.
Although sparse memory mappings are the primary use case, this will[...]
also be useful for ordinary preallocations where MAP_POPULATE is not
desired especially in QEMU, where users can trigger preallocation of
guest RAM after the mapping was created.
Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back than was performance improvements
(which should also still be the case, but it is a secondary concern).
V. Single-threaded performance comparison
There is a performance benefit when using POPULATE_READ / POPULATE_WRITE
already when only using a single thread to do prefaulting/preallocation. As
we have less pagefaults for huge pages, the performance benefit is
negligible with small mappings.
diff --git a/mm/gup.c b/mm/gup.c[...]
+long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, bool write, int *locked)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long nr_pages = (end - start) / PAGE_SIZE;
+ int gup_flags;
+
+ VM_BUG_ON(!PAGE_ALIGNED(start));
+ VM_BUG_ON(!PAGE_ALIGNED(end));
+ VM_BUG_ON_VMA(start < vma->vm_start, vma);
+ VM_BUG_ON_VMA(end > vma->vm_end, vma);
+ mmap_assert_locked(mm);
+
+ /*
+ * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
+ * a poisoned page.
+ * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
+ * !FOLL_FORCE: Require proper access permissions.
+ */
+ gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+ if (write)
+ gup_flags |= FOLL_WRITE;
+
+ /*
+ * See check_vma_flags(): Will return -EFAULT on incompatible mappings
+ * or with insufficient permissions.
+ */
+ return __get_user_pages(mm, start, nr_pages, gup_flags,
+ NULL, NULL, locked);
You mentioned in the commit message that you don't want to actually
dirty all the file pages and force writeback; but doesn't
POPULATE_WRITE still do exactly that? In follow_page_pte(), if
FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
set_page_dirty(page);
/*
* pte_mkyoung() would be more correct here, but atomic care
* is needed to avoid losing the dirty bit: it is easier to use
* mark_page_accessed().
*/
mark_page_accessed(page);
}
+}
+
/*
* __mm_populate - populate and/or mlock pages within a range of address space.
*
diff --git a/mm/internal.h b/mm/internal.h
index 3f22c4ceb7b5..ee398696380f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -335,6 +335,9 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma);
#ifdef CONFIG_MMU
extern long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *locked);
+extern long faultin_vma_page_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ bool write, int *locked);
extern void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
diff --git a/mm/madvise.c b/mm/madvise.c
index 01fef79ac761..857460873f7a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -53,6 +53,8 @@ static int madvise_need_mmap_write(int behavior)
case MADV_COLD:
case MADV_PAGEOUT:
case MADV_FREE:
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -822,6 +824,64 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return -EINVAL;
}
+static long madvise_populate(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end,
+ int behavior)
+{
+ const bool write = behavior == MADV_POPULATE_WRITE;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long tmp_end;
+ int locked = 1;
+ long pages;
+
+ *prev = vma;
+
+ while (start < end) {
+ /*
+ * We might have temporarily dropped the lock. For example,
+ * our VMA might have been split.
+ */
+ if (!vma || start >= vma->vm_end) {
+ vma = find_vma(mm, start);
+ if (!vma || start < vma->vm_start)
+ return -ENOMEM;
+ }
+
+ tmp_end = min_t(unsigned long, end, vma->vm_end);
+ /* Populate (prefault) page tables readable/writable. */
+ pages = faultin_vma_page_range(vma, start, tmp_end, write,
+ &locked);
+ if (!locked) {
+ mmap_read_lock(mm);
+ locked = 1;
+ *prev = NULL;
+ vma = NULL;
+ }
+ if (pages < 0) {
+ switch (pages) {
+ case -EINTR:
+ return -EINTR;
+ case -EFAULT: /* Incompatible mappings / permissions. */
+ return -EINVAL;
+ case -EHWPOISON:
+ return -EHWPOISON;
+ case -EBUSY:
What is -EBUSY doing here? __get_user_pages() fixes up -EBUSY from
faultin_page() to 0, right?
+ case -EAGAIN:
Where can -EAGAIN come from?