Re: [PATCH 2/2] mm: adds NOSIGBUS extension for out-of-band shmem read

From: Ming Lin
Date: Wed Jun 02 2021 - 20:05:50 EST


On 6/1/2021 8:49 PM, Hugh Dickins wrote:

index 096bba4..69cd856 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1419,6 +1419,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (!len)
return -EINVAL;
+ if ((flags & MAP_NOSIGBUS) && ((prot & PROT_WRITE) || !shmem_file(file)))
+ return -EINVAL;
+

No, for several reasons.

This has nothing to do with shmem really, that's just where this patch
hacks it in - and where you have a first user in mind. If this goes
forward, please modify mm/memory.c not mm/shmem.c, so that a
VM_FAULT_SIGBUS result on a fault in a VM_NOSIGBUS vma maps the zero
page instead.
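
For reference, the baseline behaviour under discussion can be reproduced
from userspace. The sketch below is only an illustration and is not part
of the patch: it maps two pages of a one-page temporary file and reads the
second page, which lies wholly beyond EOF. On Linux that read is expected
to raise SIGBUS. The helper name and the /tmp path are made up for the
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

/* Jump back to the caller instead of re-executing the faulting load. */
static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_env, 1);
}

/*
 * Hypothetical helper: returns 1 if reading a mapped page that lies wholly
 * beyond EOF raises SIGBUS, 0 if the read succeeds, -1 on setup failure.
 */
int read_beyond_eof_raises_sigbus(void)
{
	char path[] = "/tmp/nosigbus-eof-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	volatile int result = -1;
	size_t len;
	char *map;
	int fd;

	assert(page > 0);
	len = 2 * (size_t)page;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) != 0)		/* file is exactly one page */
		goto out_close;

	map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		goto out_unmap;

	if (sigsetjmp(sigbus_env, 1) == 0) {
		/* Second page has no backing file data: expect a fault. */
		volatile char c = *(volatile char *)(map + page);
		(void)c;
		result = 0;
	} else {
		result = 1;
	}
	sigaction(SIGBUS, &old, NULL);
out_unmap:
	munmap(map, len);
out_close:
	close(fd);
	return result;
}
```

A MAP_NOSIGBUS mapping as proposed would be expected to make the same read
return zeroes instead of faulting.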

(prot & PROT_WRITE) tells you about the protection requested at mmap()
time, but says nothing about what mprotect() could do later on. Look out
for VM_SHARED and VM_MAYSHARE and VM_MAYWRITE further down; and beware
the else (!file) block below them: shared anonymous would need more
protection too.
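
The mprotect() point can be checked from userspace. The following is a
minimal sketch, not part of the patch: it creates a MAP_SHARED mapping
with PROT_READ only, then asks mprotect() for PROT_WRITE. Because the
file descriptor was opened read-write, the upgrade is expected to succeed,
so a check of prot at mmap() time alone does not keep the mapping
read-only. The helper name and the /tmp path are assumptions for the
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a mapping created without PROT_WRITE
 * can be made writable afterwards, 0 if mprotect() refuses, -1 on setup
 * failure.
 */
int readonly_mapping_can_become_writable(void)
{
	char path[] = "/tmp/nosigbus-prot-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	int result = -1;
	char *map;
	int fd;

	assert(page > 0);

	fd = mkstemp(path);			/* mkstemp() opens O_RDWR */
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) != 0)
		goto out_close;

	/* Only PROT_READ is requested at mmap() time. */
	map = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	if (mprotect(map, (size_t)page, PROT_READ | PROT_WRITE) == 0) {
		map[0] = 'x';			/* the mapping is now writable */
		result = 1;
	} else {
		result = 0;
	}
	munmap(map, (size_t)page);
out_close:
	close(fd);
	return result;
}
```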

Constructive comment: I guess much of my objection to this feature
comes from allowing it in the MAP_SHARED case. If you restrict it
to MAP_PRIVATE mapping of file, then it's less objectionable, and
you won't have to worry (so much?) about write protection. Copy
on write is normal there, and it's well established that subsequent
changes in the file will not be shared; you'd just be extending that
behaviour from writes to sigbusy reads.

And by restricting to MAP_PRIVATE, you would allow for adding a
proper MAP_SHARED implementation later, if it's thought useful
(that being the implementation which can subsequently unmap a
zero page to let new page cache be mapped).
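
The MAP_PRIVATE behaviour referred to above (copy on write, with later
file changes not shared) can be illustrated with a small userspace
sketch. It is an illustration only, with a made-up helper name and /tmp
path: a private mapping is written once to trigger the copy, the file is
then changed through the descriptor, and the mapping is expected to keep
showing its own copy while the file never sees the private write.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a copied-on-write private page ignores
 * a later write to the file and the file ignores the private write,
 * 0 otherwise, -1 on setup failure.
 */
int private_copy_is_detached_from_file(void)
{
	char path[] = "/tmp/nosigbus-cow-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	int result = -1;
	char byte = 0;
	char *map;
	int fd;

	assert(page > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) != 0 || pwrite(fd, "A", 1, 0) != 1)
		goto out_close;

	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	map[1] = 'P';			/* write fault: page is copied privately */
	if (pwrite(fd, "B", 1, 0) != 1)	/* later change to the file itself */
		goto out_unmap;
	if (pread(fd, &byte, 1, 1) != 1)
		goto out_unmap;

	/* Mapping still shows the old 'A'; the file never saw the 'P'. */
	result = (map[0] == 'A' && byte != 'P');
out_unmap:
	munmap(map, (size_t)page);
out_close:
	close(fd);
	return result;
}
```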

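This is what I have written so far: the fault handling is moved to
mm/memory.c, and MAP_NOSIGBUS is restricted to MAP_PRIVATE mappings.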

---
include/linux/mm.h | 2 ++
include/linux/mman.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/memory.c | 12 ++++++++++++
mm/mmap.c | 4 ++++
5 files changed, 20 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9d67bc..af9e277 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -373,6 +373,8 @@ int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
# define VM_UFFD_MINOR VM_NONE
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#define VM_NOSIGBUS VM_FLAGS_BIT(38) /* Do not SIGBUS on fault */
+
/* Bits set in the VMA until the stack is in its final location */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
diff --git a/include/linux/mman.h b/include/linux/mman.h
index b2cbae9..c966b08 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -154,6 +154,7 @@ static inline bool arch_validate_flags(unsigned long flags)
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
+ _calc_vm_trans(flags, MAP_NOSIGBUS, VM_NOSIGBUS ) |
arch_calc_vm_flag_bits(flags);
}
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d..a2a5333 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@
#define MAP_HUGETLB 0x040000 /* create a huge page mapping */
#define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */
#define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_NOSIGBUS 0x200000 /* do not SIGBUS on fault */
#define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/mm/memory.c b/mm/memory.c
index eff2a47..7195dac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3676,6 +3676,18 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
}
ret = vma->vm_ops->fault(vmf);
+ if (unlikely(ret & VM_FAULT_SIGBUS) && (vma->vm_flags & VM_NOSIGBUS)) {
+ /*
+ * Get zero page for MAP_NOSIGBUS mapping, which isn't
+ * coherent wrt shmem contents that are expanded and
+ * filled in later.
+ */
+ vma->vm_flags |= VM_MIXEDMAP;
+ if (!vm_insert_page(vma, (unsigned long)vmf->address,
+ ZERO_PAGE(vmf->address)))
+ return VM_FAULT_NOPAGE;
+ }
+
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))
return ret;
diff --git a/mm/mmap.c b/mm/mmap.c
index 096bba4..74fb49a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1419,6 +1419,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (!len)
return -EINVAL;
+ /* Restrict MAP_NOSIGBUS to MAP_PRIVATE mapping */
+ if ((flags & MAP_NOSIGBUS) && !(flags & MAP_PRIVATE))
+ return -EINVAL;
+
/*
* Does the application expect PROT_READ to imply PROT_EXEC?
*


diff --git a/mm/shmem.c b/mm/shmem.c
index 5d46611..5d15b08 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1812,7 +1812,22 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
repeat:
if (sgp <= SGP_CACHE &&
((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
- return -EINVAL;
+ if (!vma || !(vma->vm_flags & VM_NOSIGBUS))
+ return -EINVAL;
+
+ vma->vm_flags |= VM_MIXEDMAP;

No. Presumably you hit the BUG_ON(mmap_read_trylock(vma->vm_mm))
in vm_insert_page(), so decided to modify the vm_flags here: no,
that BUG is saying you need mmap_write_lock() to write vm_flags.

But the comment above vm_insert_page() told me to set VM_MIXEDMAP on the vma:

* Usually this function is called from f_op->mmap() handler
* under mm->mmap_lock write-lock, so it can change vma->vm_flags.
* Caller must set VM_MIXEDMAP on vma if it wants to call this
* function from other places, for example from page-fault handler.


One other thing while it crosses my mind. You'll need to decide
what truncating or hole-punching the file does to the zero pages
in its userspace mappings. I may turn out wrong, but I think you'll
find that truncation removes them, but hole-punch leaves them, and
ought to be modified to remove them too (it's a matter of how the
"even_cows" arg to unmap_mapping_range() is treated).
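
As a point of comparison for ordinary copied-on-write pages (not the zero
pages inserted by this patch), the truncation half of that expectation
can be sketched from userspace. This is an illustration under
assumptions, with a made-up helper name and /tmp path: a private page is
copied by a write, the file is truncated to zero length, and the page is
read again. On Linux the read is expected to raise SIGBUS because
truncation unmaps even the private copies. Hole-punch is not covered
here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf trunc_env;

/* Jump back to the caller instead of re-executing the faulting load. */
static void on_trunc_sigbus(int sig)
{
	(void)sig;
	siglongjmp(trunc_env, 1);
}

/*
 * Hypothetical helper: returns 1 if a copied-on-write private page is gone
 * after truncation (read raises SIGBUS), 0 if it survived, -1 on setup
 * failure.
 */
int truncate_removes_cowed_private_page(void)
{
	char path[] = "/tmp/nosigbus-trunc-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	volatile int result = -1;
	char *map;
	int fd;

	assert(page > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) != 0)
		goto out_close;

	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	map[0] = 'P';			/* write fault: private copy is made */
	if (ftruncate(fd, 0) != 0)	/* page now lies beyond EOF */
		goto out_unmap;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_trunc_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		goto out_unmap;

	if (sigsetjmp(trunc_env, 1) == 0) {
		volatile char c = *(volatile char *)map;
		(void)c;
		result = 0;		/* private copy survived truncation */
	} else {
		result = 1;		/* private copy was unmapped */
	}
	sigaction(SIGBUS, &old, NULL);
out_unmap:
	munmap(map, (size_t)page);
out_close:
	close(fd);
	return result;
}
```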

I did a quick test: after inserting zero pages, truncation also seems
to leave the mappings in place.

I'm still reading the code to understand this part ...