[RFC PATCH 0/6] Add support for shared PTEs across processes

From: Khalid Aziz
Date: Tue Jan 18 2022 - 16:21:45 EST


Page tables in kernel consume some of the memory and as long as
number of mappings being maintained is small enough, this space
consumed by page tables is not objectionable. When very few memory
pages are shared between processes, the number of page table entries
(PTEs) to maintain is mostly constrained by the number of pages of
memory on the system. As the number of shared pages and the number
of times pages are shared goes up, amount of memory consumed by page
tables starts to become significant.

Some of the field deployments commonly see memory pages shared
across 1000s of processes. On x86_64, each page requires a PTE that
is only 8 bytes long which is very small compared to the 4K page
size. When 2000 processes map the same page in their address space,
each one of them requires 8 bytes for its PTE and together that adds
up to 8K of memory just to hold the PTEs for one 4K page. On a
database server with 300GB SGA, a system carsh was seen with
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, in the
worst case scenario of all 1500 processes mapping every page from
SGA would have required 878GB+ for just the PTEs. If these PTEs
could be shared, amount of memory saved is very significant.

This is a proposal to implement a mechanism in kernel to allow
userspace processes to opt into sharing PTEs. The proposal is to add
a new system call - mshare(), which can be used by a process to
create a region (we will call it mshare'd region) which can be used
by other processes to map same pages using shared PTEs. Other
process(es), assuming they have the right permissions, can then make
the mashare() system call to map the shared pages into their address
space using the shared PTEs. When a process is done using this
mshare'd region, it makes a mshare_unlink() system call to end its
access. When the last process accessing mshare'd region calls
mshare_unlink(), the mshare'd region is torn down and memory used by
it is freed.


API Proposal
============

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing mshare'd
region that will be shared at PTE level. "name" refers to shared object
name that exists under /sys/fs/mshare. "addr" is the starting address
of this shared memory area and length is the size of this area.
oflags can be one of:

- O_RDONLY opens shared memory area for read only access by everyone
- O_RDWR opens shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.

PTEs are shared at pgdir level and hence it imposes following
requirements on the address and size given to the mshare():

- Starting address must be aligned to pgdir size (512GB on x86_64)
- Size must be a multiple of pgdir size
- Any mappings created in this address range at any time become
shared automatically
- Shared address range can have unmapped addresses in it. Any access
to unmapped address will result in SIGBUS

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in
the shared area. All other processes that want mshare'd access to
this memory area can do so by calling mshare(). After this call, the
address range given by mshare becomes a shared range in its address
space. Anonymous mappings will be shared and not COWed.

A file under /sys/fs/mshare can be opened and read from. A read from
this file returns two long values - (1) starting address, and (2)
size of the mshare'd region.

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink() which removes the shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

mshare_unlink() returns 0 on success or -1 on error.


Example Code
============

Snippet of the code that a donor process would run looks like below:

-----------------
addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, 0, 0);
if (addr == MAP_FAILED)
perror("ERROR: mmap failed");

err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
if (err < 0) {
perror("mshare() syscall failed");
exit(1);
}

strncpy(addr, "Some random shared text",
sizeof("Some random shared text"));
-----------------

Snippet of code that a consumer process would execute looks like:

-----------------
fd = open("testregion", O_RDONLY);
if (fd < 0) {
perror("open failed");
exit(1);
}

if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
printf("INFO: %ld bytes shared at addr %lx \n",
mshare_info[1], mshare_info[0]);
else
perror("read failed");

close(fd);

addr = (char *)mshare_info[0];
err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
mshare_info[1], O_RDWR, 600);
if (err < 0) {
perror("mshare() syscall failed");
exit(1);
}

printf("Guest mmap at %px:\n", addr);
printf("%s\n", addr);
printf("\nDone\n");

err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
if (err < 0) {
perror("mshare_unlink() failed");
exit(1);
}
-----------------


RFC Proposal
============

This series of RFC patches is a prototype implementation of these
two system calls. This code is just a prototype with many bugs and
limitations and is incomplete. It works well enough to show how
this concept will work.

Prototype for the two syscalls is:

SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
unsigned long, len, int, oflag, mode_t, mode)

SYSCALL_DEFINE1(mshare_unlink, const char *, name)

In oreder to facilitate page table sharing, this implemntation adds
a new in-memory filesystem - msharefs which will be mounted at
/sys/fs/mshare. When a new mshare'd region is
created, a file with the name given by initial mshare() call is
created under this filesystem. Permissions for this file are given
by the "mode" parameter to mshare(). The only operation supported on
this file is read. A read from this file returns two long values -
(1) starting virtual address for the region, and (2) size of
mshare'd region.

A donor process that wants to create an mshare'd region from a
section of its mapped addresses calls mshare() with O_CREAT oflag.
mshare() syscall then creates a new mm_struct which will host the
page tables for the mshare'd region. vma->vm_mm for the vmas
covering address range for this region are updated to point to this
new mm_struct instead of current->mm. Existing page tables are
copied over to new mm struct.

A consumer process that wants to map mshare'd region opens
/sys/fs/mshare/<filename> and reads the starting address and size of
mshare'd region. It then calls mshare() with these values to map the
entire region in its address space. Consumer process calls
mshare_unlink() to terminate its access.

I would very much appreciate feedback on - (1) the API and how it
defines the PTE sharing concept, and (2) core of the initial
implementation. The file mm/mshare.c contains a list of current
issues/bugs/holes in the code at this time at the top of the file.
If the general direction of this work looks reasonable, I will
continue working on adding the missing code including much of error
handling and fixing bugs.


Khalid Aziz (6):
mm: Add new system calls mshare, mshare_unlink
mm: Add msharefs filesystem
mm: Add read for msharefs
mm: implement mshare_unlink syscall
mm: Add locking to msharefs syscalls
mm: Add basic page table sharing using mshare

Documentation/filesystems/msharefs.rst | 19 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/mm.h | 8 +
include/trace/events/mmflags.h | 3 +-
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
mm/Makefile | 2 +-
mm/internal.h | 7 +
mm/memory.c | 35 +-
mm/mshare.c | 604 +++++++++++++++++++++++++
10 files changed, 682 insertions(+), 6 deletions(-)
create mode 100644 Documentation/filesystems/msharefs.rst
create mode 100644 mm/mshare.c

--
2.32.0