[PATCH 0/8] Volatile Ranges (v8?)
From: John Stultz
Date: Wed Jun 12 2013 - 00:23:48 EST
Hey everyone.
I know its been quite awhile. But Minchan and I have been doing a
fair amount of discussing offlist since lsf-mm, trying to come
to agreement on the semantics for the volatile ranges interface,
and after circling around each other's arguments for awhile (he'd
suggest and idea, I'd disagree, then I'd come around to agree just as
he would begin to disagree :) I think things have started to converge
pretty nicely, at least as far as the interface goes.
Some of the more interesting and challenging ideas we've explored
recently have been given up for now, mostly so we can get some core
agreed functionality moving upstream. We may still want to revisit
those ideas before the final push, but for now, we're focusing on
the parts we agree on that we think have a chance at eventually
being merged.
If you've read some of my earlier summaries, you'll likely find
this patchset much simplified:
* We only have one interface: vrange(address, len, mode, *purged),
which is used in a method similar to madvise on both file or
anonymous pages.
* We no longer have a concept of anon-only or private-volatility.
Despite the potential performance gains that Minchan liked in
avoiding the mmap_sem,the semantics were often confusing when using
private volatility on non-anonymous pages.
* We no longer have behavior flags. Potential extensions can still be
done via introducing new mode flags.
The patch set has also been heavily reworked and reordered to make
more iterative sense and hopefully to be easier to review.
Patches 1-5 are what we're wanting the most feedback on, since this
is the area dealing with the userland interface and the semantics of
how volatile ranges behave.
Patches 6-8 provide the back-end purging logic, which is likely
to change, and is provided only so folks can start playing around
with a functional patch series. It currently has some limitations,
like it doesn't purge anonymous pages on swap free systems.
Additionally, the newly integrated file page purging logic likely has
issues still to be resolved.
Overall, We still have the following TODOS with the patchset:
* Come to consensus on the best way to avoid inheriting mm_struct
volatility when the underlying vmas change. (see patch 4 in this
series)
* Ensure we zap underlying file page (ala truncate_inode_pages_range)
when we purge file pages - this make purging similar to file hole
punching and ensures we don't find stale data later. (patch 7)
* Avoid lockdep warnings caused by allocations made while holding vroot
lock triggering reclaim which could try to purge volatile ranges,
grabbing the same vroot lock. Minchan added a GFP_NO_VRANGE flag,
but we've not hooked that up into the reclaim logic to avoid purging.
* Re-integrate Minchan's logic to purge anonymous pages on swapfree
systems (dropped for this release to keep things simpler for review)
Any feedback and review would be greatly appreciated!
thanks!
-john
Volatile Ranges
==============
Volatile ranges provide a way for userland applications to provide
hints to the kernel, about memory that is not immediately in use and
can be regenerated if needed.
After marking a range as volatile, if the kernel experiences memory
pressure, the kernel can then purge those pages, freeing up additional
space. Userland can also tell the kernel it wants to use that memory
again, by marking the range non-volatile, after which the kernel will
not purge that memory.
If the kernel has already purged the memory when userland requests
it be made non-volatile, the kernel will return a warning value to
notify userland that the data was lost and must be regenerated.
If userland accesses memory marked volatile that has not been purged,
it will get the values it expects.
However, if userland touches volatile memory that has been purged, the
kernel will send it a SIGBUS. This makes it possible for userland to
handle the SIGBUS, marking the memory as non-volatile and re-generating
it as needed before continuing.
In some ways, the kernel's purging of memory can be considered
as similar to a delayed MADV_DONTNEED or FALLOC_FL_PUNCH_HOLE
operation, which can be canceled. Thus similarly to MADV_DONTNEED
or FALLOC_FL_PUNCH_HOLE, operations done on file data that is mmaped
shared will be seen by other processes who have that file mapped. Thus
if an application marks shared mmaped file data as volatile, that
volatility state is also shared across other tasks. This allows tasks
to coordinate for one task to mark shared file data as volatile, and a
second task to be able to unmark it if necessary. If the kernel purges
volatile file data that was marked by one task, all tasks sharing
that data will see the data as purged, and will have to mark it as
non-volatile before accessing it or will have to handle the SIGBUS.
All volatility on files is cleared when the last fd handle is closed.
Interface:
The vrange syscall is defined as follows:
int vrange(unsigned long address, size_t length, int mode, int* purged)
address: Starting address in the process where memory will be
marked. This must be page aligned
length: Length of the range to be marked. This must be in page
size units.
mode:
VRANGE_VOLATILE: Marks the specified range as volatile, and
able to be purged.
VRANGE_NONVOLATILE: Marks the specified range as non-volatile. If
any data in that range was volatile and has
been purged, 1 will be returned in the purged
pointer.
purged: Pointer to an integer that will be set to 1 if any data
in the range being marked non-volatile has been purged
and is lost. If it is zero, then no data in the
specified range has been lost.
Return values:
Returns the number of bytes marked or unmarked. Similar
to write(), it may return fewer bytes then specified
if it ran into a problem.
If an error (negative value) is returned,no changes
were made.
Errors:
EINVAL:
* address is not page-aligned, or is invalid.
* length is not a multiple of the page size.
* length is negative.
ENOMEM:
* Not enough memory.
EFAULT:
* Purge pointer is invalid.
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
Cc: Robert Love <rlove@xxxxxxxxxx>
Cc: Mel Gorman <mel@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Dmitry Adamushko <dmitry.adamushko@xxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Andrea Righi <andrea@xxxxxxxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Cc: Mike Hommey <mh@xxxxxxxxxxxx>
Cc: Taras Glek <tglek@xxxxxxxxxxx>
Cc: Dhaval Giani <dgiani@xxxxxxxxxxx>
Cc: Jan Kara <jack@xxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
Cc: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: linux-mm@xxxxxxxxx <linux-mm@xxxxxxxxx>
John Stultz (2):
vrange: Add vrange support for file address_spaces
vrange: Clear volatility on new mmaps
Minchan Kim (6):
vrange: Add basic data structure and functions
vrange: Add vrange support to mm_structs
vrange: Add new vrange(2) system call
vrange: Add GFP_NO_VRANGE allocation flag
vrange: Add method to purge volatile ranges
vrange: Send SIGBUS when user try to access purged page
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/file_table.c | 5 +
fs/inode.c | 2 +
include/asm-generic/pgtable.h | 11 +
include/linux/fs.h | 2 +
include/linux/gfp.h | 7 +-
include/linux/mm_types.h | 5 +
include/linux/rmap.h | 12 +-
include/linux/swap.h | 1 +
include/linux/vrange.h | 60 +++
include/linux/vrange_types.h | 19 +
include/uapi/asm-generic/mman-common.h | 3 +
init/main.c | 2 +
kernel/fork.c | 6 +
lib/Makefile | 2 +-
mm/Makefile | 2 +-
mm/ksm.c | 2 +-
mm/memory.c | 23 +-
mm/mmap.c | 5 +
mm/rmap.c | 30 +-
mm/swapfile.c | 36 ++
mm/vmscan.c | 16 +-
mm/vrange.c | 731 +++++++++++++++++++++++++++++++++
24 files changed, 963 insertions(+), 22 deletions(-)
create mode 100644 include/linux/vrange.h
create mode 100644 include/linux/vrange_types.h
create mode 100644 mm/vrange.c
--
1.8.1.2
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/