Hi John,

On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
>
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> than yours. :) In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)
>
> General Interface semantics:
> ----------------------------------------------
>
> The high level interface I've been pushing has so far stayed fairly
> consistent:
>
> Application marks a range of data as volatile. Volatile data may
> be purged at any time. Accessing volatile data is undefined, so
> applications should not do so. If the application wants to access
> data in a volatile range, it should mark it as non-volatile. If any
> of the pages in the range being marked non-volatile had been purged,
> the kernel will return an error, notifying the application that the
> data was lost.
>
> But one interesting new tweak on this design, suggested by Taras
> Glek and others at Mozilla, is as follows:
>
> Instead of leaving volatile data access as being undefined, when
> accessing volatile data, either the data expected will be returned
> if it has not been purged, or the application will get a SIGBUS when
> it accesses volatile data that has been purged.
>
> Everything else remains the same (error on marking non-volatile
> if data was purged, etc). This model allows applications to avoid
> having to unmark volatile data when it wants to access it, then
> immediately re-mark it as volatile when it's done. It is in effect

Just out of curiosity: why should the application re-mark the range as
volatile at all? It is already a volatile range, and the application
receives no signal as long as the data it touches has not been purged.
So I think it doesn't need to re-mark it.

"lazy" with its marking, allowing the kernel to hit it with a signalI like this model if plumbers really want it.
when it gets unlucky and touches purged data. From the signal handler,
the application can note the address it faulted on, unmark the range,
and regenerate the needed data before returning to execution.
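
Just to make the handler flow concrete, a minimal userspace sketch
might look like the below. Note that MADV_VOLATILE/MADV_NOVOLATILE and
regenerate_data() are placeholder names I am assuming for illustration,
not a settled interface:

/* Sketch of the lazy SIGBUS pattern. The MADV_* values below are
 * placeholders; no such advice values exist in any released kernel. */
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define MADV_VOLATILE   64      /* placeholder advice value */
#define MADV_NOVOLATILE 65      /* placeholder advice value */
#define PAGE_SZ         4096UL

/* Application-specific: recompute the purged contents of the range. */
extern void regenerate_data(void *addr, size_t len);

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
        (void)sig; (void)ctx;

        /* Note the faulting address and round it down to its page. */
        void *page = (void *)((uintptr_t)info->si_addr & ~(PAGE_SZ - 1));

        /* Unmark the page so it is backed by memory again, then
         * regenerate the lost data before execution resumes. */
        madvise(page, PAGE_SZ, MADV_NOVOLATILE);
        regenerate_data(page, PAGE_SZ);
}

int install_sigbus_handler(void)
{
        struct sigaction sa = { 0 };

        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        return sigaction(SIGBUS, &sa, NULL);
}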

> However, if applications don't want to deal with handling the
> sigbus, they can use the more straightforward (but more costly)
> unmark/access/mark pattern in the same way as my earlier proposals.
>
> This allows folks to balance the cost vs complexity in their
> application appropriately.
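
For comparison, the non-lazy bracket is something like this, reusing
the placeholder MADV_* names from the sketch above (exactly how "data
was purged" is reported back is still an open detail; I assume a -1
return here):

/* use_data() stands in for whatever the application does with the
 * range while it is safely non-volatile. */
extern void use_data(void *buf, size_t len);

/* Straightforward unmark/access/mark bracket around each access. */
static int access_volatile(void *buf, size_t len)
{
        if (madvise(buf, len, MADV_NOVOLATILE) < 0)
                return -1;      /* purged: caller must regenerate */

        use_data(buf, len);     /* safe while the range is non-volatile */

        return madvise(buf, len, MADV_VOLATILE); /* purgable again */
}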

> So that's a general overview of how the idea I'm proposing could
> be used.

My idea is that we don't need to move all pages in the range to the
tail of the LRU (or to a new LRU list) at mark time. Just move a single
page of the range there. Then, when the reclaimer starts looking for a
victim page, it can tell that page is volatile somehow (e.g. if we use
a new LRU list we know it trivially; otherwise we can add a new VMA
flag, VM_VOLATILE, and detect it with a tweak to
page_check_references()), isolate all the pages of the range from the
middle of the LRU list, and reclaim them all at once. The cost of
marking is then just the search cost for finding one in-memory page of
the range, plus moving that one page between LRUs or from the middle to
the tail. In other words, we move the cost from mark/unmark time to
reclaim time.
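
As a rough sketch of the reclaim-side detection (against a 3.x-era
mm/vmscan.c; page_in_volatile_range() is a hypothetical helper that
would walk the reverse mapping and check the VMAs for VM_VOLATILE):

/* Tweak to page_check_references(): a page in a volatile range can be
 * reclaimed immediately, regardless of its recent references. */
static enum page_references page_check_references(struct page *page,
                                                  struct scan_control *sc)
{
        if (page_in_volatile_range(page))
                return PAGEREF_RECLAIM; /* purge now, no swapout needed */

        /* ... existing reference/aging checks unchanged ... */
}
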
[snip]

> Specific Interface semantics:
> ---------------------------------------------
>
> Here are some of the open questions about how the user interface
> should look:
>
> fd based interfaces vs madvise:
>
> In talking with Taras Glek, he pointed out that for his
> needs, the fd based interface is a little annoying, as it
> requires having to get access to a tmpfs file and mmap it in,
> then instead of just referencing a pointer to the data he
> wants to mark volatile, he has to calculate the offset from
> the start of the mmap and pass those file offsets to the interface.
> Instead he mentioned that using something like madvise would be
> much nicer, since they could just pass a pointer to the object
> in memory they want to make volatile and avoid the extra work.
>
> I'm not opposed to adding an madvise interface for this as
> well, but since we have an existing use case with Android's
> ashmem, I want to make sure we support this existing behavior.
> Specifically, as with ashmem, applications can be sharing
> these tmpfs fds, and so file-relative volatile ranges make
> more sense if you need to coordinate what data is volatile
> between two applications.
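
To illustrate the annoyance, the fd-based flow requires translating
pointers back into file offsets before every mark/unmark. Here
FALLOC_FL_MARK_VOLATILE is a placeholder name and value for the
fd-based mark operation:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

#define FALLOC_FL_MARK_VOLATILE 0x100   /* placeholder flag value */

/* Mark an object volatile via its tmpfs fd: the pointer must first be
 * converted to an offset relative to the start of the mapping. */
static int mark_volatile_fd(int fd, char *map_base, char *obj, size_t len)
{
        off_t off = (off_t)(obj - map_base); /* pointer -> file offset */

        return fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, (off_t)len);
}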

> Also, while I agree that having an madvise interface for
> volatile ranges would be nice, it does open up some more
> complex implementation issues, since with files, there is a
> fixed relationship between pages and the files' address_space
> mapping, where you can't have pages shared between different
> mappings. This makes it easy to hang the volatile-range tree
> off of the mapping (well, indirectly via a hash table). With
> general anonymous memory, pages can be shared between multiple
> processes, and as far as I understand, don't have any grouping
> structure we could use to determine if the page is in a
> volatile range or not. We would also need to determine more
> complex questions like: What are the semantics of volatility
> with copy-on-write pages? I'm hoping to investigate this
> idea more deeply soon so I can be sure whatever is pushed has
> a clear plan of how to address this idea. Further thoughts
> here would be appreciated.

I like the madvise interface, because an allocator could use it for its
memory pool. If the allocator has free memory that the application has
just returned, it can mark it volatile, so the VM can reclaim those
pages without swapout when memory pressure happens, and it can unmark
them before allocating from that memory again. That would be more
efficient than calling munmap or madvise(MADV_DONTNEED), since those
operations require page table work for the whole range, and even VMA
unlinking in the munmap case.
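
For example (placeholder MADV_* names again), the allocator's free and
reuse paths could look like this; note an allocator doesn't care that
purged pages come back empty, since it hands out uninitialized memory
anyway:

/* Pool free/reuse using volatility instead of MADV_DONTNEED. */
static void pool_free(void *chunk, size_t len)
{
        /* Keep the mapping and page tables intact; just allow the VM
         * to take the pages back if memory gets tight. */
        madvise(chunk, len, MADV_VOLATILE);
}

static void *pool_alloc_reuse(void *chunk, size_t len)
{
        /* If pages were purged meanwhile, their contents are gone, but
         * that's fine for memory we're about to hand out as "new". */
        madvise(chunk, len, MADV_NOVOLATILE);
        return chunk;
}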

For that, we can add a new VMA flag, VM_VOLATILE, and use the reverse
mapping as the grouping structure. For COW semantics, I think we can
discard a shared page only if every VMA that maps it has VM_VOLATILE
set.

[snip]

> It would really be great to get any thoughts on these issues, as they
> are higher-priority to me than diving into the details of how we
> implement this internally, which can shift over time.
>
> Implementation Considerations:
> ---------------------------------------------
>
> * Purging order between volatile ranges:
>
> Again, since it makes sense to purge all the complete
> pages in a range at the same time, we need to consider the
> subtle difference between the least-recently-used pages vs
> least-recently-used ranges. A single range could contain very
> frequently accessed data, as well as rarely accessed data.
> One must also consider that the act of marking a range as
> volatile may not actually touch the underlying pages. Thus
> purging ranges based on a least-recently-used page may also
> result in purging the most-recently used page.
>
> Android addressed the purging order question by purging ranges
> in the order they were marked volatile. Thus the oldest
> volatile range is the first range to be purged. This works
> well in the Android model, as applications aren't supposed
> to access volatile data, so the least-recently-marked-volatile
> order maps well to the least-recently-used-range.
>
> However, this assumption doesn't hold with the lazy SIGBUS
> notification method, as pages in a volatile range may continue
> to be accessed after the range is marked volatile. So the
> question as to what is the best order of purging volatile
> ranges is definitely open.
>
> Abstractly the ideal solution might be to evaluate the
> most-recently used page in each range, and to purge the range
> with the oldest recently-used-page, but I suspect this is
> not something that could be calculated efficiently.
>
> Additionally, in my conversations with Taras, he pointed out
> that if we are using a one-application-at-a-time UI model,
> it would be ideal to discourage purging volatile data used by
> the current application, instead prioritizing volatile ranges
> from applications that aren't active. However, I'm not sure
> what mechanism could be used to prioritize range purging in
> this fashion, especially considering volatile ranges can be
> on data that is shared between applications.

My thought is to "let it be", without creating a new LRU list or
deactivating volatile pages to the tail of the LRU for early reclaim.
That means volatile pages have the same priority as other normal pages.
In my perception, volatile doesn't mean "reclaim early"; it means "we
don't need to swap these pages out in order to reclaim them."

In summary, if we select the lazy SIGBUS model, the idea I am
suggesting now is as follows:

1) use madvise, plus the VMA reverse mapping, with a new VM_VOLATILE
   flag
2) mark:
   just set VM_VOLATILE in the VMA
3) treat volatile pages the same as normal pages from the point of
   view of aging
   (of course, on a system without swap we have to tweak the VM to
   reclaim volatile pages; for example, we can move all volatile pages
   to the tail of the inactive list when swapoff happens, and the VM
   can peek at the tail of the inactive LRU list without aging when
   memory pressure happens)
4) unmark:
   just clear VM_VOLATILE in the VMA, and return an error if data in
   the range was purged
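
A skeletal sketch of steps 2) and 4) as pure VMA-flag updates follows.
VM_VOLATILE is the proposed new flag; vma->purged is a hypothetical
marker that reclaim would set when it discards pages from the range,
and the error value is only illustrative (VMA splitting and locking
are omitted):

/* Step 2: marking touches only the VMA, no per-page work. */
static long madvise_mark_volatile(struct vm_area_struct *vma)
{
        vma->vm_flags |= VM_VOLATILE;
        return 0;
}

/* Step 4: unmark, reporting whether data was lost while volatile. */
static long madvise_unmark_volatile(struct vm_area_struct *vma)
{
        long ret = vma->purged ? -ENODATA : 0;

        vma->vm_flags &= ~VM_VOLATILE;
        vma->purged = 0;
        return ret;
}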