Re: [PATCH] [RFC] fadvise: Add _VOLATILE, _ISVOLATILE, and _NONVOLATILE flags
From: John Stultz
Date: Tue Nov 22 2011 - 14:49:16 EST
On Tue, 2011-11-22 at 04:37 -0500, Rik van Riel wrote:
> On 11/21/2011 10:33 PM, John Stultz wrote:
> > This patch provides new fadvise flags that can be used to mark
> > file pages as volatile, which will allow it to be discarded if the
> > kernel wants to reclaim memory.
> >
> > This is useful for userspace to allocate things like caches, and lets
> > the kernel destructively (but safely) reclaim them when there's memory
> > pressure.
> >
> > Right now, we can simply throw away pages if they are clean (backed
> > by a current on-disk copy). That only happens for anonymous/tmpfs/shmfs
> > pages when they're swapped out. This patch lets userspace select
> > dirty pages which can be simply thrown away instead of writing them
> > to disk first. See the mm/shmem.c for this bit of code. It's
> > different from FADV_DONTNEED since the pages are not immediately
> > discarded; they are only discarded under pressure.
>
> I've got a few questions:
>
> 1) How do you tell userspace some of its data got
> discarded?
You get a return code when marking the page non-volatile if it has been
discarded. This follows the ashmem style that Robert described in the
other mail.
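
To make the calling pattern concrete, here is a rough userspace model in Python. It does not talk to the kernel at all. The class VolatileRange and the names mark_volatile, mark_nonvolatile, simulate_pressure and get_cached are made up for illustration and merely stand in for the proposed flags. The only behaviour taken from this thread is that marking a range non-volatile tells you whether it was discarded in the meantime.

```python
class VolatileRange:
    """Toy stand-in for a file range managed with the proposed flags."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.volatile = False
        self.purged = False

    def mark_volatile(self) -> None:
        # Stands in for the proposed _VOLATILE advice: contents may vanish.
        self.volatile = True

    def simulate_pressure(self) -> None:
        # Stand-in for kernel reclaim: only volatile ranges are discarded.
        if self.volatile:
            self.data = bytearray(len(self.data))
            self.purged = True

    def mark_nonvolatile(self) -> bool:
        # Stands in for the proposed _NONVOLATILE advice.
        # Returns True if the contents were discarded while volatile.
        was_purged = self.purged
        self.volatile = False
        self.purged = False
        return was_purged


def get_cached(rng: VolatileRange, regenerate) -> bytes:
    """Pin the range, rebuild it if it was discarded, return a copy."""
    if rng.mark_nonvolatile():
        rng.data[:] = regenerate()
    result = bytes(rng.data)
    rng.mark_volatile()  # unpin again once we are done with the data
    return result
```

The caller never has to ask up front whether the data is still there; it finds out at pin time and regenerates only when needed.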
> 2) How do you prevent the situation where every
> volatile object gets a few pages discarded, making
> them all unusable?
> (better to throw away an entire object at once)
Indeed. One of the issues folks brought up about the ashmem code was
that it manages its own lru. This attempt just simplifies the code by
using the kernel's own lru, but it has the drawback that it is
page-based instead of object- or range-based.

We could try to zap the entire range when a page from the range is
written out, or we could go back to using a range-based lru, like ashmem
does.
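
The "zap the entire range" option could be modelled roughly like this. It is a userspace sketch in Python, not code from the patch; RangeTracker, add_volatile and evict_page are hypothetical names, and ranges are given directly in page numbers to keep the arithmetic out of the way. When reclaim picks any one page inside a volatile range, the whole range is dropped, so the remaining objects stay fully intact.

```python
class RangeTracker:
    """Tracks volatile ranges as half-open [start, end) page intervals."""

    def __init__(self):
        self.ranges = []     # list of (start_page, end_page) tuples
        self.purged = set()  # ranges that were discarded as a whole

    def add_volatile(self, start_page: int, npages: int) -> tuple:
        rng = (start_page, start_page + npages)
        self.ranges.append(rng)
        return rng

    def evict_page(self, page: int) -> list:
        """Model reclaim choosing one page; return every page to drop."""
        for rng in self.ranges:
            start, end = rng
            if start <= page < end:
                # One page of the object is going away, so the object is
                # useless anyway: discard the whole range in one go.
                self.purged.add(rng)
                return list(range(start, end))
        return []  # page is not in any volatile range, nothing to zap
```

A linear scan over the ranges is only good enough for a sketch; a real implementation would presumably want an interval tree or similar.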
> 3) Isn't it too slow for something like Firefox to
> create a new tmpfs object for every single throw-away
> cache object?
So, if you mean creating a new file for every cache object, that doesn't
seem necessary, as you could map a number of objects into the same file
and mark the ranges as volatile or not as needed.
Or are you worried about the allocation of the range structure when we
mark a region as volatile?
Either way, I'd defer to Robert on real-world usage.
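
For the first reading, packing several cache objects into one file might look something like the sketch below. It assumes a 4096-byte page, and layout_objects is a hypothetical helper, not part of any existing API. Each object gets a page-aligned slot, so a range marked volatile never shares a page with a neighbouring object.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def layout_objects(sizes):
    """Assign each object a page-aligned (offset, length) slot in one file.

    Returns the list of slots and the total file size required.
    """
    slots = []
    offset = 0
    for size in sizes:
        # Round the object size up to a whole number of pages.
        length = -(-size // PAGE_SIZE) * PAGE_SIZE
        slots.append((offset, length))
        offset += length
    return slots, offset
```

Each (offset, length) pair is then what would be passed along with the volatile or non-volatile advice for that object.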
> Johannes, Jon and I have looked at an alternative way to
> allow the kernel and userspace to cooperate in throwing
> out cached data. This alternative way does not touch
> the alloc/free fast path at all, but does require some
> cooperation at "shrink cache" time.
>
> The idea is quite simple:
>
> 1) Every program that we are interested in already has
> some kind of main loop where it polls on file descriptors.
> It is easy for such programs to add an additional file,
> which would be a device or sysfs file that wakes up the
> program from its poll/select loop when memory is getting
> full to the point that userspace needs to shrink its
> caches.
>
> The kernel can be smart here and wake up just one process
> at a time, targeting specific NUMA nodes or cgroups. Such
> kernel smarts do not require additional userspace changes.
>
> 2) When userspace gets such a "please shrink your caches"
> event, it can do various things. A program like firefox
> could throw away several cached objects, eg. uncompressed
> images or entire pre-rendered tabs, while a JVM can shrink
> its heap size and a database could shrink its internal
> cache.
So similarly to Robert, I don't see this approach as necessarily
exclusive to the volatile flags. There are just some tradeoffs with the
different approaches.
The upside with your approach is that applications don't have to
remember to re-pin the cache before using it and unpin it once they are
done with it.

The downside is that the "please shrink your caches" message is likely
to arrive when the system is already low on resources. Once applications
have been asked to "be nice and get small!", having to wait for that
action to occur might not be great. Whereas with the volatile regions,
there are simply additional easily freeable pages available when the
kernel needs them.
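
For comparison, the notification scheme would fit into an existing main loop roughly as sketched below, again in Python. No such device or sysfs file exists as far as this thread goes, so the pressure descriptor here is just any readable fd (a pipe works for experimenting), and shrink_cache and poll_once are hypothetical names.

```python
import os
import selectors


def shrink_cache(cache: dict, keep: int) -> int:
    """Drop the oldest entries until at most `keep` remain."""
    dropped = 0
    while len(cache) > keep:
        cache.pop(next(iter(cache)))  # dicts preserve insertion order
        dropped += 1
    return dropped


def poll_once(sel, pressure_fd, cache, keep=1, timeout=1.0) -> int:
    """One iteration of the main loop; returns how many entries were dropped."""
    dropped = 0
    for key, _mask in sel.select(timeout):
        if key.fd == pressure_fd:
            os.read(pressure_fd, 1)  # consume the notification
            dropped += shrink_cache(cache, keep)
        # A real program would dispatch its other descriptors here.
    return dropped
```

Note that the memory only comes back after the process has been scheduled and has run its shrink path, which is exactly the latency concern above.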
thanks
-john