Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and_NONVOLATILE flags
From: Andrew Morton
Date: Tue Nov 22 2011 - 15:52:11 EST
On Mon, 21 Nov 2011 19:33:08 -0800
John Stultz <john.stultz@xxxxxxxxxx> wrote:
> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow it to be discarded if the
> kernel wants to reclaim memory.
>
> This is useful for userspace to allocate things like caches, and lets
> the kernel destructively (but safely) reclaim them when there's memory
> pressure.
>
> Right now, we can simply throw away pages if they are clean (backed
> by a current on-disk copy). That only happens for anonymous/tmpfs/shmfs
> pages when they're swapped out. This patch lets userspace select
> dirty pages which can be simply thrown away instead of writing them
> to disk first. See the mm/shmem.c for this bit of code. It's
> different from FADV_DONTNEED since the pages are not immediately
> discarded; they are only discarded under pressure.
>
> This is very much influenced by the Android Ashmem interface by
> Robert Love so credits to him and the Android developers.
> In many cases the code & logic come directly from the ashmem patch.
> The intent of this patch is to allow for ashmem-like behavior, but
> embeds the idea a little deeper into the VM code, instead of isolating
> it into a specific driver.
>
> I'm very much a newbie at the VM code, so At this point, I just want
> to try to get some input on the patch, so if you have another idea
> for using something other then fadvise, or other thoughts on how the
> volatile ranges are stored, I'd be really interested in hearing them.
> So let me know if you have any comments for feedback!
>
> Also many thanks to Dave Hansen who helped design and develop the
> initial version of this patch, and has provided continued review and
> mentoring for me in the VM code.
I'm interestedly watching the design/use-case discussion. Meanwhile,
some comments on the implementation.
>
> ...
>
> +#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */
> +#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */
> +#define POSIX_FADV_ISVOLATILE 10 /* Returns volatile flag for region */
linux-man@xxxxxxxxxxxxxxx will want to be told all about these at some
stage.
> +
> +
> #endif /* FADVISE_H_INCLUDED */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e313022..4f15ade 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -10,6 +10,7 @@
> #include <linux/ioctl.h>
> #include <linux/blk_types.h>
> #include <linux/types.h>
> +#include <linux/volatile.h>
volatile is a C keyword. This is a bit confusing/misleading.
> /*
> * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> @@ -650,6 +651,7 @@ struct address_space {
> spinlock_t private_lock; /* for use by the address_space */
> struct list_head private_list; /* ditto */
> struct address_space *assoc_mapping; /* ditto */
> + struct list_head volatile_list; /* volatile range list */
Comment should tell us what lock protects this.
It appers to be i_mmap_lock, which is weird.
> } __attribute__((aligned(sizeof(long))));
> /*
> * On most architectures that alignment is already the case; but
> diff --git a/include/linux/volatile.h b/include/linux/volatile.h
> new file mode 100644
> index 0000000..11e8a3e
> --- /dev/null
> +++ b/include/linux/volatile.h
> @@ -0,0 +1,34 @@
> +#ifndef _LINUX_VOLATILE_H
> +#define _LINUX_VOLATILE_H
> +
> +struct address_space;
> +
> +
> +struct volatile_range {
> + /*
> + * List is sorted, and no two ranges
sorted by pgoff_t, it appears.
> + * on the same list should overlap.
> + */
> + struct list_head unpinned;
What's this do? It appears to be the list anchored in
address_space.volatile_list and protected by i_mmap_lock to maintain
all these things. "unpinned" is a strange name for such a thing.
> + pgoff_t start_page;
> + pgoff_t end_page;
> + unsigned int purged;
Some description here would be good. What's it for, when is it set and
cleared.
> +};
> +
> +static inline bool page_in_range(struct volatile_range *range,
> + pgoff_t page_index)
> +{
> + return (range->start_page <= page_index) &&
> + (range->end_page >= page_index);
> +}
"page_in_range" is too vague a name for this...
> +extern long mapping_range_volatile(struct address_space *mapping,
> + pgoff_t start_index, pgoff_t end_index);
> +extern long mapping_range_nonvolatile(struct address_space *mapping,
> + pgoff_t start_index, pgoff_t end_index);
> +extern long mapping_range_isvolatile(struct address_space *mapping,
> + pgoff_t start_index, pgoff_t end_index);
> +extern void mapping_clear_volatile_ranges(struct address_space *mapping);
> +
> +
> +#endif /* _LINUX_VOLATILE_H */
>
> ...
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -679,6 +679,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> index = page->index;
> inode = mapping->host;
> info = SHMEM_I(inode);
> +
> + /* Check if page is in volatile range */
> + if (!list_empty(&mapping->volatile_list)) {
> + struct volatile_range *range, *next;
> + list_for_each_entry_safe(range, next, &mapping->volatile_list,
> + unpinned) {
> + if (page_in_range(range, index)) {
> + range->purged = 1;
> + unlock_page(page);
> + return 0;
> + }
> + }
> + }
That's very optimistic code :(
We've handed users a way in which to consume arbitrarily vast amounts
of kernel CPU time. Putting a cond_resched() in there won't fix this.
Also, the volatile_range's are kmalloced so we have also given the user
a way of quickly consuming unlimited amounts of kernel memory. Not
good.
>
> ...
>
> +/*
> + * Allocates a volatile_range, and adds it to the address_space's
> + * volatile list
> + */
> +static int volatile_range_alloc(struct volatile_range *prev_range,
> + unsigned int purged,
> + pgoff_t start_index, pgoff_t end_index)
> +{
> + struct volatile_range *range;
> +
> + range = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
> + if (!range)
> + return -ENOMEM;
> +
> + range->start_page = start_index;
> + range->end_page = end_index;
> + range->purged = purged;
> +
> + list_add_tail(&range->unpinned, &prev_range->unpinned);
> +
> + return 0;
> +}
Back in, ahem, 1999 I wrote a piece of tree code which I think does
this. Each node represents a span and the nodes are kept sorted in the
tree and it does merging and splitting of nodes as ranges are added.
Removal and lookup don't appear to be implemented yet, but they're easy
enough. Pretty complex but everything is O(log(n)) and I think it
could be made to work here.
<rummage rummage>
OK, see attached.
Also, do you know what this code is like? Posix and BSD file locking.
It has the same requirement to maintain state about arbitrary spans of
a single file.
The locking code is an utter pig and has well known (?) O(N^2) and even
O(N^3) failure modes. But don't let that deter you ;) We should
strenuously avoid duplicating such a similar thing.
Attachment:
mumbletree.tar.gz
Description: Binary data