Re: [PATCH] mm: shmem: implement POSIX_FADV_[WILL|DONT]NEED for shmem

From: Suren Baghdasaryan
Date: Tue Nov 02 2021 - 14:12:25 EST


()

On Mon, Nov 1, 2021 at 7:40 AM Charan Teja Kalla
<charante@xxxxxxxxxxxxxx> wrote:
>
> Thanks for your comments Suren.
>
> On 10/29/2021 12:10 AM, Suren Baghdasaryan wrote:
> > On Fri, Oct 8, 2021 at 7:07 AM Charan Teja Reddy
> > <charante@xxxxxxxxxxxxxx> wrote:
> >>
> >> Currently fadvise(2) is supported only for the files that doesn't
> >> associated with noop_backing_dev_info thus for the files, like shmem,
> >> fadvise results into NOP. But then there is file_operations->fadvise()
> >> that lets the file systems to implement their own fadvise
> >> implementation. Use this support to implement some of the POSIX_FADV_XXX
> >> functionality for shmem files.
> >>
> >> This patch aims to implement POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
> >> advices to shmem files which can be helpful for the drivers who may want
> >> to manage the shmem pages of the files that are created through
> >> shmem_file_setup[_with_mnt](). An example usecase may be like, driver
> >> can create the shmem file of the size equal to its requirements and
> >> map the pages for DMA and then pass the fd to user. The user who knows
> >> well about the usage of these pages can now decide when these pages are
> >> not required push them to swap through DONTNEED thus free up memory well
> >> in advance rather than relying on the reclaim and use WILLNEED when it
> >> decide that they are useful in the near future. IOW, it lets the clients
> >> to free up/read the memory when it wants to. Another usecase is that GEM
> >> objets which are currenlty allocated and managed through shmem files can
> >> use vfs_fadvise(DONT|WILLNEED) on shmem fd when the driver comes to
> >> know(like through some hints from user space) that GEM objects are not
> >> going to use/will need in the near future.
> >
> > Sounds useful.
> > You should probably add a documentation section for manual pages.
>
> Sure. Will add them.
>
> >
> >>
> >> Signed-off-by: Charan Teja Reddy <charante@xxxxxxxxxxxxxx>
> >> ---
> >> mm/shmem.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 139 insertions(+)
> >>
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 70d9ce2..ab7ea33 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -38,6 +38,8 @@
> >> #include <linux/hugetlb.h>
> >> #include <linux/frontswap.h>
> >> #include <linux/fs_parser.h>
> >> +#include <linux/mm_inline.h>
> >> +#include <linux/fadvise.h>
> >>
> >> #include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */
> >>
> >> @@ -2792,6 +2794,142 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> >> return error;
> >> }
> >>
> >> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
> >> + loff_t end)
> >> +{
> >> + int ret;
> >> + struct writeback_control wbc = {
> >> + .sync_mode = WB_SYNC_NONE,
> >> + .nr_to_write = LONG_MAX,
> >> + .range_start = 0,
> >> + .range_end = LLONG_MAX,
> >> + .for_reclaim = 1,
> >> + };
> >> + struct page *page;
> >> +
> >> + XA_STATE(xas, &mapping->i_pages, start);
> >
> > XA_STATE is a variable definition, so I think the empty line should be
> > moved down here.
>
> Done.
>
> >
> >> + if (!shmem_mapping(mapping))
> >> + return -EINVAL;
> >> +
> >> + if (!total_swap_pages)
> >> + return 0;
> >> +
> >> + lru_add_drain();
> >> +
> >> + rcu_read_lock();
> >> + xas_for_each(&xas, page, end) {
> >> + if (xas_retry(&xas, page))
> >> + continue;
> >> + if (xa_is_value(page))
> >> + continue;
> >> + if (isolate_lru_page(page))
> >> + continue;
> >
> > isolate_lru_page comment says that it "Must be called with an elevated
> > refcount on the page". Are you ensuring that?
>
> Agree here. Will update it accordingly.
>
> > Also not sure if you need to isolate the page before this writeback.
>
> Isolation may not be required but this is to avoid trying of the reclaim
> on the same page in some parallel path as both any way places the page
> in the swap. Or you want me to remove the isolation here?

Maybe you could isolate them in bulk then? Seems like a waste
isolating them one-by-one.

> >
> >> +
> >> + inc_node_page_state(page, NR_ISOLATED_ANON +
> >> + page_is_file_lru(page));
> >> + lock_page(page);
> >> + ClearPageDirty(page);
> >
> > I see that other places use clear_page_dirty_for_io() to clear the
> > dirty flag before writepage(), don't you need it here?
>
> I think I need to do it here. Will update it in V2.
> >
> >> + SetPageReclaim(page);
> >> + ret = shmem_writepage(page, &wbc);
> >> + if (!PageWriteback(page))
> >> + ClearPageReclaim(page);
> >> + if (ret) {
> >> + unlock_page(page);
> >> + putback_lru_page(page);
> >> + dec_node_page_state(page, NR_ISOLATED_ANON +
> >> + page_is_file_lru(page));
> >> + continue;
> >> + }
> >> +
> >> + /*
> >> + * shmem_writepage() place the page in the swapcache.
> >> + * Delete the page from the swapcache and release the
> >> + * page.
> >
> > Won't deleting the page from swap cache interfere with ongoing
> > writeback if it has not yet completed?
>
> I just followed the path of how to reclaim the page:
> shrink_page_list()
> pageout()
> ..........
> __remove_mapping()
> __delete_from_swap_cache()
>
> You see some issue here which I can't understand.

I think shrink_page_list() would not call __remove_mapping() if after
pageout() PageWriteback() is still true. Maybe I'm missing some path?

>
>
> >
> >> + */
> >> + lock_page(page);
> >
> > You have the page already locked, the above seems extra.
>
> I think shmem_writepage unlocks the page If the page write is successful.

Yes, you are right, I missed that.

>
> >
> >> + delete_from_swap_cache(page);
> >> + unlock_page(page);
> >> + dec_node_page_state(page, NR_ISOLATED_ANON +
> >> + page_is_file_lru(page));
> >> + put_page(page);
> >> + if (need_resched()) {
> >> + xas_pause(&xas);
> >> + cond_resched_rcu();
> >> + }
> >> + }
> >> + rcu_read_unlock();
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int shmem_fadvise_willneed(struct address_space *mapping,
> >> + pgoff_t start, pgoff_t long end)
> >> +{
> >> + struct page *page;
> >> +
> >> + XA_STATE(xas, &mapping->i_pages, start);
> >
> > Empty line should be moved here.
>
> Done.
>
> >
> >> + rcu_read_lock();
> >> + xas_for_each(&xas, page, end) {
> >> + if (!xa_is_value(page))
> >> + continue;
> >> + page = shmem_read_mapping_page(mapping, xas.xa_index);
> >> + if (!IS_ERR(page))
> >> + put_page(page);
> >> +
> >> + if (need_resched()) {
> >> + xas_pause(&xas);
> >> + cond_resched_rcu();
> >> + }
> >> + }
> >> + rcu_read_unlock();
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int shmem_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> >> +{
> >> + loff_t endbyte;
> >> + pgoff_t start_index;
> >> + pgoff_t end_index;
> >> + struct address_space *mapping;
> >> + int ret = 0;
> >> +
> >> + mapping = file->f_mapping;
> >> + if (!mapping || len < 0)
> >> + return -EINVAL;
> >> +
> >> + endbyte = (u64)offset + (u64)len;
> >
> > Why are these typecasts needed? All involved vars seem to be of the
> > same loff_t type. Are you worried about overflow after addition? If so
> > why not just:
> >
> > if (offset + len < 0) return -EINVAL;
> >
> >
> >> + if (!len || endbyte < len)
> >> + endbyte = -1;
> >> + else
> >> + endbyte--;
> >
> > You already checked that len is not negative. You could simply add:
> >
> > if (!len) return 0;
>
> I just followed the logic mentioned in mm/fadvise.c
> @https://elixir.bootlin.com/linux/v5.15/source/mm/fadvise.c#L68
>
> /*
> * Careful about overflows. Len == 0 means "as much as possible". Use
> * unsigned math because signed overflows are undefined and UBSan
> * complains.
> */
> endbyte = (u64)offset + (u64)len;
> if (!len || endbyte < len)
> endbyte = -1;
> else
> endbyte--; /* inclusive */
>
> >
> >> +
> >> +
> >> + start_index = offset >> PAGE_SHIFT;
> >> + end_index = endbyte >> PAGE_SHIFT;
> >> + switch (advice) {
> >> + case POSIX_FADV_DONTNEED:
> >> + ret = shmem_fadvise_dontneed(mapping, start_index, end_index);
> >> + break;
> >> + case POSIX_FADV_WILLNEED:
> >> + ret = shmem_fadvise_willneed(mapping, start_index, end_index);
> >> + break;
> >> + case POSIX_FADV_NORMAL:
> >> + case POSIX_FADV_RANDOM:
> >> + case POSIX_FADV_SEQUENTIAL:
> >> + case POSIX_FADV_NOREUSE:
> >> + /*
> >> + * No bad return value, but ignore advice. May have to
> >> + * implement in future.
> >> + */
> >
> > Returning ENOSYS might be better here IMHO. When some of these options
> > are implemented, the userspace will be able to detect if the kernel
> > supports it by checking for that error code. If you return 0 in both
> > cases such detection will be impossible.
> >
> >> + break;
>
> Just followed what is mentioned @
> https://elixir.bootlin.com/linux/v5.15/source/mm/fadvise.c#L61.
>
>
> >> + default:
> >> + return -EINVAL;
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
> >> {
> >> struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
> >> @@ -3799,6 +3937,7 @@ static const struct file_operations shmem_file_operations = {
> >> .splice_write = iter_file_splice_write,
> >> .fallocate = shmem_fallocate,
> >> #endif
> >> + .fadvise = shmem_fadvise,
> >> };
> >>
> >> static const struct inode_operations shmem_inode_operations = {
> >> --
> >> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> >> member of the Code Aurora Forum, hosted by The Linux Foundation
> >>
> >>
>
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
> Forum, a Linux Foundation Collaborative Project