Re: [PATCHv6 07/37] filemap: allocate huge page in page_cache_read(), if allowed

From: Kirill A. Shutemov
Date: Mon Feb 13 2017 - 10:20:10 EST


On Thu, Feb 09, 2017 at 01:18:35PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:49PM +0300, Kirill A. Shutemov wrote:
> > Later we can add logic to accumulate information from shadow entries to
> > return to caller (average eviction time?).
>
> I would say minimum rather than average. That will become the refault
> time of the entire page, so minimum would probably have us making better
> decisions?

Yes, makes sense.

> > +	/* Wipe shadow entries */
> > +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
> > +			page->index) {
> > +		if (iter.index >= page->index + hpage_nr_pages(page))
> > +			break;
> >  
> >  		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
> > -		if (!radix_tree_exceptional_entry(p))
> > +		if (!p)
> > +			continue;
>
> Just FYI, this can't happen. You're holding the tree lock so nobody
> else gets to remove things from the tree. radix_tree_for_each_slot()
> only gives you the full slots; it skips the empty ones for you. I'm
> OK if you want to leave it in out of an abundance of caution.

I'll drop it.

> > +		__radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL,
> > +				workingset_update_node, mapping);
>
> I may add an update_node argument to radix_tree_join at some point,
> so you can use it here. Or maybe we don't need to do that, and what
> you have here works just fine.
>
> >  		mapping->nrexceptional--;
>
> ... because adjusting the exceptional count is going to be a pain.

Yeah..

> > +	error = __radix_tree_insert(&mapping->page_tree,
> > +			page->index, compound_order(page), page);
> > +	/* This shouldn't happen */
> > +	if (WARN_ON_ONCE(error))
> > +		return error;
>
> A lesser man would have just ignored the return value from
> __radix_tree_insert. I salute you.
>
> > @@ -2078,18 +2155,34 @@ static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
> >  {
> >  	struct address_space *mapping = file->f_mapping;
> >  	struct page *page;
> > +	pgoff_t hoffset;
> >  	int ret;
> >  
> >  	do {
> > -		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> > +		page = page_cache_alloc_huge(mapping, offset, gfp_mask);
> > +no_huge:
> > +		if (!page)
> > +			page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> >  		if (!page)
> >  			return -ENOMEM;
> >  
> > -		ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL);
> > -		if (ret == 0)
> > +		if (PageTransHuge(page))
> > +			hoffset = round_down(offset, HPAGE_PMD_NR);
> > +		else
> > +			hoffset = offset;
> > +
> > +		ret = add_to_page_cache_lru(page, mapping, hoffset,
> > +				gfp_mask & GFP_KERNEL);
> > +
> > +		if (ret == -EEXIST && PageTransHuge(page)) {
> > +			put_page(page);
> > +			page = NULL;
> > +			goto no_huge;
> > +		} else if (ret == 0) {
> >  			ret = mapping->a_ops->readpage(file, page);
> > -		else if (ret == -EEXIST)
> > +		} else if (ret == -EEXIST) {
> >  			ret = 0;	/* losing race to add is OK */
> > +		}
> >  
> >  		put_page(page);
>
> If the filesystem returns AOP_TRUNCATED_PAGE, you'll go round this loop
> again trying the huge page again, even if the huge page didn't work
> the first time. I would tend to think that if the huge page failed the
> first time, we shouldn't try it again, so I propose this:

AOP_TRUNCATED_PAGE is positive, so I don't see how you avoid trying the
huge page again on the second iteration. Hm?

>
> 	struct address_space *mapping = file->f_mapping;
> 	struct page *page;
> 	pgoff_t index;
> 	int ret;
> 	bool try_huge = true;
> 
> 	do {
> 		if (try_huge) {
> 			page = page_cache_alloc_huge(gfp_mask|__GFP_COLD);
> 			if (page)
> 				index = round_down(offset, HPAGE_PMD_NR);
> 			else
> 				try_huge = false;
> 		}
> 
> 		if (!try_huge) {
> 			page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> 			index = offset;
> 		}
> 
> 		if (!page)
> 			return -ENOMEM;
> 
> 		ret = add_to_page_cache_lru(page, mapping, index,
> 				gfp_mask & GFP_KERNEL);
> 		if (ret < 0) {
> 			if (try_huge) {
> 				try_huge = false;
> 				ret = AOP_TRUNCATED_PAGE;
> 			} else if (ret == -EEXIST)
> 				ret = 0;	/* losing race to add is OK */
> 		} else {
> 			ret = mapping->a_ops->readpage(file, page);
> 		}
> 
> 		put_page(page);
> 	} while (ret == AOP_TRUNCATED_PAGE);
>
> But ... maybe it's OK to retry the huge page. I mean, not many
> filesystems return AOP_TRUNCATED_PAGE, and they only do so rarely.

What about this:

	struct address_space *mapping = file->f_mapping;
	struct page *page;
	pgoff_t hoffset;
	int ret;
	bool try_huge = true;

	do {
		if (try_huge) {
			page = page_cache_alloc_huge(mapping, offset, gfp_mask);
			hoffset = round_down(offset, HPAGE_PMD_NR);

			/* Try to allocate huge page once */
			try_huge = false;
		}

		if (!page) {
			page = __page_cache_alloc(gfp_mask|__GFP_COLD);
			hoffset = offset;
		}

		if (!page)
			return -ENOMEM;

		ret = add_to_page_cache_lru(page, mapping, hoffset,
				gfp_mask & GFP_KERNEL);

		if (ret == -EEXIST && PageTransHuge(page)) {
			/* Retry with small page */
			put_page(page);
			page = NULL;
			continue;
		} else if (ret == 0) {
			ret = mapping->a_ops->readpage(file, page);
		} else if (ret == -EEXIST) {
			ret = 0;	/* losing race to add is OK */
		}

		put_page(page);
	} while (ret == AOP_TRUNCATED_PAGE);

	return ret;


> Anyway, I'm fine with the patch going in as-is. I just wanted to type out
> my review notes.
>
> Reviewed-by: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>

Thanks!

--
Kirill A. Shutemov