Re: RFC: Block reservation for hugetlbfs

From: David Gibson
Date: Tue Feb 21 2006 - 18:44:57 EST


On Tue, Feb 21, 2006 at 11:25:36AM -0800, Dave Hansen wrote:
> On Tue, 2006-02-21 at 13:21 +1100, David Gibson wrote:
> > The patch below is a draft attempt to address this problem, by
> > strictly reserving a number of physical hugepages for hugepages inodes
> > which have been mapped, but not instatiated. MAP_SHARED mappings are
> > thus "safe" - they will fail on mmap(), not later with a SIGBUS.
> > MAP_PRIVATE mappings can still SIGBUS.
> ...
> > + read_lock_irq(&inode->i_mapping->tree_lock);
> > + for (pg = inode->i_blocks; pg < npages; pg++) {
> > + page = radix_tree_lookup(&mapping->page_tree, pg);
> > + if (! page)
> > + change_in_reserve++;
> > + }
> > +
> > + for (pg = npages; pg < inode->i_blocks; pg++) {
> > + page = radix_tree_lookup(&mapping->page_tree, pg);
> > + if (! page)
> > + change_in_reserve--;
> > + }
> > + spin_lock(&hugetlb_lock);
>
> I'm a bit confused by this. The for loops goes through and looks for
> pages that have indexes greater than the current i_blocks, but less than
> the number of pages requested by this reservation. With demand
> faulting, can there be holes in the file? Will this code account for
> them?

There can be holes in terms of which pages are instantiated, and in
terms of what's mapped but with this patch we always reserve space
from the beginning of the file. This is a simplifying assumption, so
that i_blocks alone can store sufficient information about what pages
are pre-reserved. Otherwise we'd either need individual information
about the reservation status of each page - and there's no obvious
place to store it, or we'd have to deduce it in multiple places by
working through each vma which might map the page. Andy Whitcroft's
old hugepage strict reservation patch used the latter approach, I
invented this i_blocks method because it seems to be simpler.

> I think this would also be a bit more clear if the two for loops were a
> bit more explicit in what they are doing. The first is growing the
> reservation when "npages > inode->i_blocks" and the second is shrinking
> the reservation when "npages < inode->i_blocks", right? Is this
> completely clear from the code? ;)

Ok, I'll add some more explanation here.

> Also, since the operation you are performing is actually counting pages
> in the radix tree at certain indexes, would it be sane to have a patch
> such as the (completely untested) attached one to do just that, but in
> the radix code?

Given that the patch below seems to introduce more code that the
above, but scattered across several places, and the hugepage
accounting code would be its only user, I don't see the point.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/