Re: RFC: Block reservation for hugetlbfs

From: Nick Piggin
Date: Tue Feb 21 2006 - 02:49:10 EST


David Gibson wrote:
These days, hugepages are demand-allocated at first fault time.
There's a somewhat dubious (and racy) heuristic when making a new
mmap() to check if there are enough available hugepages to fully
satisfy that mapping.

A particularly obvious case where the heuristic breaks down is where a
process maps its hugepages not as a single chunk, but as a bunch of
individually mmap()ed (or shmat()ed) blocks without touching and
instantiating the blocks in between allocations. In this case the
size of each block is compared against the total number of available
hugepages. It's thus easy for the process to become overcommitted,
because each block mapping will succeed, although the total number of
hugepages required by all blocks exceeds the number available. In
particular, this defeats such a program which will detect a mapping
failure and adjust its hugepage usage downward accordingly.

The patch below is a draft attempt to address this problem, by
strictly reserving a number of physical hugepages for hugepages inodes
which have been mapped, but not instatiated. MAP_SHARED mappings are
thus "safe" - they will fail on mmap(), not later with a SIGBUS.
MAP_PRIVATE mappings can still SIGBUS.

This patch appears to address the problem at hand - it allows DB2 to
start correctly, for instance, which previously suffered the failure
described above. I'm almost certain I'm missing some locking or other
synchronization - I am entirely bewildered as to what I need to hold
to safely update i_blocks as below. Corrections for my ignorance
solicited...

Signed-off-by: David Gibson <dwg@xxxxxxxxxxx>


This introduces
tree_lock(r) -> hugetlb_lock

And we already have
hugetlb_lock -> lru_lock

So we now have tree_lock(r) -> lru_lock, which would deadlock
against lru_lock -> tree_lock(w), right?

From a quick glance it looks safe, but I'd _really_ rather not
introduce something like this.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com -
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/