Re: [PATCH] Do not account for the address space used by hugetlbfs using VM_ACCOUNT V2 (Was Linus 2.6.29-rc4)

From: Mel Gorman
Date: Wed Feb 11 2009 - 05:30:22 EST


On Wed, Feb 11, 2009 at 09:43:29AM +0000, Andy Whitcroft wrote:
> On Tue, Feb 10, 2009 at 02:02:27PM +0000, Mel Gorman wrote:
> > On Sun, Feb 08, 2009 at 01:01:04PM -0800, Linus Torvalds wrote:
> > >
> > > Another week (and a half), another -rc.
> > >
> > > Arch updates (sparc, blackfin), driver updates (dvb, mmc, ide), ACPI
> > > updates, FS updates (ubifs, btrfs), you name it. It's all there.
> > >
> > > But more importantly, people really have been working on regressions, and
> > > hopefully this closes a nice set of the top ones, and hopefully without
> > > introducing too many new ones.
> > >
> >
> > Hugepages are currently busted due to commit
> > fc8744adc870a8d4366908221508bb113d8b72ee and the problem is in
> > 2.6.29-rc4. There was a bit of a discussion but it didn't get very far and
> > then I went offline for the weekend. Here is a revised patch that tries to
> > bring hugetlbfs more in line with what the core VM is doing by dealing with
> > VM_ACCOUNT and VM_NORESERVE.
> >
> > =============
> > Do not account for the address space used by hugetlbfs using VM_ACCOUNT
> >
> > When overcommit is disabled, the core VM accounts for pages used by anonymous
> > shared, private mappings and special mappings. It keeps track of VMAs that
> > should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
> > with VM_NORESERVE.
> >
> > Overcommit for hugetlbfs is much riskier than overcommit for base pages
> > due to contiguity requirements. It avoids overcommitting on both shared and
> > private mappings using reservation counters that are checked and updated
> > during mmap(). This ensures (within limits) that hugepages exist in the
> > future when faults occur; otherwise it is too easy for applications to be
> > SIGKILLed.
> >
> > As hugetlbfs makes its own reservations in a different unit from the base page
> > size, VM_ACCOUNT should never be set. Even if the units were correct, we would
> > double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
> > be set because an application can request no reserves be made for hugetlbfs
> > at the risk of getting killed later.
> >
> > With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
> > VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
> > breaks the accounting for both the core VM and hugetlbfs, can trigger an
> > OOM storm when hugepage pools are too small, and leads to lockups and
> > corrupted counters otherwise. This patch brings hugetlbfs more in line
> > with how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
> >
> > Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
> > ---
> > fs/hugetlbfs/inode.c | 8 +++++---
> > include/linux/hugetlb.h | 5 +++--
> > include/linux/mm.h | 3 +--
> > ipc/shm.c | 8 +++++---
> > mm/fremap.c | 2 +-
> > mm/hugetlb.c | 39 +++++++++++++++++++++++++--------------
> > mm/mmap.c | 38 ++++++++++++++++++++++----------------
> > mm/mprotect.c | 5 +++--
> > 8 files changed, 65 insertions(+), 43 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 6903d37..9b800d9 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -108,7 +108,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> >
> > if (hugetlb_reserve_pages(inode,
> > vma->vm_pgoff >> huge_page_order(h),
> > - len >> huge_page_shift(h), vma))
> > + len >> huge_page_shift(h), vma,
> > + vma->vm_flags))
> > goto out;
> >
> > ret = 0;
> > @@ -947,7 +948,7 @@ static int can_do_hugetlb_shm(void)
> > can_do_mlock());
> > }
> >
> > -struct file *hugetlb_file_setup(const char *name, size_t size)
> > +struct file *hugetlb_file_setup(const char *name, size_t size, int acctflag)
> > {
> > int error = -ENOMEM;
> > struct file *file;
> > @@ -981,7 +982,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
> >
> > error = -ENOMEM;
> > if (hugetlb_reserve_pages(inode, 0,
> > - size >> huge_page_shift(hstate_inode(inode)), NULL))
> > + size >> huge_page_shift(hstate_inode(inode)), NULL,
> > + acctflag))
> > goto out_inode;
> >
> > d_instantiate(dentry, inode);
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index f1d2fba..af09660 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -33,7 +33,8 @@ unsigned long hugetlb_total_pages(void);
> > int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long address, int write_access);
> > int hugetlb_reserve_pages(struct inode *inode, long from, long to,
> > - struct vm_area_struct *vma);
> > + struct vm_area_struct *vma,
> > + int acctflags);
> > void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
> >
> > extern unsigned long hugepages_treat_as_movable;
> > @@ -138,7 +139,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
> >
> > extern const struct file_operations hugetlbfs_file_operations;
> > extern struct vm_operations_struct hugetlb_vm_ops;
> > -struct file *hugetlb_file_setup(const char *name, size_t);
> > +struct file *hugetlb_file_setup(const char *name, size_t, int);
> > int hugetlb_get_quota(struct address_space *mapping, long delta);
> > void hugetlb_put_quota(struct address_space *mapping, long delta);
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e8ddc98..3235615 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1129,8 +1129,7 @@ extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> > unsigned long flag, unsigned long pgoff);
> > extern unsigned long mmap_region(struct file *file, unsigned long addr,
> > unsigned long len, unsigned long flags,
> > - unsigned int vm_flags, unsigned long pgoff,
> > - int accountable);
> > + unsigned int vm_flags, unsigned long pgoff);
> >
> > static inline unsigned long do_mmap(struct file *file, unsigned long addr,
> > unsigned long len, unsigned long prot,
> > diff --git a/ipc/shm.c b/ipc/shm.c
> > index f8f69fa..05d51d2 100644
> > --- a/ipc/shm.c
> > +++ b/ipc/shm.c
> > @@ -340,6 +340,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> > struct file * file;
> > char name[13];
> > int id;
> > + int acctflag = 0;
> >
> > if (size < SHMMIN || size > ns->shm_ctlmax)
> > return -EINVAL;
> > @@ -364,11 +365,12 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> >
> > sprintf (name, "SYSV%08x", key);
> > if (shmflg & SHM_HUGETLB) {
> > - /* hugetlb_file_setup takes care of mlock user accounting */
> > - file = hugetlb_file_setup(name, size);
> > + /* hugetlb_file_setup applies strict accounting */
> > + if (shmflg & SHM_NORESERVE)
> > + acctflag = VM_NORESERVE;
> > + file = hugetlb_file_setup(name, size, acctflag);
>
> Does this change semantics? At first look it appears that you are now
> implementing SHM_NORESERVE for SHM_HUGETLB where we did not previously
> do so? Not that this is necessarily a bad thing, it just feels like it's
> changing.
>

Yes, semantics for shared mappings are changed so that they obey
SHM_NORESERVE. This was part of bringing hugetlbfs closer to the
behaviour of the core VM but I should have called it out better in the
changelog.
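
To illustrate the change from userspace, something along these lines would
now skip the up-front hugepage reservation (illustrative only, not part of
the patch, and assuming the libc headers expose the SHM_* flags):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_NORESERVE
#define SHM_NORESERVE 010000	/* value from <linux/shm.h> */
#endif

int main(void)
{
	/*
	 * Request a hugetlb-backed segment without reserving hugepages
	 * at setup time. The application risks being killed at fault
	 * time if the pool is exhausted, the same trade-off as
	 * MAP_NORESERVE with mmap() on hugetlbfs.
	 */
	int id = shmget(IPC_PRIVATE, 16UL << 20,
			SHM_HUGETLB | SHM_NORESERVE | IPC_CREAT | 0600);

	if (id < 0)
		perror("shmget");
	return 0;
}

Previously the SHM_NORESERVE bit was simply ignored for SHM_HUGETLB segments.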

> > shp->mlock_user = current_user();
> > } else {
> > - int acctflag = 0;
> > /*
> > * Do not allow no accounting for OVERCOMMIT_NEVER, even
> > * if it's asked for.
> > diff --git a/mm/fremap.c b/mm/fremap.c
> > index 736ba7f..b6ec85a 100644
> > --- a/mm/fremap.c
> > +++ b/mm/fremap.c
> > @@ -198,7 +198,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
> > flags &= MAP_NONBLOCK;
> > get_file(file);
> > addr = mmap_region(file, start, size,
> > - flags, vma->vm_flags, pgoff, 1);
> > + flags, vma->vm_flags, pgoff);
> > fput(file);
> > if (IS_ERR_VALUE(addr)) {
> > err = addr;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 618e983..2074642 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2269,14 +2269,12 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
> >
> > int hugetlb_reserve_pages(struct inode *inode,
> > long from, long to,
> > - struct vm_area_struct *vma)
> > + struct vm_area_struct *vma,
> > + int acctflag)
> > {
> > - long ret, chg;
> > + long ret = 0, chg;
> > struct hstate *h = hstate_inode(inode);
> >
> > - if (vma && vma->vm_flags & VM_NORESERVE)
> > - return 0;
> > -
> > /*
> > * Shared mappings base their reservation on the number of pages that
> > * are already allocated on behalf of the file. Private mappings need
> > @@ -2285,22 +2283,25 @@ int hugetlb_reserve_pages(struct inode *inode,
> > */
> > if (!vma || vma->vm_flags & VM_SHARED)
> > chg = region_chg(&inode->i_mapping->private_list, from, to);
>
> I thought the region map for a VM_SHARED mapping is meant to contain
> those pages for which we already have a reservation allocated. So that
> if we have overlapping VM_RESERVE and VM_NORESERVE mappings we know that
> we did have a page reserved at fault time and know whether we can take
> it from the reserve portion of the pool. By letting this get executed
> for VM_NORESERVE mappings that would seem to be getting out of sync,
> which doesn't sound right.

For NORESERVE, region_chg() still runs to calculate the reservation, but
region_add() never gets called to commit it.
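
Spelling out the flow for a shared mapping with this patch applied
(condensed from the patched hugetlb_reserve_pages() above, error paths
trimmed):

	chg = region_chg(&inode->i_mapping->private_list, from, to);

	if (hugetlb_get_quota(inode->i_mapping, chg))
		return -ENOSPC;

	/* NORESERVE returns here, before anything is committed */
	if (acctflag & VM_NORESERVE) {
		reset_vma_resv_huge_pages(vma);
		return 0;
	}

	ret = hugetlb_acct_memory(h, chg);
	...
	region_add(&inode->i_mapping->private_list, from, to);

region_chg() only calculates what would be needed; it is region_add() that
records the reservation against the file.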

> I don't see any updates to the documentation
> on the meaning of the reservation map if you are changing it?
>
> Here is the commentary from mm/hugetlb.c:
>
> * The private mapping reservation is represented in a subtly different
> * manner to a shared mapping. A shared mapping has a region map associated
> * with the underlying file, this region map represents the backing file
> * pages which have ever had a reservation assigned which this persists even
> * after the page is instantiated. A private mapping has a region map
> * associated with the original mmap which is attached to all VMAs which
> * reference it, this region map represents those offsets which have consumed
> * reservation ie. where pages have been instantiated.
>

This is still accurate.

> > - else {
> > - struct resv_map *resv_map = resv_map_alloc();
> > - if (!resv_map)
> > - return -ENOMEM;
> > -
> > + else
> > chg = to - from;
> >
> > - set_vma_resv_map(vma, resv_map);
> > - set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> > - }
> > -
> > if (chg < 0)
> > return chg;
> >
> > if (hugetlb_get_quota(inode->i_mapping, chg))
> > return -ENOSPC;
> > +
> > + /*
> > + * Only apply hugepage reservation if asked. We still have to
> > + * take the filesystem quota because it is an upper limit
> > + * defined for the mount and not necessarily memory as a whole
> > + */
> > + if (acctflag & VM_NORESERVE) {
> > + reset_vma_resv_huge_pages(vma);
>
> Why do we now need to do this in the non-reserve case? We didn't need
> to do it before.
>

True, it was largely defensive, against anything left in page_private that
would make it look like the mapping had reserves.

> > + return 0;
> > + }
> > +
>
> This also seems like a semantic change. Previously NO_RESERVE did not
> take quota, now it does. NO_RESERVE used to mean that we took our
> chances on there being pages available at fault time both for quota and
> in the pools. Now it means that we only risk there being no pages.
> Does that not significantly change semantics.

Yes, and this was a mistake. For noreserve mappings, we may now be taking
twice the amount of quota and probably leaking it. This is wrong and I need
to move the check for quota below the check for VM_NORESERVE. Good spot.
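
Roughly, the reordering for V3 would be (untested sketch against the
function as posted above):

	if (chg < 0)
		return chg;

	/* Bail out for NORESERVE before any quota is taken */
	if (acctflag & VM_NORESERVE)
		return 0;

	if (hugetlb_get_quota(inode->i_mapping, chg))
		return -ENOSPC;

	ret = hugetlb_acct_memory(h, chg);
	if (ret < 0) {
		hugetlb_put_quota(inode->i_mapping, chg);
		return ret;
	}

so a noreserve mapping is only charged quota once, at fault time.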

> The main use case for
> NO_RESERVE was mapping a massive area to expand into; this might now be
> refused for quota when previously it was not.
>
> > ret = hugetlb_acct_memory(h, chg);
> > if (ret < 0) {
> > hugetlb_put_quota(inode->i_mapping, chg);
> > @@ -2308,6 +2309,16 @@ int hugetlb_reserve_pages(struct inode *inode,
> > }
> > if (!vma || vma->vm_flags & VM_SHARED)
> > region_add(&inode->i_mapping->private_list, from, to);
> > + else {
> > + struct resv_map *resv_map = resv_map_alloc();
> > +
> > + if (!resv_map)
> > + return -ENOMEM;
> > +
> > + set_vma_resv_map(vma, resv_map);
> > + set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> > + }
> > +
> > return 0;
> > }
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 214b6a2..6f70db8 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -918,7 +918,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> > struct inode *inode;
> > unsigned int vm_flags;
> > int error;
> > - int accountable = 1;
> > unsigned long reqprot = prot;
> >
> > /*
> > @@ -1019,8 +1018,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> > return -EPERM;
> > vm_flags &= ~VM_MAYEXEC;
> > }
> > - if (is_file_hugepages(file))
> > - accountable = 0;
> >
> > if (!file->f_op || !file->f_op->mmap)
> > return -ENODEV;
> > @@ -1053,8 +1050,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> > if (error)
> > return error;
> >
> > - return mmap_region(file, addr, len, flags, vm_flags, pgoff,
> > - accountable);
> > + return mmap_region(file, addr, len, flags, vm_flags, pgoff);
> > }
> > EXPORT_SYMBOL(do_mmap_pgoff);
> >
> > @@ -1092,17 +1088,23 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
> >
> > /*
> > * We account for memory if it's a private writeable mapping,
> > - * and VM_NORESERVE wasn't set.
> > + * not hugepages and VM_NORESERVE wasn't set.
> > */
> > -static inline int accountable_mapping(unsigned int vm_flags)
> > +static inline int accountable_mapping(struct file *file, unsigned int vm_flags)
> > {
> > + /*
> > + * hugetlb has its own accounting separate from the core VM
> > + * VM_HUGETLB may not be set yet so we cannot check for that flag.
> > + */
> > + if (file && is_file_hugepages(file))
> > + return 0;
> > +
> > return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
> > }
> >
> > unsigned long mmap_region(struct file *file, unsigned long addr,
> > unsigned long len, unsigned long flags,
> > - unsigned int vm_flags, unsigned long pgoff,
> > - int accountable)
> > + unsigned int vm_flags, unsigned long pgoff)
> > {
> > struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma, *prev;
> > @@ -1128,18 +1130,22 @@ munmap_back:
> >
> > /*
> > * Set 'VM_NORESERVE' if we should not account for the
> > - * memory use of this mapping. We only honor MAP_NORESERVE
> > - * if we're allowed to overcommit memory.
> > + * memory use of this mapping.
> > */
> > - if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> > - vm_flags |= VM_NORESERVE;
> > - if (!accountable)
> > - vm_flags |= VM_NORESERVE;
> > + if ((flags & MAP_NORESERVE)) {
> > + /* We honor MAP_NORESERVE if allowed to overcommit */
> > + if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> > + vm_flags |= VM_NORESERVE;
> > +
> > + /* hugetlb applies strict overcommit unless MAP_NORESERVE */
> > + if (file && is_file_hugepages(file))
> > + vm_flags |= VM_NORESERVE;
> > + }
> >
> > /*
> > * Private writable mapping: check memory availability
> > */
> > - if (accountable_mapping(vm_flags)) {
> > + if (accountable_mapping(file, vm_flags)) {
> > charged = len >> PAGE_SHIFT;
> > if (security_vm_enough_memory(charged))
> > return -ENOMEM;
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index abe2694..258197b 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -151,10 +151,11 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> > /*
> > * If we make a private mapping writable we increase our commit;
> > * but (without finer accounting) cannot reduce our commit if we
> > - * make it unwritable again.
> > + * make it unwritable again. hugetlb mappings were accounted for
> > + * even if read-only, so there is no need to account for them here
> > */
> > if (newflags & VM_WRITE) {
> > - if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
> > + if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
> > VM_SHARED|VM_NORESERVE))) {
> > charged = nrpages;
> > if (security_vm_enough_memory(charged))
> >
>

Thanks, I'll get the quota thing fixed up.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab