Re: [RFC PATCH] xfs: support for non-mmu architectures

From: Dave Chinner
Date: Fri Nov 20 2015 - 15:36:25 EST

On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote:
> On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> > On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
> > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> > > > Naive implementation for non-mmu architectures: allocate physically
> > > > contiguous xfs buffers with alloc_pages. Terribly inefficient with
> > > > memory and fragmentation on high I/O loads but it may be good enough
> > > > for basic usage (which most non-mmu architectures will need).
> > > >
> > > > This patch was tested with lklfuse [1] and basic operations seem to
> > > > work even with 16MB allocated for LKL.
> > > >
> > > > [1]
> > > >
> > > > Signed-off-by: Octavian Purdila <octavian.purdila@xxxxxxxxx>
> > > > ---
> > >
> > > Interesting, though this makes me wonder why we couldn't have a new
> > > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> > > familiar with mmu-less context, but I see that mm/nommu.c has a
> > > __vmalloc() interface that looks like it ultimately translates into an
> > > alloc_pages() call. Would that accomplish what this patch is currently
> > > trying to do?
> >
> > vmalloc is always a last resort. vmalloc space on 32 bit systems is
> > extremely limited and it is easy to exhaust with XFS.
> >
> Sure, but my impression is that a vmalloc() buffer is roughly equivalent
> in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We
> just do the allocation and mapping separately (presumably for other
> reasons).

Yes, it's always a last resort. We don't use vmap'd buffers very
much on block size <= page size filesystems (e.g. iclog buffers are
the main user in such cases, IIRC), so the typical 32 bit
system doesn't have major problems with vmalloc space. However, the
moment you increase the directory block size > block size, that all
goes out the window...

> > Also, vmalloc limits the control we have over allocation context
> > (e.g. the hoops we jump through in kmem_alloc_large() to maintain
> > GFP_NOFS contexts), so just using vmalloc doesn't make things much
> > simpler from an XFS perspective.
> >
> The comment in kmem_zalloc_large() calls out some apparent hardcoded
> allocation flags down in the depths of vmalloc(). It looks to me that
> page allocation (__vmalloc_area_node()) actually uses the provided
> flags, so I'm not following the "data page" part of that comment.

You can pass gfp flags for the page allocation part of vmalloc, but
not the pte allocation part of it. That's what the hacks in
kmem_zalloc_large() are doing.
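
As a rough userspace model (not kernel code) of why a task-scoped
override helps when a deep callee hardcodes its flags: the sketch below
assumes a per-task flag in the style of memalloc_noio_save(), which is
applied to whatever flags the callee asked for. All sk_* names and flag
values are made up for illustration.

```c
#include <assert.h>

/* Hypothetical flag bits, loosely modelled on gfp flags. */
#define SK_GFP_IO      0x1u
#define SK_GFP_FS      0x2u
#define SK_GFP_KERNEL  (SK_GFP_IO | SK_GFP_FS)

/* Stand-in for a per-task "no IO/FS recursion" flag (assumption). */
static unsigned int sk_task_noio;

/* Set the task flag and return the previous state so scopes can nest. */
static unsigned int sk_noio_save(void)
{
	unsigned int old = sk_task_noio;

	sk_task_noio = 1;
	return old;
}

/* Put the task flag back to what it was before the matching save. */
static void sk_noio_restore(unsigned int old)
{
	sk_task_noio = old;
}

/* What the allocator honours after the task flag has been applied. */
static unsigned int sk_effective_flags(unsigned int flags)
{
	if (sk_task_noio)
		flags &= ~(SK_GFP_IO | SK_GFP_FS);
	return flags;
}

/* Deep callee that ignores the caller's flags, like a pte allocation. */
static unsigned int sk_pte_alloc_flags(void)
{
	return sk_effective_flags(SK_GFP_KERNEL);
}

/* Usage: the hardcoded flags are only restricted inside the scope. */
static void sk_demo(void)
{
	unsigned int old;

	assert(sk_pte_alloc_flags() == SK_GFP_KERNEL);
	old = sk_noio_save();
	assert(sk_pte_alloc_flags() == 0);
	sk_noio_restore(old);
	assert(sk_pte_alloc_flags() == SK_GFP_KERNEL);
}
```

The caller cannot change what the callee passes, so the restriction has
to travel with the task rather than with the function arguments.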

> Indeed, I do see that this is not the case down in calls like
> pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page
> table management.


> Those latter calls are all from following down through the
> map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> down into the same code. Indeed, we already protect ourselves here via
> the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.

Yes, we do, but that is handled separately from the allocation of the
pages, which we have to do for all types of buffers, mapped or
unmapped, because xfs_buf_ioapply_map() requires direct access to
the underlying pages to build the bio for IO. If we delegate the
allocation of pages to vmalloc, we don't have direct reference to
the underlying pages and so we have to do something completely
different to build the bios for the buffer....
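
A simplified userspace model of that dependency (not the real
xfs_buf_ioapply_map(); the sk_* structures and names are hypothetical):
the IO builder walks the buffer's own page array and emits one segment
per page, which only works if the buffer holds references to the
individual pages.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096u
#define SK_MAX_PAGES 16

/* Hypothetical buffer that tracks each backing page individually. */
struct sk_buf {
	void *pages[SK_MAX_PAGES];
	unsigned int page_count;
	size_t length;			/* bytes of data in the buffer */
};

/* One IO segment: a page plus the byte range used within it. */
struct sk_seg {
	void *page;
	size_t offset;
	size_t len;
};

/* Walk the page array and fill segs[]; returns the segments built. */
static unsigned int sk_build_segments(const struct sk_buf *bp,
				      struct sk_seg *segs,
				      unsigned int max_segs)
{
	size_t remaining = bp->length;
	unsigned int n = 0;
	unsigned int i;

	for (i = 0; i < bp->page_count && remaining > 0 && n < max_segs; i++) {
		size_t len = remaining < SK_PAGE_SIZE ? remaining : SK_PAGE_SIZE;

		segs[n].page = bp->pages[i];
		segs[n].offset = 0;
		segs[n].len = len;
		remaining -= len;
		n++;
	}
	return n;
}

/* Usage: a 10000 byte buffer spread over three separate pages. */
static void sk_demo_segments(void)
{
	static char p0[SK_PAGE_SIZE], p1[SK_PAGE_SIZE], p2[SK_PAGE_SIZE];
	struct sk_buf bp = { { p0, p1, p2 }, 3, 10000 };
	struct sk_seg segs[4];

	assert(sk_build_segments(&bp, segs, 4) == 3);
	assert(segs[1].page == (void *)p1);
	assert(segs[2].len == 10000 - 2 * SK_PAGE_SIZE);
}
```

With only a single virtual address for the whole region, the builder
would first have to translate that address back into its pages.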

> I suspect there's more to it than that because it does look like
> vm_map_ram() has a different mechanism for managing vmalloc space for
> certain (smaller) allocations, either of which I'm not really familiar
> with.

Yes, it manages vmalloc space quite differently, and there are
different scalability aspects to consider as well - vm_map_ram was
pretty much written for the use XFS has in xfs_buf.c...


Dave Chinner