Re: [RFC] free_pages stuff

From: Al Viro
Date: Tue Dec 22 2015 - 16:04:44 EST


On Tue, Dec 22, 2015 at 09:21:14AM +0100, Geert Uytterhoeven wrote:
> On Tue, Dec 22, 2015 at 3:22 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >> And if the code really explicitly wants a page (or set of aligned pages)
> >> for some vm reason, I suspect having the cast there isn't a bad thing. It's
> >> clearly not just a random pointer allocation if the bit pattern of the
> >> pointer matters.
> >
> > BTW, I'm not sure we don't have code that would assume that
> > kmalloc(PAGE_SIZE,...) always returns something PAGE_SIZE-aligned.
>
> Yeah, needs-to-be-PAGE_SIZE-aligned is probably one of the main
> reasons of not calling kmalloc().

FWIW, looking through the fs/* uses of __get_free_pages() and its wrappers...
* affs_grow_extcache() - might as well have been kmalloc (and
I'm not sure that fixed PAGE_SIZE is the best size there, actually).
* afs_mntpt_do_automount() - kmalloc().
* bfs_dump_imap() - kmalloc().
* bm_entry_read() - kmalloc(), and might be better off with
single_open-style seqfile.
* ceph_alloc_readdir_reply_buffer() - kmalloc().
* configfs fill_write_buffer() - kmalloc(), and it might be worth
a helper similar to memdup_user() that would allocate size + 1 bytes,
copy size bytes from user and put '\0' after them. There's a lot of
->write() instances open-coding that.
* configfs fill_read_buffer() - kmalloc().
* configfs_follow_link() - kmalloc().
* ext4_calculate_overhead() - kmalloc().
* fuse_follow_link() - kmalloc().
* fuse_do_ioctl() - kmalloc(), and it might as well do more accurate
size calculation
* hfs_mdb_get() - 8Kb kmalloc().
* isofs_readdir() - kmalloc().
* jbd2_alloc() - really wants alignment; might make sense to have
3 more private slabs there (they are using kmem_cache_alloc() for sub-page
allocation, with explicit alignments set when creating those; anything
from 8 pages and up goes to vmalloc()).
* jfs_readdir() - kmalloc().
* jfs lbmLogInit() - alloc_page(). _This_ is one that really wants
struct page (and uses virt_to_page() to get it after get_zeroed_page()).
* kernfs_iop_follow_link() - kmalloc().
* simple_transaction_get() - kmalloc().
* copy_mount_options() - kmalloc().
* nfs_do_submount() - kmalloc().
* nfs_follow_referral() - kmalloc().
* nfs4_replace_transport() - kmalloc().
* nfs_show_devname() - kmalloc().
* nfsd_buffered_readdir() - kmalloc().
* nilfs_ioctl_wrap_copy() - kmalloc() (and I'm not sure that PAGE_SIZE
is the best possible variant there).
* fs/ocfs2/dlm/dlmdebug.c ones - kmalloc(), all of them.
* dlm_alloc_pagevec() - used to allocate hash tables. kmalloc() will
definitely do, but I would consider using a single kmalloc or vmalloc instead
of an array of page-sized blocks. Guaranteed extra dereference on every
hash lookup is potentially painful.
* dlm_migrate_lockres() - kmalloc().
* dlm_request_all_locks_handler() - kmalloc().
* ovl_read_symlink() - kmalloc().
* do_proc_readlink() - kmalloc().
* proc_pid_cmdline_read() - kmalloc().
* proc_pid_attr_write() - memdup_user().
* mem_rw() - kmalloc().
* environ_read() - kmalloc().
* dquot_init() - hash allocation; alloc_large_system_hash(), unless
there's something I'm missing...
* poll_get_entry() - kmalloc(), probably.
* fs/proc/vmcore.c - interesting one; it allocates a buffer to
store elf headers from crashdump, and everything would be simple, expect
for mmap() support in there. Which wants it to be page-aligned and uses
remap_pfn_range() from that sucker.

IOW, there is one place that can't live with kmalloc() due to alignment
requirements (jbd2_alloc()), one place that wants struct page * (in jfs)
and one place that wants page-aligned buffer it will feed to remap_pfn_range().
And 35 places that have no good reason for using __get_free_pages() or
its wrappers.

So at least for fs/* the answer is definitely "almost all places
where we are using page allocator would be better off with something
else".

Documentation/which-allocator-should-I-use might be a good idea... Notes
below are just a skeleton - a lot of details need to be added; in particular,
there should be a part on "I have this kind of address and I want that;
when and how should that be done?", completely missing here. And there
should be a big scary warning along the lines of "this is NOT an invitation
for a flood of checkpatch-inspired patches"...

Comments, corrections and additions would be very welcome.

1) Most of the time kmalloc() is the right thing to use.
Limitations: alignment is no better than word, not available very early in
bootstrap, allocated memory is physically contiguous, so large allocations
are best avoided.

2) kmem_cache_alloc() allows to specify the alignment at cache creation
time. Otherwise it's similar to kmalloc(). Normally it's used for
situations where we have a lot of instances of some type and want dynamic
allocation of those.

3) vmalloc() is for large allocations. They will be page-aligned,
but *not* physically contiguous. OTOH, large physically contiguous
allocations are generally a bad idea. Unlike other allocators, there's
no variant that could be used in interrupt; freeing is possible there,
but allocation is not. Note that non-blocking variant *does* exist -
__vmalloc(size, GFP_ATOMIC, PAGE_KERNEL) can be used in atomic
contexts; it's the interrupt ones that are no-go.

4) if it's very early in bootstrap, alloc_bootmem() and friends
may be the only option. Rule of the thumb: if it's already printed
Memory: ...../..... available.....
you shouldn't be using that one. Allocations are physically contiguous
and at that point large physically contiguous allocations are still OK.

5) if you need to allocate memory for DMA, use dma_alloc_coherent()
and friends. They'll give you both the virtual address for your use
and DMA address refering to the same memory for use by device; do *NOT*
try to derive the latter from the former; use of virt_to_bus() et.al.
is a Bloody Bad Idea(tm).

6) if you need a reference to struct page, use alloc_page/alloc_pages.

7) in some cases (page tables, for the most obvious example), __get_free_page()
and friends might be the right answer. In principle, it's case (6), but
it returns page_address(page) instead of the page itself. Historically that
was the first API introduced, so a _lot_ of places that should've been using
something else ended up using that. Do not assume that being lower level
makes it faster than e.g. kmalloc() - this is simply not true.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/