[PATCHv4 00/43] ext4: support of huge pages

From: Kirill A. Shutemov
Date: Mon Oct 24 2016 - 20:19:42 EST


Here's a respin of my huge ext4 patchset on top of v4.9-rc2 with a couple of
fixes (see below).

Please review and consider applying.

I don't see any xfstests regressions with huge pages enabled. A patch with
new configurations for xfstests-bld is below.

The basics are the same as with tmpfs[1], which is in Linus' tree now, and
ext4 is built on top of it. The main difference is that we need to handle
read-out from and write-back to backing storage.

As with other THPs, the implementation is built around compound pages:
a naturally aligned collection of pages that the memory management
subsystem [in most cases] treats as a single entity:

- head page (the first subpage) on LRU represents whole huge page;
- head page's flags represent state of whole huge page (with a few
exceptions);
- mm can't migrate subpages of the compound page individually;

For THP, we use PMD-sized huge pages.
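
To make the sizes concrete, here is a minimal userspace sketch of the
arithmetic. It is not code from the patchset. The shift values are an
assumption (x86-64 with 4k base pages, which matches the 512 entries per
huge page mentioned for tmpfs); the helpers head_index() and
subpage_index() are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Assumed values: x86-64 with 4k base pages. Other configs differ. */
#define PAGE_SHIFT      12
#define PMD_SHIFT       21
#define HPAGE_PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
#define HPAGE_PMD_NR    (1UL << HPAGE_PMD_ORDER)
#define HPAGE_PMD_SIZE  (1UL << PMD_SHIFT)

/* Hypothetical helper: index of the head page for any index in a huge page. */
static unsigned long head_index(unsigned long index)
{
	return index & ~(HPAGE_PMD_NR - 1);
}

/* Hypothetical helper: page-cache index of subpage 'nr' of a huge page. */
static unsigned long subpage_index(unsigned long head, unsigned long nr)
{
	assert((head & (HPAGE_PMD_NR - 1)) == 0); /* head is naturally aligned */
	assert(nr < HPAGE_PMD_NR);
	return head + nr;
}
```

Under these assumptions a huge page covers 512 small pages (2M), and any
index inside it maps back to the same aligned head index.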

Head page links buffer heads for whole huge page. Dirty/writeback/etc.
tracking happens on per-hugepage level as all subpages share the same page
flags.

lock_page() on any subpage would lock whole hugepage for the same reason.

On radix-tree, a huge page is represented as a multi-order entry of the
same order (HPAGE_PMD_ORDER). This allows us to track dirty/writeback on
radix-tree tags with the same granularity as on struct page.
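
A toy model may help to see why one multi-order entry gives a single
coherent tag. This is only a sketch: struct toy_entry and its helpers are
hypothetical and do not reflect the kernel's radix-tree layout, and
HPAGE_PMD_ORDER == 9 assumes x86-64 with 4k base pages.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_ORDER 9 /* assumed: x86-64, 4k base pages */

/* Toy stand-in for one multi-order entry; not the kernel's data structure. */
struct toy_entry {
	unsigned long base;  /* first index covered, aligned to 1 << order */
	unsigned int order;  /* entry covers 1 << order consecutive indices */
	bool dirty;          /* a single tag for the whole range */
};

/* True if 'index' falls inside the range covered by the entry. */
static bool toy_covers(const struct toy_entry *e, unsigned long index)
{
	assert(e->order < sizeof(unsigned long) * 8);
	return (index >> e->order) == (e->base >> e->order);
}

/* Any index inside the range sees the same dirty tag. */
static bool toy_index_dirty(const struct toy_entry *e, unsigned long index)
{
	return toy_covers(e, index) && e->dirty;
}
```

Setting the tag once on the entry is visible from a lookup at any of the
covered indices, which is the same granularity as the flags on the head
page.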

We read out the whole huge page at once. It required bumping BIO_MAX_PAGES
to not less than HPAGE_PMD_NR. I defined BIO_MAX_PAGES to HPAGE_PMD_NR if
huge pagecache is enabled.
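
The effect of the bump can be sketched in userspace as below. This is an
illustration, not the patch itself: TOY_HUGE_PAGECACHE is a hypothetical
switch standing in for the real config option, the value 256 for the stock
limit and 512 for HPAGE_PMD_NR are assumptions (x86-64, 4k base pages), and
bios_needed() is a made-up helper.

```c
#include <assert.h>

#define HPAGE_PMD_NR          512UL /* assumed: x86-64, 4k base pages */
#define DEFAULT_BIO_MAX_PAGES 256UL /* assumed stock limit */

/* Hypothetical switch standing in for the huge pagecache config option. */
#define TOY_HUGE_PAGECACHE 1

#if TOY_HUGE_PAGECACHE
#define BIO_MAX_PAGES HPAGE_PMD_NR
#else
#define BIO_MAX_PAGES DEFAULT_BIO_MAX_PAGES
#endif

/* How many bios are needed to cover nr_pages with a per-bio page limit. */
static unsigned long bios_needed(unsigned long nr_pages, unsigned long max)
{
	assert(max > 0);
	return (nr_pages + max - 1) / max;
}
```

With the assumed stock limit a huge page would not fit into one bio; with
the bumped limit it does.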

On IO via syscalls, we are still limited to copying up to PAGE_SIZE per
iteration. The limitation here comes from how copy_page_to_iter() and
copy_page_from_iter() work wrt. highmem: they can only handle one small
page at a time.
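
The per-iteration limit looks roughly like the loop below. It is a
userspace sketch with memcpy() standing in for the real copy helpers;
copy_by_small_pages() is a hypothetical name and PAGE_SIZE of 4k is an
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096UL /* assumed 4k base pages */

/*
 * Copy 'len' bytes starting at byte offset 'off' of a huge page into 'dst',
 * never crossing a small-page boundary within one memcpy().
 * Returns the number of iterations it took.
 */
static size_t copy_by_small_pages(char *dst, const char *hpage,
				  size_t off, size_t len)
{
	size_t iterations = 0;

	assert(dst && hpage);
	while (len) {
		size_t in_page = off & (PAGE_SIZE - 1);
		size_t chunk = PAGE_SIZE - in_page;

		if (chunk > len)
			chunk = len;
		memcpy(dst, hpage + off, chunk);
		dst += chunk;
		off += chunk;
		len -= chunk;
		iterations++;
	}
	return iterations;
}
```

Copying a full 2M huge page this way takes 512 iterations instead of one,
which is what the first TODO item is about.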

On the write side, we also have a problem with assuming small pages: write
length and offset within the page are calculated before we know if a small
or huge page is allocated. It's not easy to fix. Looks like it would
require a change in the ->write_begin() interface to accept
len > PAGE_SIZE.
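
The small-page assumption can be illustrated with the calculation below,
loosely modelled on what a generic write loop does before it calls
->write_begin(). It is a sketch only: struct write_chunk and chunk_for()
are hypothetical, and the 4k/2M sizes in the usage note are assumptions.

```c
#include <assert.h>

/* Hypothetical result type: offset within the page and bytes to write. */
struct write_chunk {
	unsigned long offset;
	unsigned long bytes;
};

/*
 * Split a write at file position 'pos' of 'count' bytes against a page of
 * 'unit' bytes. 'unit' must be a power of two.
 */
static struct write_chunk chunk_for(unsigned long long pos,
				    unsigned long count, unsigned long unit)
{
	struct write_chunk c;

	assert(unit && (unit & (unit - 1)) == 0);
	c.offset = (unsigned long)(pos & (unit - 1));
	c.bytes = unit - c.offset;
	if (c.bytes > count)
		c.bytes = count;
	return c;
}
```

With unit fixed at 4k, a 1M write at position 12388 is cut to 3996 bytes
for the first call; if the page turns out to be a 2M huge page, the whole
1M would have fit, but the length has already been chosen.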

On split_huge_page() we need to free buffers before splitting the page.
Page buffers take an additional pin on the page and can be a vector to mess
with the page during split. We want to avoid this.
If try_to_release_page() fails, split_huge_page() returns -EBUSY.
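
The ordering can be sketched with a toy model. None of this is the
kernel's code: struct toy_page, toy_release_buffers() and
toy_split_huge_page() are hypothetical stand-ins that only show the
"release buffers first, bail out with -EBUSY otherwise" flow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy page state; not the kernel's struct page. */
struct toy_page {
	bool huge;
	bool has_buffers;
	bool buffers_busy; /* e.g. a buffer is still in use */
	int pins;          /* buffers hold one extra pin while attached */
};

/* Returns true if the buffers were dropped (or there were none). */
static bool toy_release_buffers(struct toy_page *p)
{
	assert(p);
	if (!p->has_buffers)
		return true;
	if (p->buffers_busy)
		return false;
	p->has_buffers = false;
	p->pins--; /* drop the pin the buffers were holding */
	return true;
}

/* Split only after the buffers, and their pin, are gone. */
static int toy_split_huge_page(struct toy_page *p)
{
	assert(p && p->huge);
	if (!toy_release_buffers(p))
		return -EBUSY;
	p->huge = false;
	return 0;
}
```

A caller that gets -EBUSY keeps the huge page intact and can retry later.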

Readahead doesn't play with huge pages well: 128k max readahead window,
assumption on page size, PageReadahead() to track hit/miss. I've got it
to allocate huge pages, but it doesn't provide any readahead as such.
I don't know how to do this right. It's not clear at this point if we
really need readahead with huge pages. I guess it's good enough for now.

Shadow entries are ignored on allocation -- a recently evicted page is not
promoted to the active list. Not sure if the current workingset logic is
adequate for huge pages. On eviction, we split the huge page and set up 4k
shadow entries as usual.

Unlike tmpfs, ext4 makes use of tags in radix-tree. The approach I used
for tmpfs -- 512 entries in radix-tree per huge page -- doesn't work well
if we want to have a coherent view on tags. So the first 8 patches of the
patchset convert tmpfs to use multi-order entries in radix-tree.
The same infrastructure is used for ext4.

Encryption doesn't handle huge pages yet. To avoid regressions we just
disable huge pages for the inode if it has EXT4_INODE_ENCRYPT.

Tested with 4k and 1k block sizes, encryption and bigalloc. All with and
without huge=always. I think it's reasonable coverage.

The patchset is also in git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugeext4/v4

[1] http://lkml.kernel.org/r/1465222029-45942-1-git-send-email-kirill.shutemov@xxxxxxxxxxxxxxx

Changes since v3:
- account huge page to dirty/writeback/reclaimable/etc. according to its
size. It fixes background writeback.
- move code that adds huge page to radix-tree to
page_cache_tree_insert() (Jan);
- make ramdisk work with huge pages;
- fix unaccounting of shadow entries (Jan);
- use try_to_release_page() instead of try_to_free_buffers() in
split_huge_page() (Jan);
- make thp_get_unmapped_area() respect S_HUGE_MODE;
- use huge-page aligned address to zap page range in wp_huge_pmd();
- use ext4_kvmalloc in ext4_mpage_readpages() instead of
kmalloc() (Andreas);

Changes since v2:
- fix intermittent crash in generic/299;
- fix typo (condition inversion) in do_generic_file_read(),
reported by Jitendra;

TODO:
- on IO via syscalls, copy more than PAGE_SIZE per iteration to/from
userspace;
- readahead ?;
- wire up madvise()/fadvise();
- encryption with huge pages;
- reclaim of file huge pages can be optimized -- split_huge_page() is not
required for pages with backing storage;