Re: [PATCHv2, 00/41] ext4: support of huge pages

From: Andreas Dilger
Date: Sun Aug 14 2016 - 04:25:57 EST


On Aug 12, 2016, at 12:37 PM, Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
>
> Here's a stabilized version of my patchset, which is intended to bring huge
> pages to ext4.
>
> The basics are the same as with tmpfs[1], which is in Linus' tree now, and
> ext4 is built on top of it. The main difference is that we need to handle
> read out from and write-back to backing storage.
>
> The head page links buffers for the whole huge page. Dirty/writeback
> tracking happens at the per-hugepage level.
>
> We read out the whole huge page at once. This required bumping
> BIO_MAX_PAGES to not less than HPAGE_PMD_NR. I defined BIO_MAX_PAGES to
> HPAGE_PMD_NR if huge pagecache is enabled.
>
> On split_huge_page() we need to free buffers before splitting the page.
> Page buffers take an additional pin on the page and can be a vector to mess
> with the page during split. We want to avoid this.
> If try_to_free_buffers() fails, split_huge_page() would return -EBUSY.
>
> Readahead doesn't play with huge pages well: 128k max readahead window,
> assumption on page size, PageReadahead() to track hit/miss. I've got it
> to allocate huge pages, but it doesn't provide any readahead as such.
> I don't know how to do this right. It's not clear at this point if we
> really need readahead with huge pages. I guess it's good enough for now.

Typically read-ahead is a loss if you are able to get large allocations on
disk, since you can get at least seek_rate * chunk_size throughput from the
disks even with random IO at that size. With 1MB allocations and 7200 RPM
drives this works out to be about 150MB/s, which is close to the throughput
of these drives already.
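
As a rough sketch of that arithmetic (not part of the patchset): the helper
below just multiplies a seek rate by a chunk size. The function name and the
150 seeks/sec figure are assumptions, picked so that 1MB chunks give the
~150MB/s estimate above; they are not measured values.

```python
def random_io_throughput_mb_s(seeks_per_sec: float, chunk_mb: float) -> float:
    """Estimate throughput as seek_rate * chunk_size.

    Assumes each chunk costs exactly one seek and ignores transfer time,
    so this is a back-of-the-envelope figure only.
    """
    return seeks_per_sec * chunk_mb


# Assumed seek rate for a 7200 RPM drive (hypothetical, chosen to match
# the ~150MB/s figure quoted for 1MB allocations).
ASSUMED_SEEKS_PER_SEC = 150

# Compare a 4k page, the 128k readahead window, a 1MB allocation and a
# 2MB huge page under the same assumed seek rate.
for label, chunk_mb in (("4k", 4 / 1024), ("128k", 128 / 1024),
                        ("1MB", 1.0), ("2MB", 2.0)):
    estimate = random_io_throughput_mb_s(ASSUMED_SEEKS_PER_SEC, chunk_mb)
    print(f"{label:>5}: ~{estimate:.1f} MB/s")
```

Under these assumptions the estimate scales linearly with chunk size, which
is why random IO at 1MB or 2MB granularity already approaches streaming
throughput while 4k random IO does not.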

Cheers, Andreas

> Shadow entries are ignored on allocation -- a recently evicted page is not
> promoted to the active list. Not sure if the current workingset logic is
> adequate for huge pages. On eviction, we split the huge page and set up 4k
> shadow entries as usual.
>
> Unlike tmpfs, ext4 makes use of tags in the radix-tree. The approach I used
> for tmpfs -- 512 entries in the radix-tree per huge page -- doesn't work
> well if we want to have a coherent view on tags. So the first 8 patches of
> the patchset convert tmpfs to use multi-order entries in the radix-tree.
> The same infrastructure is used for ext4.
>
> Encryption doesn't handle huge pages yet. To avoid regressions we just
> disable huge pages for the inode if it has EXT4_INODE_ENCRYPT.
>
> With this version I don't see any xfstests regressions with huge pages enabled.
> Patch with new configurations for xfstests-bld is below.
>
> Tested with 4k, 1k, encryption and bigalloc. All with and without
> huge=always. I think it's reasonable coverage.
>
> The patchset is also in git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugeext4/v2
>
> Please review and consider applying.
>
> [1] http://lkml.kernel.org/r/1465222029-45942-1-git-send-email-kirill.shutemov@xxxxxxxxxxxxxxx
