Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Matthew Wilcox
Date: Wed Jan 29 2014 - 23:53:27 EST
On Fri, Jan 24, 2014 at 10:57:48AM +0000, Mel Gorman wrote:
> So far on the table is
>
> 1. major filesystem overhaul
> 2. major vm overhaul
> 3. use compound pages as they are today and hope it does not go
> completely to hell, reboot when it does
Is the below paragraph an exposition of option 2, or is it an option 4,
change the VM unit of allocation? Other than the names you're using,
this is basically what I said to Kirill in an earlier thread; either
scrap the difference between PAGE_SIZE and PAGE_CACHE_SIZE, or start
making use of it.
The fact that EVERYBODY in this thread has been using PAGE_SIZE when they
should have been using PAGE_CACHE_SIZE makes me wonder if part of the
problem is that the split in naming went the wrong way, i.e. use PTE_SIZE
for 'the amount of memory pointed to by a pte_t' and use PAGE_SIZE for
'the amount of memory described by a struct page'.
(we need to remove the current users of PTE_SIZE; sparc32 and powerpc32,
but that's just a detail)
And we need to fix all the places that are currently getting the
distinction wrong. SMOP ... ;-) What would help is correct typing of
variables, possibly with sparse support to help us out. Big Job.
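As a rough illustration of what "correct typing" could look like, here is a
minimal userspace sketch, not kernel code. It wraps counts in two distinct
struct types so that plain C refuses to mix them up. Every name here
(SKETCH_PTE_SHIFT, SKETCH_PAGE_SHIFT, pte_count_t, page_count_t and the
helpers) is hypothetical, and the 4 KiB / 16 KiB sizes are assumed purely for
the example. In the kernel, something like sparse's __bitwise annotation might
play a similar role, but that is an assumption, not a worked-out proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for illustration only: 4 KiB per pte_t, 16 KiB per struct page. */
#define SKETCH_PTE_SHIFT  12
#define SKETCH_PAGE_SHIFT 14

/* Distinct wrapper types: passing one where the other is expected is a compile error. */
typedef struct { size_t n; } pte_count_t;   /* count of pte-sized units */
typedef struct { size_t n; } page_count_t;  /* count of struct-page-sized units */

/* Bytes covered by a number of pte-sized units. */
static inline size_t pte_bytes(pte_count_t c)
{
	return c.n << SKETCH_PTE_SHIFT;
}

/* Bytes covered by a number of struct-page-sized units. */
static inline size_t page_bytes(page_count_t c)
{
	return c.n << SKETCH_PAGE_SHIFT;
}

/* The only sanctioned conversion: round a pte count up to whole pages. */
static inline page_count_t ptes_to_pages(pte_count_t c)
{
	size_t per_page = (size_t)1 << (SKETCH_PAGE_SHIFT - SKETCH_PTE_SHIFT);
	page_count_t r = { (c.n + per_page - 1) / per_page };

	assert(per_page >= 1);
	return r;
}
```

With this in place, page_bytes() called with a pte_count_t fails to compile,
which is the class of mistake described above.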
> That's why I suggested that it may be necessary to change the basic unit of
> allocation the kernel uses to be larger than the MMU page size and restrict
> how the sub pages are used. The requirement is to preserve the property that
> "with the exception of slab reclaim that any reclaim action will result
> in K-sized allocation succeeding" where K is the largest blocksize used by
> any underlying storage device. From an FS perspective then certain things
> would look similar to what they do today. Block data would be on physically
> contiguous pages, buffer_heads would still manage the case where block_size
> <= PAGEALLOC_PAGE_SIZE (as opposed to MMU_PAGE_SIZE), particularly for
> dirty tracking and so on. The VM perspective is different because now it
> has to handle MMU_PAGE_SIZE in a very different way, page reclaim of a page
> becomes multiple unmap events and so on. There would also be anomalies such
> as mlock of a range smaller than PAGEALLOC_PAGE_SIZE becomes difficult if
> not impossible to sensibly manage because mlock of a 4K page effectively
> pins the rest and it's not obvious how we would deal with the VMAs in that
> case. It would get more than just the storage gains though. Some of the
> scalability problems that deal with massive amount of struct pages may
> magically go away if the base unit of allocation and management changes.
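To put rough numbers on the mlock and reclaim anomalies in the quoted
paragraph, here is a small userspace sketch. It assumes a 4 KiB MMU page and a
64 KiB allocation unit; both values and all the SKETCH_* and function names
are hypothetical, chosen only to show the arithmetic, not taken from any
actual proposal.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for illustration only. */
#define SKETCH_MMU_PAGE_SIZE       4096u
#define SKETCH_PAGEALLOC_PAGE_SIZE 65536u

/*
 * Bytes effectively pinned when [start, start + len) is locked and pinning
 * can only happen in whole allocation units.
 */
static size_t sketch_pinned_bytes(uintptr_t start, size_t len)
{
	uintptr_t mask = (uintptr_t)SKETCH_PAGEALLOC_PAGE_SIZE - 1;
	uintptr_t first, last;

	if (len == 0)
		return 0;

	first = start & ~mask;            /* round start down to a unit boundary */
	last = (start + len - 1) | mask;  /* last byte of the final unit touched */
	return (size_t)(last - first) + 1;
}

/* Upper bound on unmap events when one allocation unit is reclaimed. */
static unsigned int sketch_unmaps_per_reclaim(void)
{
	return SKETCH_PAGEALLOC_PAGE_SIZE / SKETCH_MMU_PAGE_SIZE;
}
```

Under these assumptions, locking a single 4 KiB page pins 64 KiB, and an 8 KiB
range that straddles a unit boundary pins 128 KiB.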
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."