On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote:
> On 01/22/2014 01:13 PM, James Bottomley wrote:
> > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
> > > On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
> > > > On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:
> > >
> > > [ I like big sectors and I cannot lie ]
> >
> > I think I might be sceptical, but I don't think that's showing in my
> > concerns ...
> >
> > > > > I really think that if we want to make progress on this one, we need
> > > > > code and someone that owns it. Nick's work was impressive, but it was
> > > > > mostly there for getting rid of buffer heads. If we have a device that
> > > > > needs it and someone working to enable that device, we'll go forward
> > > > > much faster.
> > > >
> > > > Do we even need to do that (eliminate buffer heads)? We cope with 4k
> > > > sector only devices just fine today because the bh mechanisms now
> > > > operate on top of the page cache and can do the RMW necessary to update
> > > > a bh in the page cache itself which allows us to do only 4k chunked
> > > > writes, so we could keep the bh system and just alter the granularity of
> > > > the page cache.
> > >
> > > We're likely to have people mixing 4K drives and <fill in some other
> > > size here> on the same box. We could just go with the biggest size and
> > > use the existing bh code for the sub-pagesized blocks, but I really
> > > hesitate to change VM fundamentals for this.
> >
> > If the page cache had a variable granularity per device, that would cope
> > with this. It's the variable granularity that's the VM problem.
> >
> > > From a pure code point of view, it may be less work to change it once in
> > > the VM. But from an overall system impact point of view, it's a big
> > > change in how the system behaves just for filesystem metadata.
> >
> > Agreed, but only if we don't do RMW in the buffer cache ... which may be
> > a good reason to keep it.
> >
> > > > The other question is if the drive does RMW between 4k and whatever its
> > > > physical sector size, do we need to do anything to take advantage of
> > > > it ... as in what would altering the granularity of the page cache buy
> > > > us?
> > >
> > > The real benefit is when and how the reads get scheduled. We're able to
> > > do a much better job pipelining the reads, controlling our caches and
> > > reducing write latency by having the reads done up in the OS instead of
> > > the drive.
> >
> > I agree with all of that, but my question is still can we do this by
> > propagating alignment and chunk size information (i.e. the physical
> > sector size) like we do today. If the FS knows the optimal I/O patterns
> > and tries to follow them, the odd cockup won't impact performance
> > dramatically. The real question is can the FS make use of this layout
> > information *without* changing the page cache granularity? Only if you
> > answer me "no" to this do I think we need to worry about changing page
> > cache granularity.
> >
> > Realistically, if you look at what the I/O schedulers output on a
> > standard (spinning rust) workload, it's mostly large transfers.
> > Obviously these are misaligned at the ends, but we can fix some of that
> > in the scheduler. Particularly if the FS helps us with layout. My
> > instinct tells me that we can fix 99% of this with layout on the FS + io
> > schedulers ... the remaining 1% goes to the drive as needing to do RMW
> > in the device, but the net impact to our throughput shouldn't be that
> > great.
> >
> > James
>
> I think that the key to having the file system work with larger
> sectors is to create them properly aligned and use the actual, native
> sector size as their FS block size. Which is pretty much back to the
> original challenge.

Only if you think laying out stuff requires block size changes. If a 4k
block filesystem's allocation algorithm tried to allocate on a 16k
boundary for instance, that gets us a lot of the performance without
needing a lot of alteration.

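
To make the 16k-boundary idea concrete, here is a minimal sketch of what an alignment-aware allocator hint could compute. It is not taken from any real filesystem: `align_up`, `FS_BLOCK` and `PHYS_BLOCK` are hypothetical names, and the 4k/16k values are just the figures from the example above.

```python
FS_BLOCK = 4 * 1024     # filesystem block size from the example (4k)
PHYS_BLOCK = 16 * 1024  # assumed physical sector size from the example (16k)


def align_up(block_nr, fs_block=FS_BLOCK, phys_block=PHYS_BLOCK):
    """Round a filesystem block number up to the next physical-sector boundary.

    Hypothetical helper: a real allocator would also have to weigh free
    space, fragmentation and locality, none of which is modelled here.
    """
    if phys_block % fs_block != 0:
        raise ValueError("physical block size must be a multiple of the fs block size")
    ratio = phys_block // fs_block  # fs blocks per physical sector (4 here)
    return (block_nr + ratio - 1) // ratio * ratio


if __name__ == "__main__":
    # Candidate start blocks and where an aligned allocation would begin.
    for candidate in (0, 1, 4, 5, 13):
        print(candidate, "->", align_up(candidate))
```

The block size stays at 4k; only the starting block of an extent is nudged onto a 16k boundary, which is the "layout without block size changes" being argued for.
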
It's not even obvious that an ignorant 4k layout is going to be so
bad ... the RMW occurs only at the ends of the transfers, not in the
middle. If we say 16k physical block and average 128k transfers,
probabilistically we misalign on 6 out of 31 sectors (or 19% of the
time). We can make that better by increasing the transfer size (it
comes down to 10% for 256k transfers).
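
The back-of-envelope estimate above can be reproduced with a few lines. This is a sketch under a stated assumption: each end of a transfer is treated independently and may leave up to three 4k sectors in a partially covered 16k block, giving six in total. It divides by the full sector count (32 for 128k) rather than the 31 quoted above, so the results land near, not exactly on, the quoted percentages. `misaligned_fraction` is a hypothetical helper name.

```python
def misaligned_fraction(transfer_bytes, phys_bytes=16 * 1024, logical_bytes=4 * 1024):
    """Pessimistic share of a transfer's logical sectors that may need RMW.

    Assumes up to (phys/logical - 1) logical sectors at the head and the
    same number at the tail sit in partially covered physical blocks,
    with the two ends treated independently.
    """
    per_phys = phys_bytes // logical_bytes   # 4k sectors per physical block
    total = transfer_bytes // logical_bytes  # 4k sectors in the transfer
    partial = 2 * (per_phys - 1)             # head + tail upper bound
    return min(partial, total) / total


if __name__ == "__main__":
    for size_kib in (128, 256, 512):
        frac = misaligned_fraction(size_kib * 1024)
        print(f"{size_kib}k transfer: {frac:.1%} of sectors at the ends")
```

Doubling the transfer size halves the fraction, since the number of sectors at the ends is fixed by the physical block size and not by the transfer length.
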

> Teaching each and every file system to be aligned at the storage
> granularity/minimum IO size when that is larger than the physical
> sector size is harder I think.

But you're making assumptions about needing larger block sizes. I'm
asking what can we do with what we currently have? Increasing the
transfer size is a way of mitigating the problem with no FS support
whatever. Adding alignment to the FS layout algorithm is another. When
you've done both of those, I think you're already at the 99% aligned
case, which is "do we need to bother any more" territory for me.