Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

From: Howard Chu
Date: Sat Dec 08 2012 - 08:52:47 EST


Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 03:25:53PM -0800, Howard Chu wrote:
>> I have to agree that, if this is going to be an ext4-specific
>> feature, then it can just be implemented via an ext4-specific ioctl
>> and be done with it. But I'm not convinced this should be an
>> ext4-specific feature.
>>
>> As for "fix the problem properly" - you're fixing the wrong problem.
>> This type of feature is important to me, not just because of the
>> performance issue. As has already been pointed out, the performance
>> difference may even be negligible.
>>
>> But on SSDs, the issue is write endurance. The whole point of
>> preallocating a file is to avoid doing incremental metadata updates.
>> Particularly when each of those 1-bit status updates costs entire
>> blocks, and gratuitously shortens the life of the media. The fact
>> that avoiding the unnecessary wear and tear may also yield a
>> performance boost is just icing on the cake. (And if the perf boost
>> is over a factor of 2:1 that's some pretty damn good icing.)
>
> That's a filesystem implementation specific problem, not a generic
> fallocate() or unwritten extent conversion problem.
>
> Besides, ext4 doesn't write back every metadata modification that is
> made - they are aggregated in memory and only written when the
> journal is full or the metadata ages out. Hence unwritten extent
> conversion has very little impact on the amount of writes that are
> done to the flash because it is vastly dominated by the data writes.
>
> Similarly, in XFS you might see a few thousand or tens of thousands
> of metadata blocks get written once every 30s under such a random
> write workload, but each metadata block might have gone through a
> million changes in memory since the last time it was written.
> Indeed, in that 30s, there would have been a few million random data
> writes so the metadata writes are well and truly lost in the
> noise...

That's only true if write caching is allowed. If you have a transactional database running, it's syncing every transaction to media.
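A minimal sketch of the workload in question, assuming a POSIX system: the file is preallocated with posix_fallocate(), then each fixed-size record (standing in for one committed transaction) is written with pwrite() and forced to media with fdatasync() before the next one starts. The function name run_synced_transactions and the record layout are hypothetical, chosen only for illustration. On ext4 or XFS the preallocated range typically consists of unwritten extents, so each synced write also has to make the conversion of the extent it touched durable; that metadata cannot simply sit in memory for 30s as in the write-cached case described above.

```c
#define _POSIX_C_SOURCE 200809L /* posix_fallocate, pwrite, fdatasync, mkstemp */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: preallocate ntx * txsize bytes at the start of
 * the file, then write one txsize-byte record per "transaction" and
 * sync after each one, as a transactional database would.
 * Returns 0 on success, -1 on any error. */
int run_synced_transactions(const char *path, size_t ntx, size_t txsize)
{
    assert(ntx > 0 && txsize > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() reports failure through its return value (an
     * errno code), not via errno. On ext4/XFS this typically allocates
     * unwritten extents (assumption about the underlying filesystem). */
    off_t total = (off_t)(ntx * txsize);
    if (posix_fallocate(fd, 0, total) != 0) {
        close(fd);
        return -1;
    }

    char *buf = malloc(txsize);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    for (size_t i = 0; i < ntx; i++) {
        /* Fill record i with a recognizable byte: 'A', 'B', 'C', ... */
        memset(buf, 'A' + (int)(i % 26), txsize);
        off_t off = (off_t)(i * txsize);

        /* fdatasync() must also persist metadata needed to read the data
         * back, which includes the unwritten-to-written extent conversion
         * for the region just written. A short write is treated as an
         * error in this sketch. */
        if (pwrite(fd, buf, txsize, off) != (ssize_t)txsize ||
            fdatasync(fd) != 0) {
            free(buf);
            close(fd);
            return -1;
        }
    }

    free(buf);
    return close(fd);
}
```

For example, run_synced_transactions("/path/to/datafile", 1024, 4096) issues 1024 separate syncs, each of which may carry its own journal commit for the extent conversion, whereas the same writes without per-record syncs would let those updates be aggregated in memory.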

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/