Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
From: Darrick J. Wong
Date: Wed Mar 16 2016 - 22:42:44 EST
On Thu, Mar 17, 2016 at 12:01:16PM +1100, Dave Chinner wrote:
> On Tue, Mar 15, 2016 at 06:51:39PM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 15, 2016 at 06:52:24PM -0400, Theodore Ts'o wrote:
> > > On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
> > > >
> > > > Stale data escaping containment is a security issue. Enabling
> > > > generic kernel mechanisms to *enable containment escape* is
> > > > fundamentally wrong, and relying on userspace to Do The Right Thing
> > > > is even more of a gamble, IMO.
> > >
> > > We already have generic kernel mechanisms such as "the block device". P
> > >
> > > > It's a practical concern because if we enable this functionality in
> > > > fallocate because it will get used by more than just special storage
> > > > apps. i.e. this can't be waved away with "only properly managed
> > > > applications will use it" arguments.
> > >
> > > It requires a mount option. How is this going to allow random
> > > applications to use this feature, again?
> > >
> > > > I also don't make a habit of publicising the fact that since we
> > > > disabled the "-d unwritten=X" mkfs parameter (because of speed racer
> > > > blogs such as the above and configuration cargo-culting resulting in
> > > > unsuspecting users exposing stale data unintentionally) that the
> > > > functionality still exists in the kernel code and that it only takes
> > > > a single xfs_db command to turn off unwritten extents in XFS. i.e.
> > > > we can easily make fallocate on XFS expose stale data, filesystem
> > > > wide, without requiring mount options, kernel or application
> > > > modifications.
> > >
> > > So you have something even more dangerous in XFS and it's in the
> > > kernel tree? Has Red Hat threatened to have a distro-specific patch
> >
> > xfs_db is the XFS debugger, so you can only enable that bit of functionality
> > with magical commands, which IMHO isn't much different than people messing
> > with their ext4 filesystems with debugfs. You had better know what you're
> > doing and if you break the filesystem you can eat both pieces. :P
> >
> > > to comment out this code to make sure irresponsible users can't use
> > > it? What I've been suggesting has even more controls that what you
> > > have. And I've been keeping it as an out-of-tree kernel patch mainly
> > > because you've been arguing that it's such a horrible thing.
> >
> > One could lock it down even more -- hide it behind a Kconfig option that
> > depends on CONFIG_EXPERT=y and itself defaults to n, require a mount option,
> > only allow the file owner to call no-hide-stale and only if the file is 0600
> > (or the appropriate group equivalents like Ted's existing patch), and upon
> > adding stale extents, set an inode flag that locks uid/gid/mode/flags.
> > Obviously root can still get to the file, but at least there's hard evidence
> > that one is entering the twilight zone.
> >
> > > > Making Google's hack more widely available through the fallocate
> > > > API is entirely dependent on proving that:
> > >
> > > Ceph is about to completely bypass the file system because of your
> > > intransigence, and reimplement a userspace file system. They seem to
> > > believe it's necessary. I'll let them make the case, because they
> > > seem to think it's necessary. And if not, if Linus sides with you,
> > > and doesn't want to take the patch, I'll just keep it as a
> > > Google-specific out-of-tree patch. I don't *need* to have this thing
> > > upstream.
> >
> > Frankly, I second Eric Sandeen's comments -- just how bad is ext4's
> > unwritten extent conversion for these folks?
> >
> > I ran this crappy looping microbenchmark against a ramdisk:
> > fallocate 400M
> > write 400M
> > fsync
> > rewrite the 400M
> > fsync
> > on kernel 4.5.
> >
> > For writing 400M through the page cache in 4k chunks,
> > ext4: ~460MB/s -> ~580MB/s (~20%)
> > XFS: ~660 -> ~870 (~25%)
> > btrfs: ~130 -> ~200 (~35%)
> >
> > For writing 400M in 80M chunks,
> > ext4: ~480MB/s -> ~720MB/s (~30%)
> > XFS: ~1GB/s -> ~1.5GB/s (~35%)
> > btrfs: ~590MB/s -> ~590MB/s (no change)
> >
> > For directio writing 400MB in 4k chunks,
> > ext4: 25MB/s -> 26MB/s (~5%)
> > XFS: 25 -> 27 (~8%)
> > btrfs: 22 -> 18 (...)
> >
> > For directio writing 1200MB in 80M chunks,
> > ext4: ~2.9GB/s -> ~3.3GB/s (~13%)
> > XFS: 3.2 -> 3.5 (~9%)
> > btrfs: 2.3 -> 2.2 (...)
>
> That's not comparing apples to apples. overwrite does not require
> allocation/extent manipulation at all so it is comparing performance
> of completely different file extent operations. The operations we
> should be comparing are "first write" operations, namely "write over
> hole with allocation" vs "write over preallocation". Overwrite
> performance should be the same regardless of the method used for the
> intial write/allocation.
Eh, ok, let's compare writing an fallocated region vs writing an empty file:
(laptop this time, so the numbers aren't the same)
For writing 400M in 4k chunks,
ext4: ~720MB/s -> 620MB/s (~14%)
XFS: ~560MB/s -> 540MB/s (~4%)
btrfs: ~260 -> 580MB/s
For writing 400M in 80M chunks,
ext4: ~960MB/s -> ~730MB/s (~24%)
XFS: ~1000MB/s -> ~980MB/s (~2%)
btrfs: ~950MB/s -> ~930MB/s (~3%)
For directio writing 1200MB in 80M chunks,
ext4: ~2.9GB/s -> ~2.8GB/s (~4%)
XFS: 3.2 -> 3.2
btrfs: 2.3 -> 2.3
--D