Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

From: Gregory Farnum
Date: Thu Mar 17 2016 - 01:18:37 EST


On Wed, Mar 16, 2016 at 5:33 PM, Eric Sandeen <esandeen@xxxxxxxxxx> wrote:
> I may have lost the thread at this point, with poor Darrick's original
> patch submission devolving into a long thread about a NO_HIDE_STALE patch
> used at Google, but I don't *think* Ceph ever asked for NO_HIDE_STALE.
>
> At least I can't find any indication of that.
>
> Am I missing something? cc'ing Greg on this one in case I am.

Brief background:
Ceph currently has two big local storage subsystems: FileStore and
BlueStore. FileStore is the one that's been around forever and is
currently stable/production-ready/bla bla bla. It represents
RADOS objects as actual files, and while it's *mostly* just converting
object operations into POSIX FS ones, it does rely on a few pieces of
the fs namespace and POSIX ops to do its work.
BlueStore is our new, pure userspace solution (Sage started this about
8 months ago, I think?). It started out using xfs basically as a block
allocator, but at this point it's just doing raw block access 100% in
userspace.

So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
it was one of the problems Sage had using xfs in his BlueStore
implementation and was a big part of why it moved to pure userspace.
FileStore might use NO_HIDE_STALE in some places but it would be
pretty limited. When it came up at Linux FAST we were discussing how
it and similar things had been problems for us in the past and it
would've been nice if they were upstream. What *is* a big deal for
FileStore (and would be easy to take advantage of) is the thematically
similar O_NOMTIME flag, which is also about reducing metadata updates
and got blocked on similar stupid-user grounds (although not security
ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
As noted though, we've basically given up and are moving to a
pure-userspace solution as quickly as we can. So no, Ceph isn't likely
to be a big user of these interfaces as it's too late for us. Adding
them would be an investment for future distributed storage systems
more than current ones. Maybe that's not worth it, or maybe there are
better places to keep them in the kernel. (I think I saw a reference
to some hypothetical shared block allocator? That would be *awesome*.)

=========
Separately: in the particular case of the extents and data leaks, a
coworker of mine suggested you could tag any files which *ever* had
unwritten extents with something that prevents them being read by a
user who doesn't have raw block access (and, even better, let us apply
that flag on file create). That's a weird new security rule for
people to know, and it requires space for tagging (no idea how bad that
is), but it would work in any use cases we have and would not leak
anything the user doesn't already have access to.
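
To make the tagging rule concrete, here is a rough userspace model of
it. This is only a sketch of the policy, not kernel code: the struct
and function names (file_model, user_model, may_read, and so on) are
all made up for illustration, and "raw block access" is reduced to a
single boolean rather than any real permission check.

```c
#include <stdbool.h>

/* Hypothetical per-file state; not a real kernel structure. */
struct file_model {
    bool stale_tag; /* sticky: set at create or on first unwritten extent */
};

/* Hypothetical caller identity, reduced to the one bit that matters here. */
struct user_model {
    bool has_raw_block_access;
};

/* Create a file, optionally applying the tag up front. */
static struct file_model file_create(bool tag_at_create)
{
    struct file_model f = { .stale_tag = tag_at_create };
    return f;
}

/* Giving the file an unwritten extent tags it. */
static void file_alloc_unwritten_extent(struct file_model *f)
{
    f->stale_tag = true;
}

/* Later writes never clear the tag: the rule is about files that
 * *ever* had unwritten extents. */
static void file_write(struct file_model *f)
{
    (void)f;
}

/* A tagged file is readable only by a user with raw block access;
 * untagged files are unaffected by this rule. */
static bool may_read(const struct file_model *f, const struct user_model *u)
{
    return !f->stale_tag || u->has_raw_block_access;
}
```

The point of the sticky tag is that the check stays a single bit per
file: nothing has to track which extents have since been overwritten.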
-Greg