Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

From: Thomas Schoebel-Theuer
Date: Sat Mar 12 2016 - 05:18:25 EST

On 03/12/2016 08:19 AM, Theodore Ts'o wrote:
On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:

There's a big difference between "give the user rope", and "tie the
rope in a noose and put a banana peel so that the user might stumble
into the rope and hang himself", though.
[...] And then the application has to run
setgid with that group's privileges.

Your concept of hierarchically nesting containers via filesystem instances looks nice to me.

A potential concern could be whether gids are the right implementation for expressing hierarchically nested access permissions in a persistent way.

Your permissions attached to gids are nested (inside of your containers you may have another instance of a completely different gid namespace). They are also persistent, provided your mount flags etc. are restored properly after a crash (by some scripts). However, using gids for this might look like a kind of "misuse" of the original gid concept from the 1970s.

Maybe you currently don't have a better /persistent/ concept for expressing your needs, so your solution could be just fine under the current circumstances.

Introducing a new concept to overcome the current limitations must be done very carefully.

The concerns about information leaks through bad discard semantics could /hypothetically/ be solved at /concept level/ in the following way. Please note that by "concept level" I don't want to imply any particular implementation; this is just a thought experiment for discussing the problems, a "model of thinking":

a) Use a hierarchical namespace for naming subjects, e.g. hypervisorA.containerB.subcontainerC.user9 instead of gid=9

b) Attach actual permissions to each block of the underlying block device (fine-grained object model).

c) Correctly maintain access rights at each hierarchical layer, and for all operations (including discard with whatever semantics). In case some inner instance is untrusted and may do evil things, this will be intercepted / corrected at outer layers (which are more trusted). In essence, the nesting hierarchy is also a hierarchy of trust.
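A minimal sketch of the thought model a) to c), purely for discussion. Everything in it is an assumption made for illustration: the names LabeledDevice and encloses, the tiny block size, and the rule that an outer layer may access whatever its inner layers own. Discard deliberately leaves the stale payload in place, as a device with "bad" discard semantics might, and the label check is what keeps it from leaking:

```python
BLOCK_SIZE = 16  # tiny block size, for illustration only


def encloses(outer, inner):
    """True if `outer` is `inner` itself or one of its enclosing layers."""
    return inner == outer or inner.startswith(outer + ".")


class LabeledDevice:
    """Hypothetical block device that keeps one owner label per block."""

    def __init__(self, nblocks):
        self.data = [bytes(BLOCK_SIZE)] * nblocks
        self.owner = [None] * nblocks  # hierarchical subject name, or None if free

    def _check(self, subject, blk):
        owner = self.owner[blk]
        if owner is not None and not encloses(subject, owner):
            raise PermissionError(f"{subject} may not touch block {blk}")

    def write(self, subject, blk, payload):
        self._check(subject, blk)
        if self.owner[blk] is None:
            self.owner[blk] = subject  # a free block is claimed by its first writer
        self.data[blk] = payload.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE]

    def discard(self, subject, blk):
        self._check(subject, blk)
        self.owner[blk] = None  # the stale payload stays in self.data on purpose

    def read(self, subject, blk):
        if self.owner[blk] is None:
            return bytes(BLOCK_SIZE)  # unowned blocks always read as zeros
        self._check(subject, blk)
        return self.data[blk]


dev = LabeledDevice(4)
user_b = "hypervisorA.containerB.user9"
user_c = "hypervisorA.containerC.user9"
dev.write(user_b, 0, b"secret")
dev.discard(user_b, 0)
print(dev.read(user_c, 0) == bytes(BLOCK_SIZE))  # unrelated user sees only zeros
```

In this toy model the outer subject "hypervisorA" can read blocks owned by anything nested below it, while the two user9 subjects in different containers cannot see each other's blocks, before or after a discard.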

With this model, information leaks caused by bad discard semantics etc. should be prevented at every level, even between completely unrelated containers or users, as long as no physical access to the disk is possible. Encryption may additionally be used to cover even that case.

Of course, a direct implementation of such extremely fine-grained access permissions would carry way too much overhead. Both the number of subjects and the number of objects must be reduced to some reasonable order of magnitude, at least at outer levels.
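To put a rough number on that overhead, here is a back-of-the-envelope calculation. All figures are assumptions chosen for illustration (a 10 TiB device, 4 KiB blocks, an 8-byte label per object, and 1 GiB as an example of a coarser object size); none of them come from a real deployment:

```python
def label_overhead(device_bytes, granule_bytes, label_bytes=8):
    """Return (number of labels, total label bytes) for one label per granule."""
    count = -(-device_bytes // granule_bytes)  # ceiling division
    return count, count * label_bytes


TIB = 1 << 40
fine = label_overhead(10 * TIB, 4096)       # one label per 4 KiB block
coarse = label_overhead(10 * TIB, 1 << 30)  # one label per 1 GiB region
print(fine)    # (2684354560, 21474836480), i.e. about 20 GiB of labels
print(coarse)  # (10240, 81920), i.e. 80 KiB of labels
```

Under these assumptions, per-block labels cost gigabytes of metadata before any lookup cost is counted, while a coarse granularity fits in a few memory pages.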

Thus the question is: how can we achieve almost the same effect with much less overhead?

Hmm, in my old Athomux research prototype, I proposed some solutions for this, as an academic greenfield design. But I am unsure what is transferable to a system with standard POSIX semantics, and what is not. Rethinking these concepts as well as checking them may take some time...

Here is a first alpha-stage attempt:

1) Give up the hierarchical subject namespace of a), but maybe not fully. Access checking will continue /locally/ at each layer, by treating each subsystem as a (grey) black box. This is already the default implementation strategy. The total system may be less secure than an idealized fine-grained system, because outer levels can no longer detect bad guys inside of their subsystem instances. The question is: how can we get a system that is "more secure" than the current one, with reasonable effort?

2) Do some /coarse/ access permission checks at the block layer, as in b), but finer than today. Currently there is almost no checking at all (except during open(), when access to a huge block device as a whole is checked => at 1&1 we have very large ones, and they may continue running for years). I am unsure how to achieve this in detail.

An idea for a long-term solution would be to offload "allocation groups" to the block layer (if their size is coarsely dynamic in general, e.g. in steps of gigabytes), and to implement some coarse permission checks there. These could then be related to "containers" or "container groups". One of the problems is that some widespread network protocols like iSCSI have no clue about this, so this can only be an optional new feature.
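As a rough sketch of what such a coarse check could look like (the class GroupTable, its methods, and the fixed 1 GiB group size are all hypothetical, not an existing block-layer interface): the block layer keeps one owner per allocation group and allows a discard only if every group the byte range touches belongs to the requesting container.

```python
GROUP_SIZE = 1 << 30  # assumed allocation-group size: one gigabyte


class GroupTable:
    """Hypothetical per-device table mapping allocation groups to containers."""

    def __init__(self, device_bytes):
        self.device_bytes = device_bytes
        ngroups = -(-device_bytes // GROUP_SIZE)  # ceiling division
        self.owner = [None] * ngroups  # None means "not assigned to anyone"

    def assign(self, group, container):
        self.owner[group] = container

    def groups_for(self, offset, length):
        """Indices of all groups touched by the byte range [offset, offset+length)."""
        first = offset // GROUP_SIZE
        last = (offset + length - 1) // GROUP_SIZE
        return range(first, last + 1)

    def may_discard(self, container, offset, length):
        """Allow the discard only if every touched group belongs to `container`."""
        if length <= 0 or offset < 0 or offset + length > self.device_bytes:
            return False
        return all(self.owner[g] == container
                   for g in self.groups_for(offset, length))


table = GroupTable(4 * GROUP_SIZE)
table.assign(0, "containerB")
table.assign(1, "containerB")
table.assign(2, "containerC")
print(table.may_discard("containerB", 0, 2 * GROUP_SIZE))           # True
print(table.may_discard("containerB", GROUP_SIZE, GROUP_SIZE + 1))  # False
```

The second call is refused because the range spills one byte into a group owned by containerC. A real design would also have to handle groups that change size dynamically, which this sketch ignores.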

Further ideas sought.

Cheers, Thomas

P.S. The concept of a "nest" in Athomux was already some kind of "recursively nested block device".