Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

From: Djalal Harouni
Date: Wed Feb 15 2017 - 04:34:01 EST


On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> writes:
>
>> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>>> > > wrote:
>>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>>> > > > wrote:
>>> > > > > > Another option would be to require something like a
>>> > > > > > project as used for project quotas as the root. This would
>>> > > > > > also be convenient as it could store the used remapping
>>> > > > > > tables.
>>> > > > >
>>> > > > > So this would be like the current project quota except set on
>>> > > > > a subtree? I could see it being done that way but I don't
>>> > > > > see what advantage it has over using flags in the subtree
>>> > > > > itself (the mapping is known based on the mount namespace, so
>>> > > > > there's really only a single bit of information to store).
>>> > > >
>>> > > > projects (which are the underlying concept for project quotas)
>>> > > > are per-subtree in practice - the flag is set on an inode and
>>> > > > then all directories and files underneath inherit the project
>>> > > > ID, and hardlinking outside a project is prohibited.
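
For reference, the userspace side of that inheritance is the generic
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR interface from <linux/fs.h>. A
minimal sketch of marking a directory as a project root (the helper
name and error handling are illustrative only):

#include <linux/fs.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>

static int set_project_id(const char *dir, __u32 projid)
{
	struct fsxattr fsx;
	int fd = open(dir, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		goto out_err;
	fsx.fsx_projid = projid;		/* project ID for this subtree */
	fsx.fsx_xflags |= FS_XFLAG_PROJINHERIT;	/* new children inherit it */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		goto out_err;
	close(fd);
	return 0;
out_err:
	close(fd);
	return -1;
}
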
>>> > >
>>> > > I'm interested in having a VFS-level way to do more than just a
>>> > > shift; I'd like to be able to arbitrarily remap IDs between
>>> > > what's on disk and the system IDs.
>>> >
>>> > OK, so the shift is effectively an arbitrary remap because it
>>> > allows multiple ranges to be mapped (although the userns currently
>>> > imposes a maximum of five extents; that limit is somewhat arbitrary
>>> > and only meant to bound the amount of space the parametrisation
>>> > takes). See
>>> > kernel/user_namespace.c:map_id_up/down()
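
To make the extent parametrisation concrete, here is a rough userspace
sketch of the lookup that map_id_up()/map_id_down() perform over that
small extent array (struct id_extent and map_id are illustrative names,
not the kernel's own):

#include <stdint.h>

struct id_extent {
	uint32_t first;		/* start of the range on one side */
	uint32_t lower_first;	/* corresponding start on the other side */
	uint32_t count;		/* length of the range */
};

/* Translate 'id' through 'nr' extents; (uint32_t)-1 means "unmapped". */
static uint32_t map_id(const struct id_extent *map, unsigned int nr,
		       uint32_t id)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return id - map[i].first + map[i].lower_first;
	}
	return (uint32_t)-1;
}
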
>>> >
>>> > > If we're talking about developing a VFS-level solution for
>>> > > this, I'd like to avoid limiting it to just a shift. (A
>>> > > shift/range would definitely be the simplest solution for many
>>> > > common container cases, but not all.)
>>> >
>>> > I assume the above satisfies you on this point, but raises the
>>> > question: do you want an arbitrary shift not parametrised by a user
>>> > namespace? If so, how many such shifts do you want ... giving some
>>> > details of the use case would be helpful.
>>>
>>> The limit of five extents means this may not work in the most general
>>> case, no.
>>
>> That's not an API limit, so it can be changed if there's a need. The
>> problem was merely how to parametrise a mapping without taking too much
>> space.
>>
>>> One use case: given an on-disk filesystem, its name-to-number
>>> mapping, and your host name-to-number mapping, mount the filesystem
>>> with all the UIDs bidirectionally mapped to those on your host
>>> system.
>>
>> This is pretty much what the s_user_ns does.
>>
>>> Another use case: given an on-disk filesystem with potentially
>>> arbitrary UIDs (not necessarily in a clean contiguous block), and a
>>> pile of unprivileged UIDs, mount the filesystem such that every on
>>> -disk UID gets a unique unprivileged UID.
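
Both of those use cases would then come down to writing a suitable
uid_map/gid_map for the user namespace the mount is done from. A purely
illustrative /proc/<pid>/uid_map, one "inside outside count" extent per
line (at most five today):

     0      100000   1000
     2000   5000     1
     65534  65534    1

i.e. on-disk ids 0-999 appear as 100000-100999 on the host, on-disk
2000 is mapped individually, and "nobody" stays identical; the same
table drives both directions via map_id_up()/map_id_down().
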
>>
>> So is this. Basically anything that begins by mounting gets a super
>> block and can use the s_user_ns to map from the filesystem view to the
>> kernel view of ids. Apart from greater sophistication in the
>> parametrisation, it sounds like we have all the machinery you need.
>> I'm sure the containers people will consider reasonable patches to
>> change this.
>
> Yes.
>
> And to be clear we have all of that merged now and mostly present and
> hooked up in all filesystems without any shiftfs like changes needed.
>
> To use this with a filesystem, a final pass is needed to verify that
> the cases where something does not map are handled cleanly.
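
For a filesystem that translation point is essentially
i_uid_write()/i_uid_read(), which go through sb->s_user_ns. A rough
sketch of the read-in side and of the "does not map" check
(example_fill_inode is not from any real filesystem, and the errno
choice is just illustrative):

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/uidgid.h>

static int example_fill_inode(struct inode *inode, uid_t raw_uid,
			      gid_t raw_gid)
{
	/* Map the raw on-disk ids through sb->s_user_ns into kuid/kgid. */
	i_uid_write(inode, raw_uid);
	i_gid_write(inode, raw_gid);

	/* Ids with no mapping become INVALID_UID/INVALID_GID and must be
	 * handled explicitly (refused, squashed, made read-only, ...). */
	if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
		return -EOVERFLOW;

	return 0;
}
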

Still, this does not answer the question of how to dynamically
*attach/share* data or read-only volumes, as defined by
orchestration/container tools, into several containers. Am I missing
something, or is the plan to have a separate superblock mount for each
one?

--
tixxdz