Re: [PATCH] inotify: Convert to using per-namespace limits

From: Eric W. Biederman
Date: Mon Oct 10 2016 - 19:08:27 EST


Nikolay Borisov <kernel@xxxxxxxx> writes:

> On Mon, Oct 10, 2016 at 11:49 PM, Eric W. Biederman
> <ebiederm@xxxxxxxxxxxx> wrote:
>> Jan Kara <jack@xxxxxxx> writes:
>>
>>> On Mon 10-10-16 09:44:19, Nikolay Borisov wrote:
>>>> On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
>>>> > Nikolay Borisov <kernel@xxxxxxxx> writes:
>>>> >
>>>> >> This patchset converts inotify to using the newly introduced
>>>> >> per-userns sysctl infrastructure.
>>>> >>
>>>> >> Currently the inotify instances/watches are being accounted in the
>>>> >> user_struct structure. This means that in setups where multiple
>>>> >> users in unprivileged containers map to the same underlying
>>>> >> real user (i.e. pointing to the same user_struct) the inotify limits
>>>> >> are going to be shared as well, allowing one user (or application) to exhaust
>>>> >> all others' limits.
>>>> >>
>>>> >> Fix this by switching the inotify sysctls to using the
>>>> >> per-namespace/per-user limits. This will allow the server admin to
>>>> >> set sensible global limits, which can further be tuned inside every
>>>> >> individual user namespace.
>>>> >>
>>>> >> Signed-off-by: Nikolay Borisov <kernel@xxxxxxxx>
>>>> >> ---
>>>> >> Hello Eric,
>>>> >>
>>>> >> I saw you've finally sent your pull request for 4.9 and it
>>>> >> includes your implementation of the ucount infrastructure. So
>>>> >> here is my respin of the inotify patches using that.
>>>> >
>>>> > Thanks. I will take a good hard look at this after -rc1 when things are
>>>> > stable enough that I can start a new development branch.
>>>> >
>>>> > I am a little concerned that the old sysctls have gone away. If no one
>>>> > cares it is fine, but if someone depends on them existing that may count
>>>> > as an unnecessary userspace regression. But otherwise skimming through
>>>> > this code it looks good.
>>>>
>>>> So this is indeed a real issue and I meant to write something about
>>>> it. Anyway, in order to preserve those sysctls what can be done is to
>>>> hook them up with a custom sysctl handler taking the ns from the proc
>>>> mount and the euid of current? I think this is a good approach, but
>>>> let's wait and see if anyone will have objections to completely
>>>> eliminating those sysctls.
>>>
>>> Well, I believe just discarding those sysctls is not an option - I'm pretty
>>> sure there are scripts out there which tune these sysctls and those would
>>> stop working. IMO that is not an acceptable regression.
>>
>> Nikolay, there is your objection.
>>
>> So since it should be straightforward let's preserve the existing
>> sysctls. Then this change doesn't need to prove there are no scripts
>> that tweak those sysctls.
>>
>> We are just talking about changing the values in the initial user namespace so
>> it should be completely compatible and straightforward to implement
>> unless I am missing something.
>
> Well I'm not so sure about this. Let's say those sysctls are going to
> modify the ucount values in the init_user_ns. That's fine, however for
> which particular user should they do this? Should it be hardcoded for
> kuid 0, or current_euid? I personally think they should be changing
> the values for the current_euid.

Unless I have missed something, the limits are per user namespace. The
counts are per user in that namespace. Certainly that is what the rest
of the ucount infrastructure is doing.

At which point having the existing sysctls simply update the limit in
the initial user namespace should result in no change.
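
To make the limit/count split concrete, here is a rough userspace model.
It is a sketch only and not the kernel code: struct model_ns, get_count(),
inc_watch() and legacy_sysctl_set_max_watches() are names made up for this
illustration, and the fixed-size table stands in for the real lookup. It
assumes, as I understand the ucount code to do, that a charge is applied
to the user in its own namespace and then to the owning user at each
parent level, with each level checked against that namespace's limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_USERS 8

/* One counter per (namespace, uid) pair: the "count" side. */
struct user_count {
	unsigned int uid;
	int watches;
	bool used;
};

/* Hypothetical stand-in for a user namespace. */
struct model_ns {
	struct model_ns *parent;  /* NULL for the initial namespace */
	unsigned int owner_uid;   /* uid in the parent that created this ns */
	int max_watches;          /* the "limit" side: one per namespace */
	struct user_count counts[MAX_USERS];
};

/* Find or create the counter for uid in ns; NULL if the table is full. */
static struct user_count *get_count(struct model_ns *ns, unsigned int uid)
{
	struct user_count *free_slot = NULL;

	for (size_t i = 0; i < MAX_USERS; i++) {
		if (ns->counts[i].used && ns->counts[i].uid == uid)
			return &ns->counts[i];
		if (!ns->counts[i].used && !free_slot)
			free_slot = &ns->counts[i];
	}
	if (free_slot) {
		free_slot->used = true;
		free_slot->uid = uid;
		free_slot->watches = 0;
	}
	return free_slot;
}

/*
 * Charge one watch to uid in ns and to the owner at every parent level.
 * Returns false, leaving nothing charged, if any level is at its limit.
 */
static bool inc_watch(struct model_ns *ns, unsigned int uid)
{
	struct model_ns *iter = ns;
	unsigned int iter_uid = uid;

	while (iter) {
		struct user_count *c = get_count(iter, iter_uid);

		if (!c || c->watches >= iter->max_watches)
			goto undo;
		c->watches++;
		iter_uid = iter->owner_uid;
		iter = iter->parent;
	}
	return true;

undo:
	/* Roll back the levels below the one that refused the charge. */
	for (struct model_ns *u = ns; u != iter; u = u->parent) {
		get_count(u, uid)->watches--;
		uid = u->owner_uid;
	}
	return false;
}

/* What a preserved legacy sysctl write would amount to in this model. */
static void legacy_sysctl_set_max_watches(struct model_ns *init_ns, int value)
{
	assert(init_ns && init_ns->parent == NULL);
	init_ns->max_watches = value;
}
```

In this model a write to the old sysctl touches only the limit of the
initial namespace. It never has to pick a user, because the per-user
counters are simply compared against whatever the namespace limit
currently is.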

Eric