Re: [PATCH 1/2] pidmap(2)

From: Djalal Harouni
Date: Thu Sep 07 2017 - 01:06:08 EST


Hi Alexey,

On Thu, Sep 7, 2017 at 4:04 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Wed, Sep 6, 2017 at 2:04 AM, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:
>> On 9/6/17, Randy Dunlap <rdunlap@xxxxxxxxxxxxx> wrote:
>>> On 09/05/17 15:53, Andrew Morton wrote:
[...]
>>>
>>> also, I expect that the tiny kernel people will want kconfig options for
>>> these syscalls.
>>
>> We'll add it but the question if it is a good idea. Ideally these system calls
>> should be mandatory and /proc optional.
>>
>> $ size kernel/pidmap.o fs/fdmap.o
>>    text    data     bss     dec     hex filename
>>     560       0       0     560     230 kernel/pidmap.o
>>     617       0       0     617     269 fs/fdmap.o
>
> After much discussion at LPC/KS last year, I thought the idea was to
> try to speed up /proc rather than replacing it outright. The two
> specific ideas I recall were:
>
> 1. Add a syscall like readfileat() that you can use to, in a single
> operation, open, read, and close a /proc file (or other file). This
> should vastly reduce locking and RCU overhead.
>
> 2. Add a /proc file that has a nice binary format for task info. (nl_attr?)
>
> I don't see why pidmap() deserves to be significantly faster than getdents().
>
> Also, a pidmap() syscall like this inherently bypasses any security
> restrictions implied by the way that /proc is mounted. It can respect
> hidepid, but hidepid (as a per-namespace concept) is an enormous turd
> that badly needs to be deprecated, and Djalal is working on exactly
> that.

Yes, as Andy noted, Alexey Gladkov and I are working on modernizing
procfs [1] and on reducing or removing its ties to pid namespaces, which
cause a lot of problems today.

We just picked the task up again. It came out of a discussion with Andy
some months ago on how to improve hidepid, and also on how to improve
procfs in general, so that we can add other mechanisms: hide, or return
NULL on, /proc/_file_not_needed_by_containers_ or
/proc/_specific_module_files_ (everything that is not virtualized), or
mount only a specific view of the whole /proc API. Containers will use
this as well. It should also make things harder for attackers, since we
plan to offer backward-compatible options for how some of these files
are treated with regard to certain namespaces.

A readfileat()-style syscall that does everything in one operation is
definitely a nice addition. But in general it would be better to treat
/proc as a filesystem and not add other special-purpose interfaces that
may abstract it together with the pidns. That is the situation we have
now, and from a userspace perspective it makes /proc hard to use,
especially in a security context.
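
To make the cost concrete, here is a rough sketch of what userspace has
to do today to read one small /proc file: at least three syscalls (open,
read, close), which a readfileat()-like call would collapse into one. The
helper name slurp_file() is made up for illustration, and nothing here
reflects the actual signature of readfileat(), which this thread does not
specify.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not an existing API: read up to len - 1 bytes of
 * path into buf and NUL-terminate it. Returns the number of bytes read,
 * or -1 with errno set.
 */
static ssize_t slurp_file(const char *path, char *buf, size_t len)
{
	size_t total = 0;
	ssize_t n;
	int fd, saved;

	if (len == 0) {
		errno = EINVAL;
		return -1;
	}

	fd = open(path, O_RDONLY);		/* syscall 1: open */
	if (fd < 0)
		return -1;

	while (total < len - 1) {
		/* syscall 2 (possibly repeated): read */
		n = read(fd, buf + total, len - 1 - total);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			saved = errno;
			close(fd);
			errno = saved;
			return -1;
		}
		if (n == 0)			/* end of file */
			break;
		total += (size_t)n;
	}
	buf[total] = '\0';

	close(fd);				/* last syscall: close */
	return (ssize_t)total;
}
```

A monitoring tool that walks every /proc/<pid>/stat pays this open/read/close
round trip once per task, which is where a single-call interface would help
without adding anything outside the filesystem view of /proc.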

Alexey, could you please Cc us in the future? Thank you very much!


[1] https://lkml.org/lkml/2017/4/25/282

--
tixxdz