Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Eric W. Biederman
Date: Wed Oct 22 2014 - 13:41:24 EST
David Drysdale <drysdale@xxxxxxxxxx> writes:
> On Tue, Oct 21, 2014 at 5:29 AM, Eric W. Biederman
> <ebiederm@xxxxxxxxxxxx> wrote:
>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>>
>>> On Mon, Oct 20, 2014 at 6:48 AM, David Drysdale <drysdale@xxxxxxxxxx> wrote:
>>>> On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
>>>> <ebiederm@xxxxxxxxxxxx> wrote:
>>>>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>>>>>
>>>>>> [Added Eric Biederman, since I think your tree might be a reasonable
>>>>>> route forward for these patches.]
>>>>>>
>>>>>> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <drysdale@xxxxxxxxxx> wrote:
>>>>>>> Resending, adding cc:linux-api.
>>>>>>>
>>>>>>> Also, it may help to add a little more background -- this patch is
>>>>>>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>>>>>>
>>>>>>> Capsicum is a security framework that has been present in FreeBSD since
>>>>>>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>>>>>>> security [1].
>>>>>>>
>>>>>>> One of the features of Capsicum is capability mode, which locks down
>>>>>>> access to global namespaces such as the filesystem hierarchy. In
>>>>>>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>>>>>>> work -- hence the need for a kernel-space
>>>>>>
>>>>>> I just found myself wanting this syscall for another reason: injecting
>>>>>> programs into sandboxes or otherwise heavily locked-down namespaces.
>>>>>>
>>>>>> For example, I want to be able to reliably do something like nsenter
>>>>>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>>>>>> it is more or less fully functional, so this should Just Work (tm),
>>>>>> except that the toybox binary might not exist in the namespace being
>>>>>> entered. If execveat were available, I could rig nsenter or a similar
>>>>>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>>>>>> execveat.
>>>>>>
>>>>>> Is there any reason that these patches can't be merged more or less as
>>>>>> is for 3.19?
>>>>>
>>>>> Yes. There is a silliness in how it implements fexecve. The fexecve
>>>>> case should use the empty string "" rather than a NULL pointer to
>>>>> indicate that. That change will then harmonize execveat with the other
>>>>> ...at system calls, simplify the code, and remove a special case. I
>>>>> believe using the empty string "" requires implementing the
>>>>> AT_EMPTY_PATH flag.
>>>>
>>>> Good point -- I'll shift to "" + AT_EMPTY_PATH.
>>>
>>> Pending a better idea, I would also see if the patches can be changed
>>> to return an error if d_path ends up with an "(unreachable)" thing
>>> rather than failing inexplicably later on.
>>
>> For my reference we are talking about
>>
>>> @@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
>>>  	sched_exec();
>>>
>>>  	bprm->file = file;
>>> -	bprm->filename = bprm->interp = filename->name;
>>> +	if (filename && fd == AT_FDCWD) {
>>> +		bprm->filename = filename->name;
>>> +	} else {
>>> +		pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>> +		if (!pathbuf) {
>>> +			retval = -ENOMEM;
>>> +			goto out_unmark;
>>> +		}
>>> +		bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
>>> +		if (IS_ERR(bprm->filename)) {
>>> +			retval = PTR_ERR(bprm->filename);
>>> +			goto out_unmark;
>>> +		}
>>> +	}
>>> +	bprm->interp = bprm->filename;
>>>
>>>  	retval = bprm_mm_init(bprm);
>>>  	if (retval)
>>
>> The interesting case for fexecve is when we either don't know what files
>> are present or we don't want to depend on which files are present.
>>
>> As Al pointed out d_path really isn't the right solution. It fails when
>> printing /proc/self/fd/${fd}/${filename->name} would work, and the
>> "(deleted)" or "(unreachable)" strings are wrong.
>>
>> The test for today's cases should be:
>> 	if ((filename->name[0] == '/') || fd == AT_FDCWD) {
>> 		bprm->filename = filename->name;
>> 	}
>>
>> To handle the case where the file descriptor is relevant.
> (s/relevant/irrelevant)
>
> Yep, good spot.
>
>> For the case where the file descriptor is relevant let me suggest
>> setting bprm->filename and bprm->interp to:
>>
>> /dev/fd/${fd}/${filename->name}
>
> I'll send out an updated patchset with this approach, but I have a slight
> reservation. Given that /dev/fd is a symlink to /proc/self/fd, this approach
> means that script invocations will always fail on a /proc-less system,
> where the previous iteration might have worked.
>
> (As it happens, this isn't a restriction that affects the things I'm
> working on, as Capsicum wouldn't allow script invocation anyway.
> However, scenarios without /proc were nominally one of the motivating
> factors for execveat in the first place...)
Which is where Al Viro's and Peter Anvin's conversation about a
minimal filesystem that can serve the needs of /proc/self/fd comes in.
There are uses for execveat with static executables, so I think execveat
is justified. But having a dupfs that we could potentially mount on
/dev/fd would be interesting, as it is much less of a security
concern than /proc with all of the interfaces it provides.
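
For illustration only, the naming rule proposed in the quoted text can be
sketched in userspace C. This is not the patch: exec_display_name() is a
hypothetical helper, and using a bare /dev/fd/N when the name is the empty
string (the "" + AT_EMPTY_PATH case) is an assumption on my part.

```c
#include <fcntl.h>   /* AT_FDCWD */
#include <stddef.h>  /* size_t */
#include <stdio.h>   /* snprintf */

#ifndef AT_FDCWD
#define AT_FDCWD -100  /* Linux value, used only if fcntl.h hides it */
#endif

/*
 * Hypothetical helper: build the name that would end up in
 * bprm->filename. Returns 0 on success, -1 if buf is too small.
 */
static int exec_display_name(int fd, const char *name, char *buf, size_t len)
{
	int n;

	if (name[0] == '/' || fd == AT_FDCWD)
		/* The file descriptor is irrelevant: keep the name as given. */
		n = snprintf(buf, len, "%s", name);
	else if (name[0] == '\0')
		/* Assumption: empty name means "the fd itself". */
		n = snprintf(buf, len, "/dev/fd/%d", fd);
	else
		/* The file descriptor is relevant: name is relative to it. */
		n = snprintf(buf, len, "/dev/fd/%d/%s", fd, name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With fd 5 and the name "script.sh" this yields /dev/fd/5/script.sh, which
only resolves if something is mounted at /dev/fd. That is the /proc-less
concern raised above.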
>> It is more a description of what we have done but as a magic string it
>> is descriptive. Documentation/devices.txt documents that /dev/fd/ should
>> exist, making it an unambiguous path. Further these days the kernel
>> sets the device naming policy in dev, so I think we are strongly safe in
>> using that path in any event.
>>
>> I think execveat is interesting in the kernel because the motivating
>> cases are the cases where anything except a static executable is
>> uninteresting.
>
> FYI, there is potential in the future for something other than static
> executables -- the FreeBSD Capsicum implementation includes changes
> to the dynamic linker to get its search path as a list of pre-opened dfds
> (in LD_LIBRARY_PATH_FDS) rather than paths.
Which still leaves open the question of how you find the dynamic
linker. Is that also a pre-opened dfd?

Using /dev/fd/$N is also the kind of thing that a shell or a script
interpreter could special-case instead of relying on a filesystem node
to exist.
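
A rough sketch of what such special-casing might look like, assuming the
interpreter is handed a script path of exactly the form /dev/fd/N.
dev_fd_number() is a hypothetical name and is not taken from any existing
shell.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: if path is exactly "/dev/fd/<N>" with N a
 * non-negative decimal integer, return N. Otherwise return -1.
 */
static int dev_fd_number(const char *path)
{
	static const char prefix[] = "/dev/fd/";
	const char *p;
	char *end;
	long n;

	if (strncmp(path, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	p = path + sizeof(prefix) - 1;
	if (*p < '0' || *p > '9')
		return -1;	/* reject empty, signed, or padded numbers */
	errno = 0;
	n = strtol(p, &end, 10);
	if (errno != 0 || *end != '\0' || n > INT_MAX)
		return -1;	/* overflow or trailing path component */
	return (int)n;
}
```

An interpreter that gets a non-negative result could read the script from
that already-open descriptor rather than calling open() on the path. Paths
with a trailing component, such as /dev/fd/N/name, are deliberately not
handled in this sketch.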
Eric