Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC

From: Kees Cook
Date: Mon Jun 02 2014 - 20:14:14 EST


On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>> point.
>>>>>>>>>>>>
>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>> installation time.
>>>>>>>>>>>>
>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>
>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>> being separate operations? If not, I think this would be easier to
>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>
>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>
>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>
>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>
>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>> and, if it succeeds, then you're done. It shouldn't fail unless out
>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>
>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>> filter being attached to it.
>>>>>>>
>>>>>>> I must admit this strikes me as odd. What's the point of having a
>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>
>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>
>>>>>>> In any case, I must have missed the ability for TSYNC to block. Hmm.
>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>
>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>> 3 could do:
>>>>>>
>>>>>> while (TSYNC-fails)
>>>>>> wait-on-or-kill-unexpected-thread
>>>>>>
>>>>>
>>>>> Ok.
>>>>>
>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>> harmlessly on error. FWIW, TSYNC is probably equivalent to adding an
>>>>> always-accept filter across all threads, although no one should really
>>>>> do the latter for efficiency reasons.
>>>>
>>>> Given the complexity of the locking, "fail" means "I applied the
>>>> change to all threads except for at least this one: *error code*",
>>>> which means looping with the "add-a-filter" method means all the other
>>>> threads keep getting filters added until there is full success. I
>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>
>>> Ugh, right.
>>>
>>>>
>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>> used in can be very weird, I really want to retain the retry ability.
>>>
>>> OK. So what's wrong with the other approach in which the
>>> add-to-all-threads option always succeeds? IOW, rather than requiring
>>> that all threads have the caller's filter as an ancestor, just add the
>>> requested filter to all threads. As an optimization, if the targetted
>>> thread has the current thread's filter as its filter, too, then the
>>> targetted thread can stay synchronized.
>>>
>>> That way the add filter call really is atomic.
>>>
>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>> the userspace interface could be kept as simple as possible. The fact
>>> that there's a filter hierarchy is sort of an implementation detail, I
>>> think.
>>
>> I'm totally on board with making this as simple as possible. :) The
>> corner cases are kind of horrible, though, but I think this is already
>> as simple as it can get.
>>
>> Externally, without the ancestry check, we can run the risk of have
>> unstable behavior out of a filter change. Imagine the case of a race
>> where a thread is adding a filter (via prctl), and the other thread
>> attempts to TSYNC a filter that blocks prctl.
>>
>> In the "always take the new filter" case, sometimes we get two filters
>> (original and TSYNCed) on the first thread, and sometimes it blows up
>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>> for the TSYNC caller to detect who won the race.
>>
>> With the patch series as-is, losing the race results in a TSYNC
>> failure (ancestry doesn't match). This is immediately detectable and
>> the caller can the decide how to handle the situation directly.
>>
>> Regardless, I don't think the filter hierarchy is an implementation
>> detail -- complex filters take advantage of the hierarchies. And
>> keeping this hierarchy stable means the filters are simpler to
>> validate for correctness, etc.
>
> Hmm.
>
> Another issue: unless I'm not looking at the right version of this, I
> don't think that while_each_thread does what you think it does.

I believe it to walk current's thread_group list. It starts at current
and walks until it sees current again. While holding task_list for
writing, no new threads can appear on the list. Any cloned threads not
yet in the thread_group list will be added once task_lock is held by
the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
pid);").

> I'm generally in favor of making the kernel interfaces easy to use,
> even at the cost of a bit of implementation complexity. Would this be
> simpler if you hijacked siglock to protect seccomp state? Then I
> think you could just make everything atomic.

The sighand structure isn't unique to the task nor the thread group
(CLONE_SIGHAND will leave the same sighand attached to a new thread),
so it can't be used sanely here AIUI. The threadgroup_lock stuff could
be used here, but I felt like it was too heavy a hammer for
per-add-filter actions.

The atomicity you're after is just avoiding the "I failed but some
threads may have changed" side-effect, IIUC. As in, you want "I failed
and no threads have changed". For that, I think we need to use
threadgroup_lock. However, I remain unconvinced that this is the right
way to go since the failure corner case needs to be handled by the
tsync caller in one of two ways regardless of either perfect "failed,
no threads changed" or existing series "failed, some threads changed":
retry until success or kill entire process.

-Kees

--
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/