Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC

From: Andy Lutomirski
Date: Mon Jun 02 2014 - 21:15:54 EST


On Mon, Jun 2, 2014 at 6:09 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> On Mon, Jun 2, 2014 at 5:29 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>>>>> installation time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>>>>> being separate operations? If not, I think this would be easier to
>>>>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>>>>
>>>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>>>
>>>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>>>
>>>>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>>>>> and, if it succeeds, then you're done. It shouldn't fail unless out
>>>>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>>>>
>>>>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>>>>> filter being attached to it.
>>>>>>>>>>
>>>>>>>>>> I must admit this strikes me as odd. What's the point of having a
>>>>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>>>>
>>>>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>>>>
>>>>>>>>>> In any case, I must have missed the ability for TSYNC to block. Hmm.
>>>>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>>>>
>>>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>>>>> 3 could do:
>>>>>>>>>
>>>>>>>>> while (TSYNC-fails)
>>>>>>>>> wait-on-or-kill-unexpected-thread
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ok.
>>>>>>>>
>>>>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>>>>> harmlessly on error. FWIW, TSYNC is probably equivalent to adding an
>>>>>>>> always-accept filter across all threads, although no one should really
>>>>>>>> do the latter for efficiency reasons.
>>>>>>>
>>>>>>> Given the complexity of the locking, "fail" means "I applied the
>>>>>>> change to all threads except for at least this one: *error code*",
>>>>>>> which means looping with the "add-a-filter" method means all the other
>>>>>>> threads keep getting filters added until there is full success. I
>>>>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>>>
>>>>>> Ugh, right.
>>>>>>
>>>>>>>
>>>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>>>>> used in can be very weird, I really want to retain the retry ability.
>>>>>>
>>>>>> OK. So what's wrong with the other approach in which the
>>>>>> add-to-all-threads option always succeeds? IOW, rather than requiring
>>>>>> that all threads have the caller's filter as an ancestor, just add the
>>>>>> requested filter to all threads. As an optimization, if the targetted
>>>>>> thread has the current thread's filter as its filter, too, then the
>>>>>> targetted thread can stay synchronized.
>>>>>>
>>>>>> That way the add filter call really is atomic.
>>>>>>
>>>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>>>>> the userspace interface could be kept as simple as possible. The fact
>>>>>> that there's a filter hierarchy is sort of an implementation detail, I
>>>>>> think.
>>>>>
>>>>> I'm totally on board with making this as simple as possible. :) The
>>>>> corner cases are kind of horrible, though, but I think this is already
>>>>> as simple as it can get.
>>>>>
>>>>> Externally, without the ancestry check, we can run the risk of have
>>>>> unstable behavior out of a filter change. Imagine the case of a race
>>>>> where a thread is adding a filter (via prctl), and the other thread
>>>>> attempts to TSYNC a filter that blocks prctl.
>>>>>
>>>>> In the "always take the new filter" case, sometimes we get two filters
>>>>> (original and TSYNCed) on the first thread, and sometimes it blows up
>>>>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>>>>> for the TSYNC caller to detect who won the race.
>>>>>
>>>>> With the patch series as-is, losing the race results in a TSYNC
>>>>> failure (ancestry doesn't match). This is immediately detectable and
>>>>> the caller can the decide how to handle the situation directly.
>>>>>
>>>>> Regardless, I don't think the filter hierarchy is an implementation
>>>>> detail -- complex filters take advantage of the hierarchies. And
>>>>> keeping this hierarchy stable means the filters are simpler to
>>>>> validate for correctness, etc.
>>>>
>>>> Hmm.
>>>>
>>>> Another issue: unless I'm not looking at the right version of this, I
>>>> don't think that while_each_thread does what you think it does.
>>>
>>> I believe it to walk current's thread_group list. It starts at current
>>> and walks until it sees current again. While holding task_list for
>>> writing, no new threads can appear on the list. Any cloned threads not
>>> yet in the thread_group list will be added once task_lock is held by
>>> the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
>>> pid);").
>>
>> Oh. I see. I think the intended use is to pair do_each_thread with
>> while_each_thread, which has a rather different effect.
>> for_each_thread might be clearer.
>
> I can swap it out. It doesn't seem to be any different.
>
>>>> I'm generally in favor of making the kernel interfaces easy to use,
>>>> even at the cost of a bit of implementation complexity. Would this be
>>>> simpler if you hijacked siglock to protect seccomp state? Then I
>>>> think you could just make everything atomic.
>>>
>>> The sighand structure isn't unique to the task nor the thread group
>>> (CLONE_SIGHAND will leave the same sighand attached to a new thread),
>>> so it can't be used sanely here AIUI. The threadgroup_lock stuff could
>>> be used here, but I felt like it was too heavy a hammer for
>>> per-add-filter actions.
>>
>> I'm going by this comment:
>>
>> /*
>> * NOTE! "signal_struct" does not have its own
>> * locking, because a shared signal_struct always
>> * implies a shared sighand_struct, so locking
>> * sighand_struct is always a proper superset of
>> * the locking of signal_struct.
>> */
>> struct signal_struct {
>>
>> copy_process protects addition to signal->thread_head using siglock,
>> which seems like a suitable lock for you to use. And it's there
>> regardless of configuration.
>
> Ah, I see now that CLONE_THREAD must include CLONE_SIGHAND, and
> CLONE_SIGHAND must include CLONE_VM. So, we can depend on this being a
> single thread-group-wide lock! Excellent.
>
>>> The atomicity you're after is just avoiding the "I failed but some
>>> threads may have changed" side-effect, IIUC. As in, you want "I failed
>>> and no threads have changed". For that, I think we need to use
>>> threadgroup_lock. However, I remain unconvinced that this is the right
>>> way to go since the failure corner case needs to be handled by the
>>> tsync caller in one of two ways regardless of either perfect "failed,
>>> no threads changed" or existing series "failed, some threads changed":
>>> retry until success or kill entire process.
>>
>>
>> With the current patches, I think you need:
>>
>> add filter to this thread;
>> while (trying to sync fails) {
>> grumble;
>> sleep a bit;
>> }
>>
>> If the "failed, some threads changed" case goes away, then I think it can be:
>>
>> while (trying to add a new filter to all threads fails) {
>> grumble;
>> sleep a bit;
>> }
>
> You're still looking to get rid of the stand alone tsync action. It
> does seem unneeded here if the filter-add can fail without
> side-effects. A new prctl or syscall is still needed to handle this,
> regardless.
>
>> which is nicer. It could be fancier:
>>
>> force add new filter to all threads; (i.e. no loop)
>>
>> where force-add does, roughly:
>>
>> acquire siglock
>> old = current's filter
>> add the filter to this thread
>> new = current's new filter
>> for each other thread {
>> if (thread's filter == old)
>> thread's filter = new (w/ appropriate refcounting)
>> else
>> add filter to thread
>> }
>
> The trick is the "w/ appropriate refcounting" part. :) Also, "add
> filter to thread" means performing memory allocation, which we can't
> do while holding the siglock spinlock IIUC. I remain uncomfortable
> with the composition idea, because it's not really a "sync", it's a
> "make sure *this* filter is attached" instead of "make sure everyone
> is using my entire current filter chain".
>
>> IOW, if you are already adding filters to library threads, I think
>> it'll be more convenient and robust to to add it everywhere than to
>> exempt threads that have added their own filter. If they've added
>> their own filter, we won't override it; we'll just compose. It'll be
>> the same, modulo filter order, as if the library thread adds its
>> filter *after* we add ours.
>
> This is why I don't like it: it's not sync. It's force-apply. The goal
> was to be able to build up a filter tree with potentially multiple
> calls, and in the future say "okay, now, make sure any threads that
> got started in the middle of that are all moved forward to the same
> filter chain position I'm on."
>
> The force-apply route means all the filters must be force-loaded, and
> doesn't handle, say, a library that doesn't start a thread but does
> apply a filter running, and then we apply another with force-apply,
> and suddenly all the other threads are missing the filter added by the
> library.

This is racy, though: what if you sync before the library adds its filter?

>
> I think sync should be sync -- it's the desired functionality that
> Chrome needs. Force-apply is different, and could be implemented at a
> later time if someone wants it, but for now, I'd like a literal
> thread-sync.

>
>> Yes, I realize there are oddities here if one of the filters blocks
>> seccomp prctls, but I think there are oddities here regardless. I
>> certainly see the value in the non-force variant.
>
> So, it sounds like if I can build the side-effect-less failure case,
> you'll be happy? I really do not want to do the force-apply. Can you
> live without that?

Sure. I can always add it if I find myself wanting it. Are you
planning on changing the locking and making sync (or
add-to-all-threads-not-forced) atomic?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/