Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

From: Will Drewry
Date: Fri May 06 2011 - 21:58:23 EST


On Fri, May 6, 2011 at 4:53 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> On Thu, 2011-05-05 at 02:21 -0700, Will Drewry wrote:
>
>> In particular, if the userspace code wants to stage some filters and
>> apply them all at once, when ready, I'm not sure that it makes sense
>> to me to put that complexity in the kernel itself.  For instance,
>> Eric's second sample showed a call that took an array of ints and
>> coalesced them into "fd == %d || ...".  That simple example shows that
>> we could easily get by with a pretty minimal kernel-supported
>> interface as long as the richer behavior could live userspace side --
>> even if just in a simple helper library.  It'd be pretty easy to
>> implement a userspace library that exposed add_filter(syscall_nr,
>> filter) and apply_filters() such that it could manage building the
>> final filter string for a given syscall and pushing it to prctl on
>> apply.
>
> I'm fine with a single kernel call and the "temporary filter" be done in
> userspace. Making the kernel code less complex is better :)
>
>>
>> I think that could also help simplify the primitives.  For instance,
>> if any separate SET called on a system call resulting in an &&
>> operation, then the behavior could be consistent prior to enforcement
>> of the filtering and after.  E.g.,
>>   SET, __NR_read, "fd == 1"
>>   SET, __NR_read, "len < 4097"
>> would result in an evaluated "fd == 1 && len < 4097".  It would do so
>> after a single APPLY call too:
>>   SET, __NR_read, "1"
>>   APPLY
>>   SET, __NR_read, "fd == 1"
>>   SET, __NR_read, "len < 4097"
>> Results in: "1 && fd == 1 && len < 4097", and SET, nr, "0" would
>> nullify the syscall filter in total.
>
> Only that that was not applied? We can't let tasks nullify their
> restrictions once they have been applied. This keeps the kernel code
> simpler.

Ah - so I really need to be more explicit when discussing these
things! In the "simplification" effort, I was thinking that any syscall
with no entry has an implicit "0" rule. So if you nullify it, it becomes
a complete block, and since you can't OR, you can't add permissions back.
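
To make that concrete, here is a toy userspace model of the semantics I
have in mind. It is only a sketch: struct filter, filter_and(),
filter_allows() and the pred_* helpers are made-up names for
illustration, not anything in the patch. A filter with no terms stands
for "no entry" and denies; every added term is ANDed in, so a filter can
only get narrower.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TERMS 8                 /* assumption: small fixed limit for the sketch */

struct read_args { int fd; size_t len; };          /* hypothetical view of read() args */
typedef int (*pred_fn)(const struct read_args *);  /* one ANDed term */

struct filter {
	pred_fn terms[MAX_TERMS];
	size_t nterms;              /* 0 terms == no entry == implicit "0" rule */
};

/* AND one more term into the filter; there is deliberately no OR. */
int filter_and(struct filter *f, pred_fn p)
{
	assert(f != NULL && p != NULL);
	if (f->nterms == MAX_TERMS)
		return -1;
	f->terms[f->nterms++] = p;
	return 0;
}

/* Allowed only if an entry exists and every term holds. */
int filter_allows(const struct filter *f, const struct read_args *a)
{
	size_t i;

	if (f->nterms == 0)
		return 0;
	for (i = 0; i < f->nterms; i++)
		if (!f->terms[i](a))
			return 0;
	return 1;
}

int pred_fd_is_1(const struct read_args *a)     { return a->fd == 1; }
int pred_len_lt_4097(const struct read_args *a) { return a->len < 4097; }
int pred_false(const struct read_args *a)       { (void)a; return 0; }  /* the "0" filter */
```

With only filter_and() available, ANDing in pred_false is a one-way
door: nothing added afterwards can make filter_allows() return 1 again.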

>>   It seems like that would be
>> enough to build the SET-SET-...-APPLY, SET-SET-...-SET-APPLY logic
>> into a userspace library so that all temporary unapplied state doesn't
>> have to be explicitly managed by the kernel.
>
> Thus, the SETs are done in the userspace library that does not need to
> interact with the kernel (besides perhaps allocating memory). Then the
> apply would send all the filters to the kernel which would restrict the
> task (or the task on exec) further.

Exactly. Smaller patch and less state per-filter entry (I hope!).
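
A rough sketch of what such a helper library could look like, assuming
the add_filter()/apply_filters() names from earlier in the thread. The
fixed table sizes, staged_filter(), print_filter() and the install
callback are all assumptions made for illustration; the callback stands
in for whatever the final kernel call turns out to be. Each staged term
is wrapped in parentheses so that a term containing "||" still combines
correctly under "&&".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SYSCALLS 512            /* assumption: big enough for the sketch */
#define MAX_FILTER   256            /* assumption: max combined filter length */

/* Staged, not-yet-applied filter text per syscall; "" means nothing staged. */
static char staged[MAX_SYSCALLS][MAX_FILTER];

/* Stage a filter; repeated calls for one syscall are joined with "&&".
 * Returns 0 on success, -1 if the combined string would not fit. */
int add_filter(int nr, const char *filter)
{
	char tmp[MAX_FILTER];
	int n;

	assert(nr >= 0 && nr < MAX_SYSCALLS && filter != NULL);
	if (staged[nr][0] == '\0')
		n = snprintf(tmp, sizeof(tmp), "(%s)", filter);
	else
		n = snprintf(tmp, sizeof(tmp), "%s && (%s)", staged[nr], filter);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;
	memcpy(staged[nr], tmp, (size_t)n + 1);
	return 0;
}

/* Read back what is currently staged for a syscall (empty string if nothing). */
const char *staged_filter(int nr)
{
	assert(nr >= 0 && nr < MAX_SYSCALLS);
	return staged[nr];
}

/* Push every staged filter through install() and clear it from staging.
 * Returns the number of filters pushed, or -1 on the first failure. */
int apply_filters(int (*install)(int nr, const char *filter))
{
	int nr, count = 0;

	for (nr = 0; nr < MAX_SYSCALLS; nr++) {
		if (staged[nr][0] == '\0')
			continue;
		if (install(nr, staged[nr]) != 0)
			return -1;
		staged[nr][0] = '\0';
		count++;
	}
	return count;
}

/* Hypothetical stand-in for the kernel call: just print what would be sent. */
int print_filter(int nr, const char *filter)
{
	printf("would install nr=%d filter=%s\n", nr, filter);
	return 0;
}
```

Calling add_filter(nr, "fd == 1") and then add_filter(nr, "len < 4097")
stages "(fd == 1) && (len < 4097)" for that syscall, and nothing reaches
the kernel until apply_filters() runs.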


>>
>> While I completely agree with the comment around ease-of-use as being
>> key to security, I also find that the more the state diagram explodes,
>> the harder it is to feel confident that a solution is actually secure.
>>  To try to achieve both objectives, I'd like to limit the kernel
>> interface to the bare minimum of primitives and build any API
>> fanciness into userspace.
>
> Fair enough.
>
>>
>> Does it seem that the tradeoff isn't worth it, or are there some
>> specific behaviors that aren't addressed using that model?
>>
>> While writing that, another option occurred to me that touches on the
>> other proposals but makes the behaviors much more explicit.
>> A prctl prototype could be provided:
>>  prctl(<SET|GET>, <AND|OR>, <syscall_nr>, <filter string>)
>> e.g.,
>>  prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_OR, __NR_read, "fd == 2");
>>
>> The explicit prctl argument list would allow the filter strings to be
>> self-referential and allow the userspace app to decide what behaviors
>> are allowed and when. If we followed that route, all implicit filters
>> would be "0" and the initial call to get things started might be:
>>    #define SET 33
>>    #define OR 0
>>    #define AND 1
>>    SET, OR, __NR_prctl, "option == 33 && (arg1 == 0 || arg1 == 1)"
>>    prctl(PR_SET_SECCOMP, 2);
>>
>> So now the "locked down" binary can call prctl to set an OR or AND
>> filter for any syscall.  A subsequent call could change that:
>>   SET, OR, __NR_read, "fd == 2"  /* => "0 || fd == 2" */
>>   SET, AND, __NR_prctl, "(arg2 != 63 || arg1 != 0)"  /* __NR_read == 63 */
>>
>> This would OR in a __NR_read filter, then disallow a future call to
>> prctl to OR in more NR_read filters, but for other syscalls ANDing and
>> ORing is still possible until you pass in something like:
>>
>>   SET, AND, __NR_prctl, "arg1 == 1"
>>
>> which would lock down all future prctl calls to only ANDing filters
>> in.  (The numbers in the examples could then be properly managed in a
>> userspace library to ensure platform correctness.)
>
> I don't know about this. It seems to be starting to get too complex, and
> thus error prone. Is there any reason we should allow an OR to the task?
> Why would we want to restrict a task where the task could easily
> unrestrict itself?

No idea! I can't think of any good examples where you'd want to do
it, just contrived ones. In general, I think the above approach would
rarely be used since I expect that something like 80% of the places
where this will be used will just be one-time, upfront filter installs
without any surface reduction after the fact.

That said, if there's no reason to support OR after the fact, then the
interface can just _only_ support &&s and leave the installation to
userspace. It might make the multiple-fd-ORing case less fun in
userspace, but it should work for most cases I think.
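
For the multiple-fd case, userspace would have to build the whole OR
expression up front and install it as a single filter. Something along
these lines would do it; fds_to_filter() is a made-up helper name and
the "fd == %d" syntax simply follows the examples in this thread, so
treat it as a sketch rather than a proposed API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "fd == a || fd == b || ..." into buf.
 * Returns the string length, or -1 if there are no fds or buf is too small. */
int fds_to_filter(const int *fds, size_t nfds, char *buf, size_t buflen)
{
	size_t used = 0, i;

	assert(fds != NULL && buf != NULL);
	if (nfds == 0 || buflen == 0)
		return -1;
	buf[0] = '\0';
	for (i = 0; i < nfds; i++) {
		/* Prefix every term except the first with " || ". */
		int n = snprintf(buf + used, buflen - used, "%sfd == %d",
				 i ? " || " : "", fds[i]);
		if (n < 0 || (size_t)n >= buflen - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

The resulting string is one filter from the kernel's point of view, so
it still fits an &&-only interface: later additions can narrow it but
never widen it.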

>>
>> While this would reduce the primitives a bit further, I'm not sure if
>> this would be the right approach either, but it would open the door to
>> pushing even more down to userspace very explicitly and further
>> removing magic policy logic from the kernel-side.  Is this vaguely
>> interesting or just another layer of confusing-ness?
>
> I'm confused, thus I must have hit that layer ;)

Sounds like it. I'm always a sucker for self-referential mechanisms.
I've been travelling a bit recently so my code output has been a bit
low, but I'll pull together the most minimal approach that I think
we've been iterating toward and hopefully post something in the not
too distant future.

thanks!