This change adds a new seccomp mode based on the work by
agl@xxxxxxxxxxxxx. This mode comes with a bitmask of NR_syscalls size and
an optional linked list of seccomp_filter objects. When in mode 2, all
system calls are first checked against the bitmask to determine if they
are allowed or denied. If allowed, the list of filters is checked for
the given syscall number. If all filter predicates for the system call
match or the system call was allowed without restriction, the process
continues. Otherwise, it is killed and a KERN_INFO notification is
posted.
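The mode 2 check described above can be modeled in user space roughly as
follows. This is a minimal sketch, not the kernel code from this patch:
the model_* names, the MODEL_NR_SYSCALLS value, the fd_is_stdout example
and the use of function pointers as predicates (the patch uses ftrace
filter expressions instead) are all assumptions made for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SYSCALLS 512 /* assumed stand-in for NR_syscalls */
#define MODEL_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_WORDS \
	((MODEL_NR_SYSCALLS + MODEL_BITS_PER_WORD - 1) / MODEL_BITS_PER_WORD)

/* Hypothetical predicate: returns true when the arguments match. */
typedef bool (*model_pred_fn)(const long args[6]);

/* One entry in the optional linked list of filters. */
struct model_filter {
	int nr;                    /* syscall number this filter applies to */
	model_pred_fn pred;        /* predicate over register-stored args */
	struct model_filter *next;
};

struct model_seccomp {
	unsigned long allowed[MODEL_WORDS]; /* one bit per syscall number */
	struct model_filter *filters;       /* may be NULL */
};

/* Example predicate: the first argument must equal 1. */
static bool fd_is_stdout(const long args[6])
{
	return args[0] == 1;
}

/* Mark a syscall number as allowed in the bitmask. */
static void model_allow(struct model_seccomp *s, int nr)
{
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return;
	s->allowed[nr / MODEL_BITS_PER_WORD] |= 1UL << (nr % MODEL_BITS_PER_WORD);
}

/*
 * Returns true if the call may continue. A false return corresponds to
 * the case where the real mode would kill the process.
 */
static bool model_check(const struct model_seccomp *s, int nr,
			const long args[6])
{
	/* Step 1: the bitmask decides allowed versus denied outright. */
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return false;
	if (!(s->allowed[nr / MODEL_BITS_PER_WORD] &
	      (1UL << (nr % MODEL_BITS_PER_WORD))))
		return false;

	/* Step 2: every filter attached to this syscall must match. */
	for (const struct model_filter *f = s->filters; f; f = f->next) {
		if (f->nr == nr && !f->pred(args))
			return false;
	}

	/* No filters for nr means "allowed without restriction". */
	return true;
}
```

In this model, a syscall whose bit is set and which has no entry in the
filter list always passes, while one with an attached filter passes only
when every predicate for it matches.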
The filter language itself is provided by the ftrace filter engine.
Related patches tweak the perf filter trace and free calls so that they
can be shared. Filters inherit their understanding of types and arguments
for each system call from the CONFIG_FTRACE_SYSCALLS subsystem, which
predefines this information in the enter_event (and exit_event)
structures associated with each syscall_metadata entry.
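To illustrate how per-syscall metadata lets a filter refer to a system
call and its arguments by name, here is a small user-space sketch. It
does not use the kernel's syscall_metadata definition: the
model_syscall_meta layout, the table contents and the syscall numbers in
it are assumptions chosen only for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-syscall metadata. */
struct model_syscall_meta {
	const char *name;     /* syscall name as a filter would refer to it */
	int nr;               /* illustrative number, not a real ABI value */
	int nb_args;          /* number of arguments */
	const char *types[6]; /* argument types, as strings */
	const char *args[6];  /* argument names, as strings */
};

static const struct model_syscall_meta model_table[] = {
	{ "sys_read", 0, 3,
	  { "unsigned int", "char *", "size_t" },
	  { "fd", "buf", "count" } },
	{ "sys_write", 1, 3,
	  { "unsigned int", "const char *", "size_t" },
	  { "fd", "buf", "count" } },
};

/* Resolve a syscall by name, so callers need not know its number. */
static const struct model_syscall_meta *model_find(const char *name)
{
	size_t n = sizeof(model_table) / sizeof(model_table[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(model_table[i].name, name) == 0)
			return &model_table[i];
	}
	return NULL;
}

/* Map an argument name to its position, or -1 if it is unknown. */
static int model_arg_index(const struct model_syscall_meta *m,
			   const char *arg)
{
	for (int i = 0; i < m->nb_args; i++) {
		if (strcmp(m->args[i], arg) == 0)
			return i;
	}
	return -1;
}

/*
 * Pointer arguments refer to user memory that can change after the
 * check, so a filter should compare only by-value arguments.
 */
static bool model_arg_is_pointer(const struct model_syscall_meta *m, int idx)
{
	return idx >= 0 && idx < m->nb_args &&
	       strchr(m->types[idx], '*') != NULL;
}
```

With a table like this, a predicate such as "fd == 1" on sys_write can
be resolved to a comparison against argument 0, while "buf" would be
flagged as a pointer and left alone.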
The result is that a process may reduce its available interfaces to
the kernel through prctl() without knowing the appropriate system call
number a priori and with the flexibility of filtering based on
register-stored arguments. (String checks suffer from TOCTOU issues and
should be left to LSMs to provide policy for! Don't get greedy :)