Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

From: Kees Cook
Date: Tue Jun 16 2020 - 14:49:45 EST


On Tue, Jun 16, 2020 at 08:36:28PM +0200, Jann Horn wrote:
> On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> > On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
> > > Wouldn't it be simpler to use a function that can run a subset of
> > > seccomp cBPF and bails out on anything that indicates that a syscall's
> > > handling is complex or on instructions it doesn't understand? For
> > > syscalls that have a fixed policy, a typical seccomp filter doesn't
> > > even use any of the BPF_ALU ops, the scratch space, or the X register;
> > > it just uses something like the following set of operations, which is
> > > easy to emulate without much code:
> > >
> > > BPF_LD | BPF_W | BPF_ABS
> > > BPF_JMP | BPF_JEQ | BPF_K
> > > BPF_JMP | BPF_JGE | BPF_K
> > > BPF_JMP | BPF_JGT | BPF_K
> > > BPF_JMP | BPF_JA
> > > BPF_RET | BPF_K
> >
> > Initially, I started down this path. It needed a bit of plumbing into
> > BPF to better control the lifetime of the cBPF "saved original filter"
> > (normally used by CHECKPOINT_RESTORE uses)
>
> I don't think you need that? When a filter is added, you can compute
> the results of the added individual filter, and then merge the state.

That's what I thought too, but unfortunately not (unless I missed
something) -- the seccomp verifier is run as a callback from the BPF
internals, so seccomp only see what the user sends (which is unverified)
and the final eBPF filter. There isn't state I can attach during the
callback, so I opted to just do the same thing as CHECKPOINT_RESTORE,
but to then explicitly free the cBPF after bitmap generation.

> > and then I needed to keep
> > making exceptions (same list you have: ALU, X register, scratch, etc)
> > in the name of avoiding too much complexity in the emulator. I decided
> > I'd rather reuse the existing infrastructure to actually execute the
> > filter (no cBPF copy needed to be saved, no separate code, and full
> > instruction coverage).
>
> If you really think that this bit of emulation is so bad, you could
> also make a copy of the BPF filter in which you replace all load
> instructions from syscall arguments with "return NON_CONSTANT_RESULT",
> and then run that through the normal BPF infrastructure.
>
> > > Something like (completely untested):
> [...]
> > I didn't actually finish going down the emulator path (I stopped right
> > around the time I verified that libseccomp does use BPF_ALU -- though
> > only BPF_AND), so I didn't actually evaluate the filter contents for other
> > filter builders (i.e. Chrome).
> >
> > But, if BPF_ALU | BPF_AND were added to your code above, it would cover
> > everything libseccomp generates (which covers a lot of the seccomp
> > filters, e.g. systemd, docker). I just felt funny about an "incomplete"
> > emulator.
> >
> > Though now you've got me looking. It seems this is the core
> > of Chrome's BPF instruction generation:
> > https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
> > It also uses ALU|AND, but adds JMP|JSET.
> >
> > So... that's only 2 more instructions to cover what I think are likely
> > the two largest seccomp instruction generators.
> >
> > > That way, you won't need any of this complicated architecture-specific stuff.
> >
> > There are two arch-specific needs, and using a cBPF-subset emulator
> > just gets rid of the local TLB flush. The other part is distinguishing
> > the archs. Neither requirement is onerous (TLB flush usually just
> > needs little more than an extern, arch is already documented in the
> > per-arch syscall_get_arch()).
>
> But it's also somewhat layer-breaking and reliant on very specific
> assumptions. Normal kernel code doesn't mess around with page table
> magic, outside of very specific low-level things. And your method
> would break if the fixed-value members were not all packed together at
> the start of the structure.

Right -- that was lucky. I suspect the emulation route will win out
here.

> And from a hardening perspective: The more code we add that fiddles
> around with PTEs directly, rather than going through higher-level
> abstractions, the higher the chance that something gets horribly
> screwed up. For example, this bit from your patch looks *really*
> suspect:
>
> + preempt_disable();
> + set_pte_at(&init_mm, vaddr, ptep,
> pte_mkold(*(READ_ONCE(ptep))));
> + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> + preempt_enable();
>
> First off, that set_pte_at() is just a memory write; I don't see why
> you put it inside a preempt_disable() region.
> But more importantly, sticking a local TLB flush inside a
> preempt_disable() region with nothing else in there looks really
> shady. How is that supposed to work? If we migrate from CPU0 to CPU1
> directly before this region, and then from CPU1 back to CPU0 directly
> afterwards, the local TLB flush will have no effect.

Yeah, true, that's another good reason not to do this.

--
Kees Cook