Re: rseq event_counter field

From: Kyle Huey
Date: Mon Oct 30 2017 - 14:08:36 EST


On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> Answering both emails here.
>
> Also, welcome Kyle. Kyle, how badly does rseq's proposed
> event_counter break rr? For that matter, how badly does rseq without
> an event_counter break rr?
>
> (Linus, if you care, I'm proposing that rseq is probably fine for
> 4.15, but that it should be merged without the event_counter. If
> there is a bona fide use for event_counter along with a benchmark
> showing that it makes a meaningful difference, then event_counter can
> be easily added later.)
>
> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer <bmaurer@xxxxxx> wrote:
>> My understanding is that one of the motivations for the event counter was to
>> make it possible to express things like "increment a per-cpu value" in C.
>> Without this abstraction it wasn't possible to ensure that the strict
>> requirements of the rseq ABI were met (contiguous restartable block, single
>> instruction for the final write). If you write your entire restartable
>> sequence in assembly there's no need for the event counter. Comparing all
>> input values as an alternative way of having non-assembly sequences seems
>> like it might have a risk of ABA style race conditions.
>>
>
> My opinion after looking at a few examples is that event_counter
> sounds useful but that it's really more of a crutch that enables
> mediocre userspace code. With event_counter, you can do roughly this:
>
> struct rseq_state rseq = begin();
>
> int cpu = rseq_getcpu();
> Calculate some stuff. Be very very careful to to chase pointers,
> because it's easy to do in a very very fragile way as Mathieu's
> example below does.
>
> commit(&rseq, some pointer you found, some value to write, some
> double-checks to check);
>
> With a real database, this style of programming works well. You
> probably have a progress guarantee, and the database gives you the
> semantics you want. With rseq, you don't get any of this. If the C
> code is too slow or does something (an accidental interaction with
> userfaultfd, perhaps?), then you lose forward progress. You also can
> be a bit slow in the C code, in which case you'll get more conflicts
> detected than actually exist.
>
> Without event_counter, you can use a different abstraction. In your
> percpu data, you have something like:
>
> struct rseq_lock lock;
> struct freelist_entry *head;
>
> Then you can do:
>
> int cpu = rseq_getcpu();
> struct my_data *data = ... cpu ...;
> struct rseq_state rseq = rseq_acquire(data->lock);
>
> Calculate some stuff, being careful not to touch anything that isn't
> protected by the rseq_lock you're using.
>
> commit(&rseq, cpu, some pointer you found, some value to write, some
> double-checks to check);
>
> This variant is indeed a tiny bit slower *if you measure time by
> number if instructions* -- there are some very clever optimizations
> that save some instructions in the event_counter variant. But, in the
> cases where you actually have C code in here, I suspect that the
> difference won't matter in real code. And the counter-less variant
> can have arbitrarily slow C code or even syscalls without being
> spuriously aborted until the very short asm commit sequence starts.
>
> BUT, I think that the event_counter-free version is likely to be
> faster in real code if the library is done well. The version with
> event_counter has to load the event_counter from the rseq cacheline
> right at the beginning, and that value is used, so unless the CPU is
> really quite aggressive avoid speculating through a branch that hasn't
> gotten its dependencies into cache yet, you're likely to stall a bit
> if the per-thread rseq cacheline isn't in L1.
>
> The variant without event_counter can avoid this problem as long as
> the architecture gives a cheap way to find the CPU number. On x86,
> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
> any CPU as long as the kernel supports per-cpu FSBASE. (That's a
> feature that isn't terribly hard to add and would IMO be quite nifty.)
> If this is done, then you still have to do:
>
> __this_thread_rseq->rseq_cs = &the_rseq_cs_here;
>
> but that's a *store* to the rseq cacheline. It'll go in the store
> buffer and, in the fast case, nothing ever depends on it. Hence it's
> unlikely to stall. Of course, there's still a load from the
> rseq_lock, but that shares a cacheline with the data that it protects,
> so it's basically free.
>
> IOW, I think that adding event_counter encourages user code that has a
> weaker progress guarantee, is slower in a very highly optimized
> implementation, and allows users to think less about they're doing to
> their own detriment, for the gain of saving a couple of instructions.
> I could be wrong, but I think it would be nice to see evidence that
> I'm wrong before event_counter becomes enshrined in the ABI. Also, I
> suspect that neat tools like rr will have a much easier time dealing
> with rseq if event_counter isn't involved.
>
>>
>> RDPID is interesting, but it seems like given the current ABI (which
>> requires setting the current restartable range) you'd still need some sort
>> of TLS value. Have any benchmarks of RDPID perf vs TLS?
>
> That's the wrong comparison. For a good userspace implementation
> (which is very awkward without libc's help), storing to the rseq
> pointer is:
>
> leaq the_rseq_cs(%rip), %rax
> movq %rax, %gs:__tls_rseq + offsetof(struct rseq, rseq_cs)
>
> loading the cpu number is:
>
> movq %gs:__tls_rseq + offsetof(struct rseq, cpu), %rax
>
> The latter could be rdpid %rax; movl %eax, %eax (if the ABI gets
> fixed) instead. This avoids reading from the rseq cacheline.
>
>
> On Sat, Oct 28, 2017 at 2:35 AM, Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> ----- On Oct 27, 2017, at 6:00 PM, Andy Lutomirski luto@xxxxxxxxxx wrote:
>>
>>> On Thu, Oct 26, 2017 at 9:54 AM, Mathieu Desnoyers
>>> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> Let me step back a bit, and try to understand the fundamental
>> motivation for not letting user-space detect that it has been
>> preempted.
>>
>> My understanding (please let me know where I am mistaken) is that
>> the motivation for not letting user-space know if preempted is to
>> ensure malware cannot cloak (hide) themselves by changing their
>> behavior when they detect that they are heavily preempted. This
>> is effectively a way to detect things like single-stepping, strace,
>> ltrace, and so on.
>
> That's about a third of my objection. The next third is that it seems
> messy to be to intentionally expose what should be an invisible
> detail, and the third is that I still haven't seen convincing evidence
> that anything genuinely needs an event counter. See below and my
> other email.
>
>>
>> One can also pretty much trivially use rdtsc (x86), tb register
>> (powerpc) or ARM ccnt register (if user access is enabled) to
>> detect whether it runs under observation.
>> (this would be your timing-and-guessing approach)
>>
>
> Mostly irrelevant, but rdtsc is trappable.
>
>> I hope that very *few* users will completely open-code their asm
>> sequences. For instance, if you do the whole list pop in asm, you
>> need to:
>>
>> 1) store current rseq_cs pointer
>> 2) load CPU number from struct rseq,
>> 3) pointer-chase the right per-cpu data that you want to act on,
>> 4) load head
>> 5) compare head against NULL
>> 6) conditional branch
>> 6) load head->next
>> 10) store head->next to head (commit)
>>
>> step 3) is the interesting bit here:
>> It heavily depends on the application per-cpu data layout. It can be
>> a large array of per-cpu integers, or an array of per-cpu structures
>> at a specific offset, or an array of per-cpu struct + offset + pointer
>> to some other structure (dynamically allocated), and so on. I do not
>> want to impose any restriction on the data structure layout of
>> user-space applications. This means the asm-specialized-stuff will
>> have to be specialized for each per-cpu data structure layout, which
>> is a pain.
>
> To my point: I don't think this is a particularly good example. I
> think your examples are buggy (see below), and I would implement this
> quite differently if I were trying to use C (or even asm, perhaps),
> like this:
>
> 1) load CPU number.
> 2) pointer-chase the right per-cpu data, in C if desired.
> 3) store rseq_cs (which pretty much has to be in asm using the current ABI)
> 4) load head, compare against null, load head->next
> 5) commit
>
> I don't see how the event counter is at all useful here.
>
> FWIW, I can easily imagine alternative ABIs in which abort_ip is
> treated as a *call* instead of a jump to enable much better C support.
>
>> Having the event_counter in place allows users to code the
>> pointer-chasing in pure C, which we can expect will help adoption
>> and ensure people use rseq through a utility library header file
>> rather than open-coding everything (with the associated testing
>> problems).
>
> True, but my variant works just as well, might be faster, and doesn't
> need event_counter.
>
>>
>> If we look at the list pop alternative using C for pointer-chasing,
>> with event_counter:
>>
>> <C>
>> 1) 64-bit load of both event_counter and cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
>
> ^^^ Potential segfault here. head might have been freed and unmapped.
>
>> <ASM>
>> 9) store current rseq_cs
>> 10) Load event counter
>> 11) compare current event_counter with event_counter loaded in 1)
>> 12) conditional branch
>> 13) store new value to @loc
>>
>> For a total of 2 store, 4 loads, one or many loads for pointer-chasing,
>> 2 compare, 2 cond. branch. Again, we can expect the load in 3) to
>> stall due to pointer-chasing.
>>
>> And here is the resulting list pop where the algorithm is in C,
>> without sequence counter. Let's make it a simple case
>> where the application is not changing the layout of the data
>> structures being pointer-chased concurrently from a book-keeping
>> algorithm (because that would add extra compare in the asm).
>>
>> <C>
>> 1) 32-bit load of cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
>
> ^^^ ditto: potential segfault here
>
>> <ASM>
>> 7) store current rseq_cs
>> 8) 32-bit load of cpu_id
>> 9) compare cpu_id with prior cpu_id read in 1)
>> 10) conditional branch
>> 11) Load head
>> 12) compare head against head loaded in 3)
>> 13) conditional branch
>> 14) Load head->next
>> 15) compare head->next against head->next loaded in 6) --> this prevents ABA
>> 16) conditional branch
>> 17) store head->next as new head
>>
>>
>> Would it be required to change the kernel-user ABI when switching to
>> RDPID ? Can this co-exist with TLS users ?
>
> RDPID gives you the CPU number without reading memory, and that's
> pretty much all it does. The format in which it returns it sucks,
> unfortunately, but I'm going to try to fix that by the time rseq is
> merged.
>
> --Andy

(+roc)

Thanks for roping me in. I'm not sure I fully understand the proposal
(are there docs somewhere outside of this email thread?) but I don't
think this will be a huge problem for rr because:

1) Since rseq is initiated through a syscall we can always make that
fail and deal with not recording programs that use it. As long as it
doesn't become widely used in glibc or something that's probably
bearable.
2) We ought to be able to emulate it if we have to, since everything
is mediated by the kernel rr might need to be able to see when a
thread migrates from one CPU to another (via a perf event or
something, idk if this functionality already exists) and when a rseq
restart happens (via ptrace or perf events or something).

RDPID sounds annoying. My copy of the SDM doesn't seem to indicate
whether it obeys the RDTSC(P) faulting mechanism. Does anyone know
what microarchitecture introduced that instruction? We could always
mask off the CPUID bit indicating it's supported.

A similar concept on ARM (the load-lock-store-condition pattern of the
LDREX/STREX instructions) doomed our effort to port rr to ARM a few
years back. The key problems there for us were:

1. Hardware interrupts in the critical section trigger a failed store
because the kernel will grab the per-CPU exclusive lock.
2. There's no way to observe when a store fails so as to induce the
same failures during replay, and
3. That pattern used all over the place for every atomic operation on
ARM so blacklisting or emulating it wasn't feasible.

If everything is moderated by the kernel and the kernel exposes enough
information through ptrace/perf events/whatever I think we'll be ok

Thanks,

- Kyle