Re: Rseq registration: Google tcmalloc vs glibc

From: Chris Kennelly
Date: Wed Feb 26 2020 - 12:27:52 EST


On Wed, Feb 26, 2020 at 12:01 PM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> ----- On Feb 25, 2020, at 10:38 PM, Chris Kennelly ckennelly@xxxxxxxxxx wrote:
>
> > On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
> >> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >> >
> >> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google
> >> > joel@xxxxxxxxxxxxxxxxx wrote:
> >> >
> >> > [...]
> >> > >>
> >> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> >> > >> registered.
> >> > >>
> >> > >> - Current protocol in the most recent glibc integration patch set.
> >> > >> - Not supported yet by Linux kernel rseq selftests,
> >> > >> - Not supported yet by tcmalloc,
> >> > >>
> >> > >> Use the per-thread state to figure out whether each thread needs to register
> >> > >> Rseq individually.
> >> > >>
> >> > >> Works for integration between a library which exists for the entire lifetime
> >> > >> of the executable (e.g. glibc) and other libraries. However, it does not
> >> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> >> > >> having a library like glibc handling the registration present.
> >> > >
> >> > > Mathieu, could you share more details about why during dlopen/close
> >> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
> >> > > registered?
> >> >
> >> > Sure,
> >> >
> >> > A library which is only loaded and never closed during the execution of the
> >> > program can let the kernel implicitly unregister rseq at thread exit. For
> >> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
> >> > each thread's __rseq_abi which sits in a library which is going to be
> >> > dlclose'd.
> >>
> >> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
> >> sounds from Chris's reply that tcmalloc already checks
> >> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
> >> seems to already handle things properly - CMIIW.
> >
> > I'll make a note about this, since we can probably benefit from some
> > more comments about the assumptions/invariants the fastpath uses.
>
> I suspect the integration with glibc and with dlopen'd/dlclose'd libraries will not
> behave correctly with the current tcmalloc implementation.
>
> Based on the tcmalloc code-base, InitFastPerCpu is only called from IsFast. As long
> as this is the only expected caller, having IsFast comparing the RseqCpuId detects
> whether glibc (or some other library) has already registered rseq for the current
> thread.
>
> However, if the application chooses to invoke InitFastPerCpu() directly, things become
> unexpected, because it invokes:
>
> absl::base_internal::LowLevelCallOnce(&init_per_cpu_once, InitPerCpu);
>
> which AFAIU invokes InitPerCpu once per execution of the current program. The
> relevant code is:
>
> static bool InitThreadPerCpu() {
>   if (__rseq_refcount++ > 0) {
>     return true;
>   }
>
>   auto ret = syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0,
>                      PERCPU_RSEQ_SIGNATURE);
>   if (ret == 0) {
>     return true;
>   } else {
>     __rseq_refcount--;
>   }
>
>   return false;
> }
>
> static void InitPerCpu() {
>   // Based on the results of successfully initializing the first thread, mark
>   // init_status to initialize all subsequent threads.
>   if (InitThreadPerCpu()) {
>     init_status = kFastMode;
>   }
> }
>
> In a scenario where glibc has already registered Rseq, the __rseq_refcount will
> be incremented, the __NR_rseq syscall will fail with -1, errno=EBUSY, so the refcount
> will be immediately decremented and it will return false. Therefore, "init_status" will
> never be set to kFastMode, leaving it in kSlowMode for the entire lifetime of this
> program. That being said, even though this state can come as a surprise, it seems to
> be entirely bypassed by the fast-paths IsFast() and IsFastNoInit(), so maybe it won't
> have any observable side-effects other than leaving init_status in a state that does not
> match reality.

I agree that this could potentially violate invariants, but
InitFastPerCpu is not intended to be called by the application.

> In the other use-case where tcmalloc co-exists with a dlopened/dlclosed library, but glibc
> does not provide Rseq registration, we run into issues as well if the dlopened library
> registers rseq first for a given thread. The IsFastNoInit() expects that if Rseq has been
> observed as registered in the past for a thread, it stays registered. However, if a
> dlclosed library unregisters Rseq, we need to be prepared to re-register it. So either
> tcmalloc needs to express its use of Rseq by incrementing __rseq_refcount even when Rseq
> is registered (this would hurt the fast-path however, and I would hate to have to do this),
> or tcmalloc needs to be able to handle the fact that Rseq may be unregistered by a dlclosed
> library which was the actual owner of the Rseq registration.

We have a bit of an opportunity to figure out whether this is the
first time--from TCMalloc's perspective--a thread is doing per-CPU and
bump the __rseq_refcount accordingly. I think this could be done off of
the fast path.
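
As a rough illustration of that idea (a sketch only, not TCMalloc's
actual code): every name below (rseq_refcount, rseq_cpu_id,
counted_this_thread, FakeRegisterRseq, CountThreadOnce) is a
hypothetical stand-in for __rseq_refcount, __rseq_abi.cpu_id and the
rseq syscall. It also assumes that a thread's first per-CPU operation
always goes through some slow path where the hook can run.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical stand-in for the shared per-thread __rseq_refcount.
thread_local uint32_t rseq_refcount = 0;
// Hypothetical stand-in for __rseq_abi.cpu_id; negative means unregistered.
thread_local int32_t rseq_cpu_id = -1;
// Whether this thread has already counted itself as an rseq user.
thread_local bool counted_this_thread = false;

// Stand-in for the rseq registration syscall; always succeeds here.
inline bool FakeRegisterRseq() {
  rseq_cpu_id = 0;
  return true;
}

// Slow-path hook: take one reference per thread, even when some other
// library has already registered rseq for this thread.
inline bool CountThreadOnce() {
  if (counted_this_thread) {
    return rseq_cpu_id >= 0;
  }
  counted_this_thread = true;
  ++rseq_refcount;
  if (rseq_cpu_id < 0 && !FakeRegisterRseq()) {
    // Registration failed: drop the reference we just took.
    --rseq_refcount;
    counted_this_thread = false;
    return false;
  }
  return true;
}

// Fast path: reads only the cpu_id field and never touches the refcount.
inline bool IsFastNoInit() { return rseq_cpu_id >= 0; }

}  // namespace sketch
```

With a reference held per thread, a dlclose'd library that drops its
own reference would not bring the count to zero while this thread is
still using per-CPU mode.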

Chris