Re: Rseq registration: Google tcmalloc vs glibc
From: Mathieu Desnoyers
Date: Wed Feb 26 2020 - 13:56:37 EST
----- On Feb 26, 2020, at 12:27 PM, Chris Kennelly ckennelly@xxxxxxxxxx wrote:
> On Wed, Feb 26, 2020 at 12:01 PM Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> ----- On Feb 25, 2020, at 10:38 PM, Chris Kennelly ckennelly@xxxxxxxxxx wrote:
>>
>> > On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>> >>
>> >> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
>> >> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> >> >
>> >> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google
>> >> > joel@xxxxxxxxxxxxxxxxx wrote:
>> >> >
>> >> > [...]
>> >> > >>
>> >> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
>> >> > >> registered.
>> >> > >>
>> >> > >> - Current protocol in the most recent glibc integration patch set.
>> >> > >> - Not supported yet by Linux kernel rseq selftests,
>> >> > >> - Not supported yet by tcmalloc,
>> >> > >>
>> >> > >> Use the per-thread state to figure out whether each thread needs to register
>> >> > >> Rseq individually.
>> >> > >>
>> >> > >> Works for integration between a library which exists for the entire lifetime
>> >> > >> of the executable (e.g. glibc) and other libraries. However, it does not
>> >> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
>> >> > >> having a library like glibc handling the registration present.
>> >> > >
>> >> > > Mathieu, could you share more details about why during dlopen/close
>> >> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
>> >> > > registered?
>> >> >
>> >> > Sure,
>> >> >
>> >> > A library which is only loaded and never closed during the execution of the
>> >> > program can let the kernel implicitly unregister rseq at thread exit. For
>> >> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
>> >> > each thread's __rseq_abi which sits in a library which is going to be
>> >> > dlclose'd.
>> >>
>> >> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
>> >> sounds from Chris's reply that tcmalloc already checks
>> >> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
>> >> seems to already handle things properly - CMIIW.
>> >
>> > I'll make a note about this, since we can probably benefit from some
>> > more comments about the assumptions/invariants the fastpath uses.
>>
>> I suspect the integration with glibc and with dlopen'd/dlclose'd libraries will
>> not behave correctly with the current tcmalloc implementation.
>>
>> Based on the tcmalloc code-base, InitFastPerCpu is only called from IsFast. As
>> long as this is the only expected caller, having IsFast comparing the RseqCpuId
>> detects whether glibc (or some other library) has already registered rseq for
>> the current thread.
>>
>> However, if the application chooses to invoke InitFastPerCpu() directly, things
>> become unexpected, because it invokes:
>>
>> absl::base_internal::LowLevelCallOnce(&init_per_cpu_once, InitPerCpu);
>>
>> which AFAIU invokes InitPerCpu at most once during the execution of the current
>> program. Which does:
>>
>> static bool InitThreadPerCpu() {
>>   if (__rseq_refcount++ > 0) {
>>     return true;
>>   }
>>
>>   auto ret = syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0,
>>                      PERCPU_RSEQ_SIGNATURE);
>>   if (ret == 0) {
>>     return true;
>>   } else {
>>     __rseq_refcount--;
>>   }
>>
>>   return false;
>> }
>>
>> static void InitPerCpu() {
>>   // Based on the results of successfully initializing the first thread, mark
>>   // init_status to initialize all subsequent threads.
>>   if (InitThreadPerCpu()) {
>>     init_status = kFastMode;
>>   }
>> }
>>
>> In a scenario where glibc has already registered Rseq, the __rseq_refcount will
>> be incremented, the __NR_rseq syscall will fail with -1, errno=EBUSY, so the
>> refcount will be immediately decremented and it will return false. Therefore,
>> "init_status" will never be set to kFastMode, leaving it in kSlowMode for the
>> entire lifetime of this program. That being said, even though this state can
>> come as a surprise, it seems to be entirely bypassed by the fast-paths IsFast()
>> and IsFastNoInit(), so maybe it won't have any observable side-effects other
>> than leaving init_status in a state that does not match reality.
>
> I agree that this could potentially violate invariants, but
> InitFastPerCpu is not intended to be called by the application.
OK, explicitly documenting this would be a good thing. In my own projects,
I prefix those symbols with double-underscores (__) to indicate that those
are not meant to be called by other means than the static inlines in the API.
There may be use-cases which justify exposing InitFastPerCpu as a public API for
applications though, especially for those which require some level of
real-time guarantees from the malloc/free APIs. I've run into this situation
with liburcu which I maintain.
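As an illustration of why a direct call is surprising, here is a minimal
stand-alone model of the quoted InitThreadPerCpu()/InitPerCpu() logic. It is
only a sketch: ModelThread and FakeRseqRegister are hypothetical names that
stand in for the per-thread kernel state and for syscall(__NR_rseq, ...), and
it assumes the earlier registration (e.g. by glibc) did not touch
__rseq_refcount.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical model types; none of this is tcmalloc or kernel code.
enum InitStatus { kSlowMode, kFastMode };

struct ModelThread {
  bool kernel_registered = false;      // models the kernel's per-thread rseq state
  unsigned refcount = 0;               // models __rseq_refcount
  InitStatus init_status = kSlowMode;  // models tcmalloc's init_status
};

// Models the registration syscall: registering again fails with EBUSY.
int FakeRseqRegister(ModelThread& t) {
  if (t.kernel_registered) {
    errno = EBUSY;
    return -1;
  }
  t.kernel_registered = true;
  return 0;
}

// Same control flow as the quoted InitThreadPerCpu().
bool InitThreadPerCpu(ModelThread& t) {
  if (t.refcount++ > 0) {
    return true;
  }
  if (FakeRseqRegister(t) == 0) {
    return true;
  }
  assert(t.refcount > 0);
  t.refcount--;  // undo the increment when registration fails
  return false;
}

// Same control flow as the quoted InitPerCpu().
void InitPerCpu(ModelThread& t) {
  if (InitThreadPerCpu(t)) {
    t.init_status = kFastMode;
  }
}
```

With kernel_registered preset to true and refcount at 0, calling InitPerCpu()
leaves init_status at kSlowMode and refcount at 0, even though rseq is in fact
registered for the thread. On a fresh ModelThread the same call ends with
kFastMode and a refcount of 1.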
>
>> In the other use-case where tcmalloc co-exists with a dlopened/dlclosed
>> library, but glibc does not provide Rseq registration, we run into issues as
>> well if the dlopened library registers rseq first for a given thread. The
>> IsFastNoInit() expects that if Rseq has been observed as registered in the past
>> for a thread, it stays registered. However, if a dlclosed library unregisters
>> Rseq, we need to be prepared to re-register it. So either tcmalloc needs to
>> express its use of Rseq by incrementing __rseq_refcount even when Rseq is
>> registered (this would hurt the fast-path however, and I would hate to have to
>> do this), or tcmalloc needs to be able to handle the fact that Rseq may be
>> unregistered by a dlclosed library which was the actual owner of the Rseq
>> registration.
>
> We have a bit of an opportunity to figure out whether this is the
> first time--from TCMalloc's perspective--a thread is doing per-CPU and
> bump the __rseq_refcount accordingly. I think this could be done off of
> the fast path.
Is there an explicit tcmalloc API call that each thread needs to make before starting
to use tcmalloc to allocate and free memory ? If not, you'll probably need to add
at least a load of __rseq_refcount (or some other TLS variable), test and conditional
branch on the fast-path, which is an additional cost I would ideally prefer to avoid.
Or do you have something else in mind ?
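To make the extra fast-path cost concrete, here is a rough sketch of the
load/test/branch described above. All names are hypothetical (model_cpu_id
stands in for __rseq_abi.cpu_id, model_rseq_refcount for __rseq_refcount, and
model_thread_counted for "some other TLS variable"); registration itself is
left out, and this is not how tcmalloc is actually structured.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread state; not tcmalloc's or glibc's symbols.
thread_local int32_t model_cpu_id = -1;          // negative: rseq not registered
thread_local uint32_t model_rseq_refcount = 0;   // shared "users of rseq" count
thread_local bool model_thread_counted = false;  // did this user take its reference?

// Off the fast path: take this user's reference exactly once per thread.
void TakeReferenceSlow() {
  assert(!model_thread_counted);
  model_rseq_refcount++;
  model_thread_counted = true;
}

// Fast-path check. The second "if" is the added cost: one TLS load, one test
// and one conditional branch on every call.
bool IsFastModel() {
  if (model_cpu_id < 0) {
    return false;  // not registered; a real implementation would go register
  }
  if (!model_thread_counted) {
    TakeReferenceSlow();
  }
  return true;
}
```

After the first successful call on a thread the branch is always not-taken, so
the steady-state overhead is the TLS load and the well-predicted test, but it is
still paid on every allocation and free.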
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com