Re: xgetbv nondeterminism

From: Andy Lutomirski
Date: Sat Jun 17 2017 - 02:24:02 EST


On Fri, Jun 16, 2017 at 11:03 AM, H.J. Lu <hjl.tools@xxxxxxxxx> wrote:
> On Fri, Jun 16, 2017 at 10:56 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> On Fri, Jun 16, 2017 at 10:44 AM, H.J. Lu <hjl.tools@xxxxxxxxx> wrote:
>>> On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>> On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu <hjl.tools@xxxxxxxxx> wrote:
>>>>> On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>>> On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu <hjl.tools@xxxxxxxxx> wrote:
>>>>>>> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>>>>> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu <hjl.tools@xxxxxxxxx> wrote:
>>>>>>>>> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>>>>>>> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu <hjl.tools@xxxxxxxxx> wrote:
>>>>>>>>>>> It is used for lazy binding the first time when an external function is called.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Maybe I'm just being dense, but why? What does ld.so need to do to
>>>>>>>>>> resolve a symbol and update the GOT that requires using extended
>>>>>>>>>> state?
>>>>>>>>>
>>>>>>>>> Since the first 8 vector registers are used to pass function parameters
>>>>>>>>> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve
>>>>>>>>> the first 8 vector registers when transferring control to ld.so.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Wouldn't it be faster and more future-proof to recompile the relevant
>>>>>>>> parts of ld.so to avoid using extended state?
>>>>>>>>
>>>>>>>
>>>>>>> Are you suggesting not to use vector in ld.so?
>>>>>>
>>>>>> Yes, exactly.
>>>>>>
>>>>>>> We used to do that
>>>>>>> several years ago, which leads to some subtle bugs, like
>>>>>>>
>>>>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=15128
>>>>>>
>>>>>> I don't think x86_64 has the issue that ARM has there. The Linux
>>>>>> kernel, for example, has always been compiled to not use vector or
>>>>>> floating point registers on x86 (32 and 64), and it works fine. Linux
>>>>>> doesn't save extended regs on kernel entry and it doesn't restore them
>>>>>> on exit.
>>>>>>
>>>>>> I would suggest that ld.so be compiled without use of vector
>>>>>> registers, that the normal lazy binding path not try to save any extra
>>>>>> regs, and that ifuncs be called through a thunk that saves whatever
>>>>>> registers need saving, possibly just using XSAVEOPT. After all, ifunc
>>>>>> is used for only a tiny fraction of symbols.
>>>>>
>>>>> x86-64 was the only target which used FOREIGN_CALL macros
>>>>> in ld.so, FOREIGN_CALL macros were the cause of race condition
>>>>> in ld.so:
>>>>>
>>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=11214
>>>>>
>>>>> Not to save and restore the first 8 vector registers means that
>>>>> FOREIGN_CALL macros have to be used. We don't want to
>>>>> do that on x86-64.
>>>>>
>>>>>
>>>>
>>>> You're talking about this, right:
>>>>
>>>> commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
>>>> Author: H.J. Lu <hjl.tools@xxxxxxxxx>
>>>> Date: Tue Aug 25 04:33:54 2015 -0700
>>>>
>>>> Save and restore vector registers in x86-64 ld.so
>>>>
>>>> It seems to me that the problem wasn't that the save/restore happened
>>>> on some of the time -- it was that the save and restore code used a
>>>> TLS variable to track its own state. Shouldn't it have been a stack
>>>> variable or even just implicit in the control flow?
>>>
>>> No, it can't use stack variable since _dl_runtime_resolve never
>>> returns.
>>
>> I haven't dug all the way through the source, but surely ifuncs are
>> CALLed, not JMPed to. That means you have a stack somewhere. This
>> stuff is mostly written in C, and local variables should work just
>> fine.
>>
>>>
>>>> In any case, glibc is effectively doing a foreign call anyway, right?
>>>
>>> No.
>>>
>>>> It's doing the foreign call to itself on every lazy binding
>>>> resolution, though, which seems quite expensive. I'm saying that it
>>>> seems like it would be more sensible to do the complicated foreign
>>>> call logic only when doing the dangerous case, which is when lazy
>>>> binding calls an ifunc.
>>>>
>>>> If I were to rewrite this, I would do it like this:
>>>>
>>>> void *call_runtime_ifunc(void (*ifunc)()); // or whatever the
>>>> signature needs to be
>>>
>>> It is unrelated to IFUNC. This is how external function call works.
>>
>> External function call to what external function? Are there any calls
>> to any non-IFUNC external functions that are triggered by runtime
>> resolution?
>>
>> In any event, I still don't understand the issue. The code does this,
>> effectively:
>>
>> PLT -> GOT
>> GOT points to a stub that transfers control to ld.so
>> ld.so resolves the symbol (_dl_fixup, I think)
>> ld.so patches the GOT
>> ld.so jumps to the resolved function
>>
>> As far as I can tell, the only part of the whole process that might
>> touch vector registers at all is elf_ifunc_invoke(). Couldn't all the
>> register saving and restoring be moved to elf_ifunc_invoke()?
>
> Please grep for FOREIGN_CALL the elf directory.

I grepped FOREIGN_CALL. It has no explanation whatsoever and appears
to unconditionally do nothing in the current glibc version.

In f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e^, in pseudocode, it does:

__thread bool must_save;

RTLD_CHECK_FOREIGN_CALL: return must_save;

RTLD_ENABLE_FOREIGN_CALL: old_must_save = must_save; must_save = true;

RTLD_PREPARE_FOREIGN_CALL: save_state(); must_save = false;

RTLD_FINALIZE_FOREIGN_CALL: if (must_save) restore(); must_save = old_must_save;

save_state() and restore_state() operate on TLS buffers.

In summary: this is not async-signal-safe. It's also really messy --
there are macros that declare local variables, and the logic isn't
apparent without really digging in to all the code.

I still don't see why this couldn't be:

static void elf_do_foreign_stuff(args here)
{
void *buf = alloca(state_size);
xsaveopt(buf); /* or open-code it if you prefer */
call_the_ifunc();
xrstor(buf);
}

If there's more than just the iifunc (malloc? profiling? printf?)
then all of that could be wrapped as well.

All this stuff comes from:

commit b48a267b8fbb885191a04cffdb4050a4d4c8a20b
Author: Ulrich Drepper <drepper@xxxxxxxxxx>
Date: Wed Jul 29 08:33:03 2009 -0700

Preserve SSE registers in runtime relocations on x86-64.

SSE registers are used for passing parameters and must be preserved
in runtime relocations. This is inside ld.so enforced through the
tests in tst-xmmymm.sh. But the malloc routines used after startup
come from libc.so and can be arbitrarily complex. It's overkill
to save the SSE registers all the time because of that. These calls
are rare. Instead we save them on demand. The new infrastructure
put in place in this patch makes this possible and efficient.

While I think that the control flow is a giant mess and the use of TLS
was a mistake, I think Uli had the right idea: explicitly save the
extended state only when needed.

--Andy