Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

From: Denys Vlasenko
Date: Wed Feb 25 2015 - 16:29:28 EST


On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin <avagin@xxxxxxxxx> wrote:
>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko <dvlasenk@xxxxxxxxxx>:
>>> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
>>>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko <dvlasenk@xxxxxxxxxx>:
>>>>> 64-bit code was using six fewer stack slots by not saving/restoring
>>>>> registers which are callee-preserved according to the C ABI,
>>>>> and by not allocating space for them.
>>>>> Only when a syscall needed a complete "struct pt_regs"
>>>>> was the complete area allocated and filled in.
>>>>> As an additional twist, on interrupt entry a "slightly less truncated pt_regs"
>>>>> trick is used to make nested interrupt stacks easier to unwind.
>>>>>
>>>>> This proved to be a source of significant obfuscation and subtle bugs.
>>>>> For example, stub_fork had to pop the return address,
>>>>> extend the struct, save registers, and push the return address back. Ugly.
>>>>> ia32_ptregs_common pops the return address and "returns" via a jmp insn,
>>>>> throwing a wrench into the CPU's return stack cache.
>>>>>
>>>>> This patch changes code to always allocate a complete "struct pt_regs".
>>>>> The saving of registers is still done lazily.
>>>>>
>>>>> The "partial pt_regs" trick on the interrupt stack is retained.
>>>>>
>>>>> Macros which manipulate "struct pt_regs" on stack are reworked:
>>>>> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
>>>>> SAVE_C_REGS saves to it those registers which are clobbered by C code.
>>>>> SAVE_EXTRA_REGS saves to it all other registers.
>>>>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
>>>>>
>>>>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
>>>>> the return pointer.
>>>>>
>>>>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
>>>>> instead of magic numbers.
>>>>>
>>>>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
>>>>> instead of having it open-coded yet again.
>>>>>
>>>>> The patch was run-tested: 64-bit executables, 32-bit executables,
>>>>> and strace all work.
>>>>> Timing tests did not show a measurable difference in 32-bit
>>>>> and 64-bit syscalls.
>>>>
>>>> Hello Denys,
>>>>
>>>> My test VM doesn't boot with this patch. Could you help investigate
>>>> this issue?
>>>
>>> I think I found it. This part of my patch is possibly wrong:
>>>
>>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>>> #define ARCH_LOCKDEP_SYS_EXIT_IRQ \
>>> TRACE_IRQS_ON; \
>>> sti; \
>>> - SAVE_REST; \
>>> + SAVE_EXTRA_REGS; \
>>> LOCKDEP_SYS_EXIT; \
>>> - RESTORE_REST; \
>>> + RESTORE_EXTRA_REGS; \
>>> cli; \
>>> TRACE_IRQS_OFF;
>>>
>>> The "SAVE_REST" here is intended to really *push* the extra regs onto
>>> the stack, but the patch changed it so that they are instead written
>>> to existing stack slots above.
>>>
>>> From code inspection it should work in almost all cases, but some of
>>> the locations where it is used are really obscure.
>>>
>>> If there are places where *pushing* regs is really necessary,
>>> this can corrupt the rbp, rbx and r12-r15 registers.
>>>
>>> Your config has CONFIG_LOCKDEP=y, so I think it's worth checking
>>> whether the bug was here.
>>> Please find an updated patch attached. Can you try it?
>>
>> It doesn't work

Thanks for testing it anyway.


>> [ 3.016262] traps: systemd-cgroups[390] general protection
>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
>> ld-2.18.so[7f456f79e000+20000]

This is what I know about these crashes. The SEGV itself is caused by
a HLT instruction executed by the dynamic loader, ld-2.NN.so.
The instruction is in the _exit function, and it is only reachable if
the exit_group and exit syscalls fail to terminate the process.
So it seems that syscall execution is somehow getting badly broken
at some point.
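For reference, glibc's _exit fallback path looks roughly like the
sketch below (simplified; not the exact ld-2.18.so code, though the
syscall numbers are the standard x86-64 ones). The hlt should be
unreachable, but when it is reached, hlt is a privileged instruction
in ring 3, so the CPU raises #GP and the process dies with the SIGSEGV
shown above:

```asm
/* Hedged sketch of glibc _exit: both syscalls should never return.
 * If they do, fall through to hlt. */
_exit:
        movl    $231, %eax      /* __NR_exit_group on x86-64 */
        syscall
        movl    $60, %eax       /* __NR_exit on x86-64 */
        syscall
        hlt                     /* privileged insn in ring 3 -> #GP -> SIGSEGV */
```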

This happens to both reporters.

My theory that it is related to lockdep seems to be wrong, because
Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.

Both kernels are paravirtualized and both are booted under KVM;
Andrey runs with four virtual CPUs, Sabrina with two.

My next theory is that I missed something related to paravirt.
I am looking at that code; so far I don't see anything suspicious.

Unfortunately, it doesn't happen for me: I have Sabrina's bzImage,
I run it under "qemu-system-x86_64 -enable-kvm -smp 2",
and I see in dmesg that the kernel does detect that it is being run
under KVM, but it works for me. No mysterious segfaults.

Andrey, can you send me your bzImage? Maybe it will trigger
the problem for me.


> The change to stub_\func looks wrong to me. It saves and restores
> regs, but those regs might already have been saved if we're on the
> slow path. (Yes, all that code is quite buggy even without all these
> patches.) So is execve.
>
> This means that, for example, execve called in the slow path will
> save/restore regs twice. If the values in the regs after the first
> save and before the second save are different, then we corrupt user
> state.

This part?

.macro FORK_LIKE func
ENTRY(stub_\func)
CFI_STARTPROC
- popq %r11 /* save return address */
- PARTIAL_FRAME 0
- SAVE_REST
- pushq %r11 /* put it back on stack */
+ DEFAULT_FRAME 0, 8 /* offset 8: return address */
+ SAVE_EXTRA_REGS 8
FIXUP_TOP_OF_STACK %r11, 8
- DEFAULT_FRAME 0 8 /* offset 8: return address */
call sys_\func
RESTORE_TOP_OF_STACK %r11, 8
- ret $REST_SKIP /* pop extended registers */
+ ret
CFI_ENDPROC
END(stub_\func)
.endm

FORK_LIKE clone
FORK_LIKE fork
FORK_LIKE vfork

But the old code (the SAVE_REST thing) was also saving registers here.
It had to jump through hoops (pop the return address, SAVE_REST,
push the return address back) to do that.
After the patch, "SAVE_EXTRA_REGS 8" does the same, just without
the pop/push pair.
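Side by side, the two approaches look roughly like this (a simplified
sketch; the expansions of SAVE_REST and SAVE_EXTRA_REGS are
abbreviated, not quoted verbatim from entry_64.S):

```asm
/* Old code: the extra regs are *pushed*, so the return address has to
 * be moved out of the way first and put back afterwards. */
        popq    %r11            /* save return address */
        SAVE_REST               /* push %rbx, %rbp, %r12..%r15 */
        pushq   %r11            /* put return address back */

/* New code: pt_regs is already fully allocated, so the extra regs are
 * written (movq) into their existing slots; the "8" offset skips the
 * return address sitting on top of the frame. */
        SAVE_EXTRA_REGS 8       /* e.g. movq %rbx, RBX+8(%rsp), ... */
```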

I just don't see what's wrong with it. Can you elaborate?

And this area of code has no paravirt gunk, so if the bug is here,
why doesn't it fail for people running this natively?

--
vda

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/