Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

From: Sabrina Dubroca
Date: Wed Feb 25 2015 - 17:41:18 EST


2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote:
> On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
> > On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
> >> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin <avagin@xxxxxxxxx> wrote:
> >>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko <dvlasenk@xxxxxxxxxx>:
> >>>> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> >>>>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko <dvlasenk@xxxxxxxxxx>:
> >>>>>> 64-bit code was using six stack slots less by not saving/restoring
> >>>>>> registers which are callee-preserved according to C ABI,
> >>>>>> and not allocating space for them.
> >>>>>> Only when syscall needed a complete "struct pt_regs",
> >>>>>> the complete area was allocated and filled in.
> >>>>>> As an additional twist, on interrupt entry a "slightly less truncated pt_regs"
> >>>>>> trick is used, to make nested interrupt stacks easier to unwind.
> >>>>>>
> >>>>>> This proved to be a source of significant obfuscation and subtle bugs.
> >>>>>> For example, stub_fork had to pop the return address,
> >>>>>> extend the struct, save registers, and push return address back. Ugly.
> >>>>>> ia32_ptregs_common pops return address and "returns" via jmp insn,
> >>>>>> throwing a wrench into CPU return stack cache.
> >>>>>>
> >>>>>> This patch changes code to always allocate a complete "struct pt_regs".
> >>>>>> The saving of registers is still done lazily.
> >>>>>>
> >>>>>> "Partial pt_regs" trick on interrupt stack is retained.
> >>>>>>
> >>>>>> Macros which manipulate "struct pt_regs" on stack are reworked:
> >>>>>> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
> >>>>>> SAVE_C_REGS saves to it those registers which are clobbered by C code.
> >>>>>> SAVE_EXTRA_REGS saves to it all other registers.
> >>>>>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
> >>>>>>
> >>>>>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
> >>>>>> return pointer.
> >>>>>>
> >>>>>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
> >>>>>> instead of magic numbers.
> >>>>>>
> >>>>>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
> >>>>>> instead of having it open-coded yet again.
> >>>>>>
> >>>>>> Patch was run-tested: 64-bit executables, 32-bit executables,
> >>>>>> strace works.
> >>>>>> Timing tests did not show measurable difference in 32-bit
> >>>>>> and 64-bit syscalls.
> >>>>>
> >>>>> Hello Denys,
> >>>>>
> >>>>> My test vm doesn't boot with this patch. Could you help to investigate
> >>>>> this issue?
> >>>>
> >>>> I think I found it. This part of my patch is possibly wrong:
> >>>>
> >>>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
> >>>> #define ARCH_LOCKDEP_SYS_EXIT_IRQ \
> >>>> TRACE_IRQS_ON; \
> >>>> sti; \
> >>>> - SAVE_REST; \
> >>>> + SAVE_EXTRA_REGS; \
> >>>> LOCKDEP_SYS_EXIT; \
> >>>> - RESTORE_REST; \
> >>>> + RESTORE_EXTRA_REGS; \
> >>>> cli; \
> >>>> TRACE_IRQS_OFF;
> >>>>
> >>>> The "SAVE_REST" here is intended to really *push* extra regs on stack,
> >>>> but the patch changed it so that they are written to existing stack
> >>>> slots above.
> >>>>
> >>>> From code inspection it should work in almost all cases, but some
> >>>> locations where it is used are really obscure.
> >>>>
> >>>> If there are places where *pushing* regs is really necessary,
> >>>> this can corrupt rbp,rbx,r12-15 registers.
> >>>>
> >>>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
> >>>> was here.
> >>>> Please find updated patch attached. Can you try it?
> >>>
> >>> It doesn't work
> >
> > Thanks for testing it anyway.
> >
> >
> >>> [ 3.016262] traps: systemd-cgroups[390] general protection
> >>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
> >>> ld-2.18.so[7f456f79e000+20000]
> >
> > This is what I know about these crashes. The SEGV itself is caused by
> > HLT instruction executed by dynamic loader, ld-2.NN.so.
> > The instruction is in _exit function, and is only reachable if
> > exit_group and exit syscalls fail to terminate the process.
> > So it seems that syscall execution is getting badly broken somehow
> > at some point.
> >
> > This happens to both reporters.
> >
> > My theory that it is related to lockdep seems to be wrong, because
> > Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.
> >
> > Both kernels are paravirtualized, both are booted under KVM,
> > Andrey runs it with four virtual CPUs, Sabrina runs with two.
> >
> > My next theory is that I missed something related to paravirt.
> > I am looking at that code, so far I don't see anything suspicious.
> >
> > Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
> > I run it under "qemu-system-x86_64 -enable-kvm -smp 2",
> > I see in dmesg that kernel does detect that it is being run under KVM,
> > but it works for me. No mysterious segfaults.
> >
> > Andrey, can you send me your bzImage? Maybe it will trigger
> > the problem for me.
> >
> >
> >> The change to stub_\func looks wrong to me. It saves and restores
> >> regs, but those regs might already have been saved if we're on the
> >> slow path. (Yes, all that code is quite buggy even without all these
> >> patches.) So is execve.
> >>
> >> This means that, for example, execve called in the slow path will
> >> save/restore regs twice. If the values in the regs after the first
> >> save and before the second save are different, then we corrupt user
> >> state.
> >
> > This part?
> >
> > .macro FORK_LIKE func
> > ENTRY(stub_\func)
> > CFI_STARTPROC
> > - popq %r11 /* save return address */
> > - PARTIAL_FRAME 0
> > - SAVE_REST
> > - pushq %r11 /* put it back on stack */
> > + DEFAULT_FRAME 0, 8 /* offset 8: return address */
> > + SAVE_EXTRA_REGS 8
> > FIXUP_TOP_OF_STACK %r11, 8
> > - DEFAULT_FRAME 0 8 /* offset 8: return address */
> > call sys_\func
> > RESTORE_TOP_OF_STACK %r11, 8
> > - ret $REST_SKIP /* pop extended registers */
> > + ret
> > CFI_ENDPROC
> > END(stub_\func)
> > .endm
> >
> > FORK_LIKE clone
> > FORK_LIKE fork
> > FORK_LIKE vfork
> >
> > But the old code (SAVE_REST thing) was also saving registers here.
> > It had to jump through hoops (pop return address, SAVE_REST,
> > push return address) to do that.
> > After the patch, "SAVE_EXTRA_REGS 8" does the same, just without
> > pop/push pair.
> >
> > I just don't see what's wrong with it. Can you elaborate?
>
> SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just
> writes them in place. It's possible for this to be called when the
> regs have already been saved.
>
> >
> > And this area of code has no paravirt gunk, so if the bug is here,
> > why it doesn't fail for people running this natively?
>
> I don't know whether paravirt is involved. It could be something else.

After reading Denys's last mail, I tried booting the same VM
- with 1 cpu
- without CONFIG_PARAVIRT
- with x86_64_defconfig

and I still get the same traps in all 3 cases.

I can run some userspace programs, but I have no idea what would be
helpful.
I can also try booting a real machine with archlinux/systemd tomorrow.

--
Sabrina
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/