[RFC] weird crap with vdso on uml/i386

From: Al Viro
Date: Fri Aug 19 2011 - 21:19:06 EST


On Fri, Aug 19, 2011 at 10:51:51AM +0200, Richard Weinberger wrote:

> Please slow down a bit. :-)
> All these branches are just for testing purposes.
> That's why I have not announced them nor sent a pull request to Linus.
>
> Anyway, thanks for the hints!

np... FWIW, there's a really ugly bug present in mainline as well as
in mainline + these patches and I'd welcome any help in figuring out
what's going on.

1) USER_OBJS do not see CONFIG_..., so os-Linux/main.c doesn't see
CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA. As a result, uml/i386 doesn't
notice that the host vdso is there. That one is easy to fix:
-obj-$(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA) += elf_aux.o
+ifeq ($(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA),y)
+obj-y += elf_aux.o
+CFLAGS_main.o += -DCONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA
+endif
in arch/um/os-Linux/Makefile takes care of that. Unfortunately, it also
exposes a bug in fixrange_init():

2) fixrange_init() gets called with start (and end) not a multiple of
PMD_SIZE; moreover, end is very close to ~0UL - closer than PMD_SIZE.
Bad things start happening to the loops in there. Again, easy to fix:

diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 8137ccc..39ee674 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -119,19 +119,22 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
 	int i, j;
 	unsigned long vaddr;

-	vaddr = start;
+	vaddr = start & PMD_MASK;
 	i = pgd_index(vaddr);
 	j = pmd_index(vaddr);
 	pgd = pgd_base + i;
+	start >>= PMD_SHIFT;
+	end = (end - 1) >> PMD_SHIFT;

-	for ( ; (i < PTRS_PER_PGD) && (vaddr < end); pgd++, i++) {
+	for ( ; (i < PTRS_PER_PGD) && start <= end; pgd++, i++) {
 		pud = pud_offset(pgd, vaddr);
 		if (pud_none(*pud))
 			one_md_table_init(pud);
 		pmd = pmd_offset(pud, vaddr);
-		for (; (j < PTRS_PER_PMD) && (vaddr < end); pmd++, j++) {
+		for (; (j < PTRS_PER_PMD) && start <= end; pmd++, j++) {
 			one_page_table_init(pmd);
 			vaddr += PMD_SIZE;
+			start++;
 		}
 		j = 0;
 	}

That populates the page tables in the right places and fixrange_user_init()
manages to call it, avoid death-by-oom from runaway allocations and then
install references to all pages it wants. Alas, at that point things
become really interesting.

3) with the previous two issues dealt with, we get the following magical
mystery shite when running 32bit uml kernel + userland on 64bit host:
* the system boots all the way to getty/login and sshd (i.e. gets
through the debian /etc/init.d (squeeze/i386))
* one can log into it, both on terminals and over ssh. shell and
a bunch of other stuff works. Mostly.
* /bin/bash -c "echo *" reliably segfaults. Always. So does tab
completion in bash, for that matter.
* said segfault is reproducible both from shell and under gdb.
For /bin/bash -c "echo *" under gdb it's always the 10th call of brk(2).
What happens there apparently boils down to __kernel_vsyscall() getting
called (and yes, sys_brk() is called, succeeds and results in expected
value in %eax) and corrupting the living hell out of %ecx. Namely, on
return from what presumably is __kernel_vsyscall() I'm seeing %ecx equal
to (original value of) %ebp. All registers except %eax and %ecx (including
%esp and %ebp) remain unchanged.
Again, that happens only on the same call of brk(2) - all previous
calls succeed as expected. I don't believe that it's a race. I also
very much doubt that we are calling the wrong location - it's hard to tell
with the call being call *%gs:0x10 (is there any way to find what that
is equal to in gdb, BTW? Short of hot-patching movl *%gs:0x10,%eax in place
of that call and single-stepping it, that is...) but it *does* end up
making the system call that ought to have been made, so I suspect that it
does hit __kernel_vsyscall(), after all...

The text of __kernel_vsyscall() is
0xffffe420 <__kernel_vsyscall+0>: push %ebp
0xffffe421 <__kernel_vsyscall+1>: mov %ecx,%ebp
0xffffe423 <__kernel_vsyscall+3>: syscall
0xffffe425 <__kernel_vsyscall+5>: mov $0x2b,%ecx
0xffffe42a <__kernel_vsyscall+10>: mov %ecx,%ss
0xffffe42c <__kernel_vsyscall+12>: mov %ebp,%ecx
0xffffe42e <__kernel_vsyscall+14>: pop %ebp
0xffffe42f <__kernel_vsyscall+15>: ret
so %ecx on the way out becoming equal to original %ebp is bloody curious -
it would smell like entering that sucker 3 bytes too late and skipping
mov %ecx, %ebp, but... we would also skip push %ebp, so we'd get trashed
on the way out - wrong return address, wrong value in %ebp, changed %esp.
None of that happens. And we are executing that code in userland - i.e.
to get corrupted it would have to be corrupted in the *HOST* 32bit VDSO. Which
would have much more visible effects, starting with the next attempt to
run the testcase blowing up immediately instead of waiting (as it actually
does) for the same 10th call of brk()...

I'm at a loss, to be honest. The sucker is nicely reproducible, but bisecting
doesn't help at all - it seems to be present all the way back at least to
2.6.33. I hadn't tried to go back further and I hadn't tried to go for
older host kernels, but I wouldn't put too much faith into that... The
reason it hadn't been noticed much earlier is that it works fine on i386
host - aforementioned shit happens only when the entire thing (identical
binary, identical fs image, identical options) is run on amd64. However,
on i386 I have a different __kernel_vsyscall, which might easily be the
reason it doesn't happen there. It's a K7 box with sysenter-based
variant ending up as __kernel_vsyscall(). Hell knows what's going on...
Behaviour is really weird and I'd appreciate any pointers re debugging
that crap. Suggestions?