Re: [PATCH] mm: larger stack guard gap, between vmas

From: Linus Torvalds
Date: Tue Jul 04 2017 - 19:32:57 EST

On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <ben@xxxxxxxxxxxxxxx> wrote:
> We have:
> bottom = 0xff803fff
> sp = 0xffffb178
> The relevant mappings are:
> ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> fffdd000-ffffe000 rw-p 00000000 00:00 0 [stack]

Ugh. So that stack is actually 8MB in size, but the alloca() is about
to use up almost all of it, and there's only about 28kB left between
"bottom" and that 'rwx' mapping.

Still, that rwx mapping is interesting: it is a single page, and it
really is almost exactly 8MB below the stack.

In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
the top of that odd one-page allocation (0xff7fd000).
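
To double-check the arithmetic, here is a small sketch in C. The addresses are copied from the report quoted above; the macro and function names are made up for illustration and do not come from the kernel or from any library.

```c
#include <stdint.h>

/* Addresses from the quoted report (32-bit user address space). */
#define RWX_TOP       UINT32_C(0xff7fd000) /* end of the one-page rwx mapping */
#define ALLOCA_BOTTOM UINT32_C(0xff803fff) /* "bottom" in the report */
#define STACK_TOP     UINT32_C(0xffffe000) /* end of the [stack] mapping */

/* Hypothetical helper: bytes from the end of the rwx page to the top of the stack. */
static uint32_t rwx_to_stack_top(void)
{
    return STACK_TOP - RWX_TOP;       /* 0x801000 = 8MB + 4kB */
}

/* Hypothetical helper: bytes left between "bottom" and the end of the rwx page. */
static uint32_t room_below_bottom(void)
{
    return ALLOCA_BOTTOM - RWX_TOP;   /* 0x6fff, just under 28kB */
}
```

Both differences come straight out of the two lines of /proc/<pid>/maps output and the reported "bottom" value, so nothing here depends on how the mapping was created.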

Can you find out where that is allocated? Perhaps a breakpoint on
mmap, with a condition to catch that particular one?

Because I'm wondering if it was done explicitly as an 8MB stack
boundary allocation, with the "knowledge" that the kernel then adds a
one-page guard page.

I really don't know why somebody would do that (as opposed to just
limiting the stack with ulimit), but the 8MB+4kB distance is kind of
suggestive.

Maybe that one-page mapping is some hack to make sure that no random
mmap() will ever get too close to the stack, so it really is a "guard
mapping", except it's explicitly designed not so much to guard the
stack from growing down further (ulimit does that), but to guard the
brk() and other mmaps from growing *up* into the stack area..

Sometimes user mode does crazy things just because people are insane.
But sometimes there really is a method to the madness.

I would *not* be surprised if the way somebody allocated the stack was
to basically say:

- let's use "mmap()" with a size of 8MB+2 pages to find a
sufficiently sized virtual memory area

- once we've gotten that virtual address space range, let's over-map
the last page as the new stack using MAP_FIXED

- finally, munmap the 8MB in between so that the new stack can grow
down into that gap the munmap creates.

Notice how you end up with exactly the above pattern of allocations,
and how it guarantees that you get a nice 8MB stack without having to
do any locking (you rely on the kernel to just find the 8MB+8kB areas,
and once one has been allocated, it will be "safe").
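
The three steps could look roughly like the sketch below. This is a guess at what such code might do, not the actual allocator from the report: the struct, the function name and the STACK_GAP macro are hypothetical, the use of MAP_GROWSDOWN for the new stack page is an assumption, and the sizing allocation uses PROT_NONE rather than the 'rwx' seen in the trace.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS and MAP_GROWSDOWN on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define STACK_GAP (8UL * 1024 * 1024)  /* the 8MB the stack may grow into */

/* Hypothetical result type: the two pages that stay mapped. */
struct stack_area {
    void  *guard;  /* lowest page of the reservation, left in place */
    void  *stack;  /* highest page, re-mapped as the new stack */
    size_t page;   /* page size used for the layout */
};

/* Hypothetical helper: returns 0 on success, -1 on failure. */
int reserve_stack_area(struct stack_area *out)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;
    size_t page = (size_t)pagesz;
    size_t total = STACK_GAP + 2 * page;

    /* Step 1: let the kernel find a free range of 8MB + 2 pages. */
    char *base = mmap(NULL, total, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Step 2: over-map the last page as the new stack with MAP_FIXED.
     * MAP_GROWSDOWN is an assumption about how "stack" was meant. */
    void *stack = mmap(base + page + STACK_GAP, page,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_GROWSDOWN,
                       -1, 0);
    if (stack == MAP_FAILED) {
        munmap(base, total);
        return -1;
    }

    /* Step 3: unmap the 8MB in between, leaving a hole below the stack. */
    if (munmap(base + page, STACK_GAP) != 0) {
        munmap(base, total);
        return -1;
    }

    out->guard = base;
    out->stack = stack;
    out->page = page;
    return 0;
}
```

With 4kB pages, the end of the surviving low page and the end of the stack page come out exactly 8MB+4kB apart, which is the distance seen in the mappings above.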

And yes, it would have been much nicer to just use PROT_NONE for that
initial sizing allocation, but for somebody who is only interested in
carving out an 8MB stack in virtual space, the protections are actually
kind of immaterial, so 'rwx' might be just their mental default.