Re: [RFC 2/2] x86_64: expand kernel stack to 16K

From: Andy Lutomirski
Date: Tue Oct 21 2014 - 00:59:40 EST


On 10/20/2014 07:00 PM, Dave Jones wrote:
> On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
> > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> > >
> > > If we removed struct thread_info from the stack allocation then one
> > > could do a guard page below the stack. Of course, we'd have to use IST
> > > for #PF in that case, which makes it a non-production option.

Why is thread_info in the stack allocation anyway? Every time I look at
the entry asm, one (minor) thing that contributes to general
brain-hurtingness / sense of horrified awe is the incomprehensible (to
me) split between task_struct and thread_info.

struct thread_info is at the bottom of the stack, right? If we don't
want to merge it into task_struct, couldn't we stick it at the top of
the stack instead? Anything that can overwrite the *top* of the stack
gives trivial user-controlled CPL0 execution regardless.

> >
> > We could just have the guard page in between the stack and the
> > thread_info, take a double fault, and then just map it back in on
> > double fault.
> >
> > That would give us 8kB of "normal" stack, with a very loud fault - and
> > then an extra 7kB or so of stack (whatever the size of thread-info is)
> > - after the first time it traps.
> >
> > That said, it's still likely a non-production option due to the page
> > table games we'd have to play at fork/clone time.

What's wrong with vmalloc? Doesn't it already have guard pages?

(Also, we have a shiny hardware dirty bit, so we could relatively
cheaply check whether we're near the limit without any weird
#PF-in-weird-context issues.)
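
As a userspace analogy only (assuming POSIX mmap/mprotect rather than
vmalloc, with hypothetical names throughout), the guard-page idea looks
roughly like this: map one extra page below the stack and make it
inaccessible, so running off the low end faults immediately instead of
silently corrupting whatever lives there.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: 16K of usable stack, matching the patch under discussion. */
#define STACK_BYTES (16 * 1024UL)

/* Hypothetical descriptor for a stack with a guard page below it. */
struct guarded_stack {
    void  *mapping;  /* start of the mapping, i.e. the guard page */
    size_t map_len;  /* guard page plus usable stack */
    char  *low;      /* lowest usable byte */
    char  *high;     /* one past the highest usable byte */
};

/* Returns 0 on success, -1 on failure. */
static int guarded_stack_alloc(struct guarded_stack *gs)
{
    assert(gs != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t len = (size_t)page + STACK_BYTES;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Revoke all access to the lowest page: this is the guard. */
    if (mprotect(p, (size_t)page, PROT_NONE) != 0) {
        munmap(p, len);
        return -1;
    }

    gs->mapping = p;
    gs->map_len = len;
    gs->low = (char *)p + page;
    gs->high = gs->low + STACK_BYTES;
    return 0;
}

static int guarded_stack_free(struct guarded_stack *gs)
{
    assert(gs != NULL);
    return munmap(gs->mapping, gs->map_len);
}
```

Any access to gs->low[-1] would then fault (SIGSEGV in userspace). In the
kernel the hard part is handling that fault while already out of stack,
which this sketch does not attempt to model.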

Also, muahaha, I've infected more people with the crazy idea that
intentional double-faults are okay. Suckers! Soon I'll have Linux
returning from interrupts with lret! (IIRC Windows used to do
intentional *triple* faults on context switches, so this should be
considered entirely sensible.)

>
> [thread necrophilia]
>
> So digging this back up, it occurs to me that after we bumped to 16K,
> we never did anything like the debug stuff you suggested here.
>
> The reason I'm bringing this up is that, over the last few weeks, I've been
> seeing things like:
>
> [27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left
>
> So we're now eating past that first 8KB in some situations.
>
> Do we care? Or shall we only start worrying if it gets even deeper?

I would *love* to have an immediate, loud failure when we overrun the
stack. This will unavoidably increase the number of TLB misses, but
that probably isn't so bad.
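
To make the trinity number quoted above concrete, here is a small sketch of
the arithmetic, assuming the "bytes left" figure is measured against the
full 16K allocation. It also includes a simplified version of the usual way
such a figure can be obtained: scanning a zero-filled stack from its low end
for the first word that was ever written. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define THREAD_SIZE     (16 * 1024UL)  /* current stack size */
#define OLD_THREAD_SIZE (8 * 1024UL)   /* stack size before the bump */

/* Bytes consumed at the deepest point, given a "bytes left" report. */
static unsigned long stack_bytes_used(unsigned long bytes_left)
{
    assert(bytes_left <= THREAD_SIZE);
    return THREAD_SIZE - bytes_left;
}

/* Nonzero if that depth would not have fit in the old 8K stack. */
static int exceeds_old_stack(unsigned long bytes_left)
{
    return stack_bytes_used(bytes_left) > OLD_THREAD_SIZE;
}

/*
 * Count bytes never written, scanning upward from the low end of a
 * zero-initialized, downward-growing stack until a nonzero word is found.
 */
static size_t untouched_bytes(const unsigned long *low, size_t nwords)
{
    size_t i = 0;

    while (i < nwords && low[i] == 0)
        i++;
    return i * sizeof(unsigned long);
}
```

Under that assumption, 7728 bytes left means 16384 - 7728 = 8656 bytes used,
which is 464 bytes more than an 8K stack could have held.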

--Andy
