Re: Per-Processor Data Page

Scott Lurndal (slurn@griffin.engr.sgi.com)
Thu, 9 Dec 1999 15:48:42 -0800 (PST)


> > 4) For pthreads, it is required to have a notion of thread private storage.
>
> It's worth remembering that pthreads is not a good API for efficient performance.

I'll cheerfully disagree with this. I have a bit of experience
in the area, and find that the bulk of serious multithreaded applications
use and want pthreads.

> Several people are not using pthreads, others would drop it except for the
> fact you are stuck with the pthreads locks buried in glibc, and separating
> the two is hard.

This might be true on linux, but it is not at all true in the wider
industry. Pthreads are used quite extensively in large SMP applications.

Another example of a pthreaded application, available for linux, is
the mysql database software.

>
> You also rarely want thread private storage. Most applications require
> task (in the job to be handled sense) specific storage. My media-server hack
> for example has file handle specific storage because that's what it really
> wants.

I'll also disagree with this. Thread private storage is quite important.

In any case, pthreads support is required by the Single Unix Specification,
and divergence from that does neither linux nor the unix community in general
any good whatsoever. Let's not try to out-microsoft microsoft.

>
> > a thread to access a pointer to its private data. (User level threading
> > schemes, and M-N schemes can also use this mechanism, albeit with
> > a system call to establish a new value for the private data pointer)
>
> Usermode threading can do the same tricks that the kernel does with %esp
> masking. On real processors (i.e. non-x86) registers are cheap enough you can

Of course they can; however, they then have all the pitfalls and
deficiencies of the kernel mechanism, one of which is the requirement
that thread stacks be fixed in size, which is not compatible with
pthreads either as defined by POSIX or as adopted by the Open Group.
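
To make the incompatibility concrete: POSIX lets each thread request
its own stack size (or supply its own stack entirely), which a
fixed-size, fixed-alignment stack scheme cannot honor. A minimal
sketch (the function names are mine, for illustration):

    #include <pthread.h>

    static void *worker(void *arg)
    {
            return arg;
    }

    int start_big_stack_thread(void)
    {
            pthread_t tid;
            pthread_attr_t attr;

            pthread_attr_init(&attr);
            /* POSIX permits a per-thread stack size; a scheme that
             * assumes fixed-size, fixed-alignment stacks cannot
             * honor this request. */
            pthread_attr_setstacksize(&attr, 4 * 1024 * 1024);
            return pthread_create(&tid, &attr, worker, NULL);
    }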

> use one for thread storage. The %esp trick overhead generally outweighs
> the TLB flush overheads and nowadays is cheaper than segment overrides

Only if you are willing to live within the limitations of the trick,
which may prevent portable software from being usable on linux.
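
For reference, the trick under discussion is essentially what the
kernel already does to find 'current' on x86; with an 8K, 8K-aligned
kernel stack it amounts to (roughly the 2.2-era code):

    static inline struct task_struct * get_current(void)
    {
            struct task_struct *current;
            /* the task struct sits at the base of the 8K-aligned
             * kernel stack, so masking the low 13 bits of %esp
             * recovers a pointer to it */
            __asm__("andl %%esp,%0" : "=r" (current) : "0" (~8191UL));
            return current;
    }

Fast, certainly; but everything using it inherits the fixed-size,
fixed-alignment stack assumption.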

>
> Having access to the pid at a fixed address is as good as the %esp trick if
> not better. Unfortunately it means you now have to rewrite it each context
> switch or flush the TLB. Both of these suck. This is one obvious reason for

You would rewrite it on each context switch: one cache line or less of data.
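
To put a number on it, the per-switch cost I have in mind is a couple
of stores into the per-processor page (the layout, PERCPU_BASE, and
the 'thread_private' field are hypothetical, purely for illustration):

    /* hypothetical layout of the per-processor data page */
    struct percpu_data {
            pid_t   pid;              /* pid of the thread running here */
            void   *thread_private;   /* thread-private data pointer */
    };
    #define PERCPU ((struct percpu_data *)PERCPU_BASE)

    /* in switch_to(): a couple of stores, one cache line or less */
    PERCPU->pid = next->pid;
    PERCPU->thread_private = next->thread_private;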

> not putting the pid in such a page. On platforms like the ARM and PA-RISC
> this is a big big issue, on X86 it sucks slightly.

This is, of course, a platform issue, and each platform may do it a
different way. On PA-RISC and ARM one can fall back to the system call
if it cannot be implemented more efficiently.

>
> > 6) Kernel stacks must be fixed in size and aligned on 2-page boundaries.
>
> Depends on the platform. Most non x86 cpus have registers and we keep
> 'current' in a register. It makes for faster smaller code as well. To go
> over a 2 page boundary requires the stack is virtual not physical. At that
> point you can have holes in the stack to do alignment, you can do all sorts
> of tricks.

Again, this is platform-dependent code, and doing it one way on x86
doesn't imply it has to be done the same way on other archs. MIPS, for
example, does have features that make a PDA easy to implement.

My proposal is aimed at ia32 platforms only at this time.

>
> > A solution to the above set of problems is to provide a fixed virtual mapping
> > to a pair of physical pages which are associated with each processor, and to
> > ensure that this mapping is used by a thread of execution when executing on
> > said processor.
>
> TLB flush per context switch. That's a big lose. The magic page has to be
> constant, and globally mapped.

No, it cannot be globally mapped. It must be accessible only to a single
processor. This implies only that one page directory entry will differ
between two clones sharing an address space. It will cause a TLB flush
in some (perhaps boundary) cases where two threads in the same address
space are switched between. I don't consider this typical (see my
earlier post to Ingo Molnar).

In fact, if it could be arranged that the TLB entry is never flushed,
that would be ideal, since it will _always_ map the same physical page
for a given processor. I think this can be done just by setting the
global bit in the pte; the entry will remain cached on each processor
(assuming it isn't evicted by another 4MB page) even though each
processor will have a different physical address in its TLB.
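
A sketch of the boot-time setup I have in mind, done once on each
processor (percpu_phys_page[], percpu_pte_pointer(), and the fixed
PERCPU_VADDR are illustrative names, not existing kernel interfaces):

    #define PERCPU_VADDR 0xFFFFD000UL   /* example fixed virtual address */

    void setup_percpu_page(void)
    {
            int cpu = smp_processor_id();
            pte_t pte;

            /* map this CPU's private physical page at the common
             * fixed virtual address, with the global bit set so the
             * translation survives CR3 reloads (needs PGE, P6+) */
            pte = mk_pte_phys(percpu_phys_page[cpu],
                              __pgprot(_PAGE_PRESENT | _PAGE_RW |
                                       _PAGE_GLOBAL));
            set_pte(percpu_pte_pointer(cpu), pte);
            __flush_tlb_one(PERCPU_VADDR);
    }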

>
> > Pros:
> > 1) Decoupling the task_struct from the kernel stack makes many
> > enhancements of the kernel stack architecture possible including
> > but not limited to:
> >
> > 1) Larger kernel stacks
>
> We can do that anyway. Larger stacks are bad. 4000 processes at 8K a process
> is a lot of memory, more would suck.

Variable-sized and automatically growing kernel stacks aren't bad. I'd
argue that larger stacks aren't necessarily bad either; it depends on
the system, workload, drivers, etc.

The ability to, for example, allocate scatter-gather lists on the kernel
stack rather than call a memory allocator in the I/O path would be a
win. Likewise for the struct kiobuf.
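
That is, with a stack known to be large enough, the two alternatives
look like this (sketch only; the names are illustrative):

    /* today: an allocator call in the I/O path, with a failure case */
    struct scatterlist *sg = kmalloc(nents * sizeof(*sg), GFP_ATOMIC);
    if (!sg)
            return -ENOMEM;
    /* ... build and issue the request, then kfree(sg) ... */

    /* with a larger (or growable) kernel stack: no allocator call,
     * no failure path, storage vanishes on return */
    struct scatterlist sg[MAX_SEGS];    /* MAX_SEGS illustrative */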

>
> > 2) Variable sized kernel stacks
> > 3) Automatically growing kernel stacks.
>
> No. Your allocation might fail in an irq, or you might need to do a stack
> grow during a block I/O page out to allocate memory. Deadlock death.

Keep a reserved page per processor for this very eventuality. Reclaim
the page before dismissing the interrupt. Panic if the subsequent guard
page is hit (which would indicate runaway recursion).
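
In outline (every name here is hypothetical; this is the policy, not
an implementation):

    /* one emergency page per processor, set aside at boot */
    static unsigned long stack_reserve_page[NR_CPUS];

    /* on a stack-growth fault taken at interrupt time, where the
     * allocator cannot safely be called: spend the reserve */
    if (in_interrupt()) {
            map_stack_page(fault_addr, stack_reserve_page[cpu]);
            stack_reserve_page[cpu] = 0;
    }

    /* before dismissing the interrupt: replenish or reclaim the
     * reserve; if the guard page beyond it is ever hit, panic,
     * since that indicates runaway recursion */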

>
> > 3) Thread-private data can be implemented efficiently.
>
> It can already. The %esp trick is fast.

And has a number of shortcomings.

>
> > 4) Some system calls, especially those that return static data,
> > can be implemented very efficiently - not requiring a kernel
> > boundary crossing.
>
> At a huge cost. Static system calls can be handled by simply caching the
> values in the C library. It's silly to force tlb flushes for this.

True for getpid(); that was a bad example on my part. It is not
necessarily true for gettimeofday(), times(), the kernel random number
seed, and other data frequently required from the kernel. The kernel
could even use the page to export code to user mode, e.g. for signal
handling.
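
For example, gettimeofday() could reduce to a few loads from the
shared page. A sketch (the layout and KSHARED_BASE are hypothetical;
a real version needs a consistency protocol such as the sequence
counter shown, since the kernel updates the page from the timer tick):

    /* hypothetical layout of the user-readable kernel data page */
    struct kshared {
            volatile unsigned long seq;   /* odd while an update is in progress */
            struct timeval xtime;         /* maintained by the timer tick */
    };
    #define KSHARED ((const struct kshared *)KSHARED_BASE)

    int fast_gettimeofday(struct timeval *tv)
    {
            unsigned long seq;

            /* retry if the kernel updated the page mid-read */
            do {
                    seq = KSHARED->seq;
                    *tv = KSHARED->xtime;
            } while ((seq & 1) || seq != KSHARED->seq);
            return 0;           /* no kernel boundary crossing */
    }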

Thanks for the comments,

scott lurndal
sgi
