Re: The disappearing sys_call_table export.

From: Terje Eggestad (terje.eggestad@scali.com)
Date: Mon May 05 2003 - 06:23:19 EST


On Mon, 2003-05-05 at 12:25, Christoph Hellwig wrote:
> On Mon, May 05, 2003 at 11:33:36AM +0200, Terje Eggestad wrote:
> > 1. performance is everything.
>
> then Linux is the wrong OS for you :)
>

Strangely enough not. You just have to try and stay out of the kernel as
much as possible ;-)

Of course some idiot sold the total-cost-of-ownership thingy of Linux to
the customers. What they really need is an OS/360...
  
> > 2. We're making a MPI library, and as such we don't have any control
> > with the application.
>
> I can't remember that the MPI spec tells anything about intercepting
> syscalls..
>

It says quite a bit about what memory can be used for comm buffers.

> > 3b. the performance loss from copying from a receive area to the
> > userspace buffer is unacceptable.
> > 3c. It's therefore necessary for HW to access user pages.
> > 4. In order to do 3, the user pages must be pinned down.
> > 5. the way MPI is written, it's not using a special malloc() to allocate
> > the send receive buffers. It can't since it would break language binding
> > to fortran. Thus ANY writeable user page may be used.
>
> so use get_user_pages.

Let me clarify: pinning pages is not, repeat not, a problem.

The problem occurs when you
1. pin a buffer
2. sbrk(-n) or munmap() (usually through free()) the area holding the buffer
3. do a new malloc(), resulting in an sbrk(+n) or mmap()
4. then the new buffer has exactly the same virtual address as the previous one.

(believe it or not, this happens, and relatively frequently).
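
To make the four steps concrete, here is a minimal userspace toy model, not
driver code. Every name in it (toy_map, toy_unmap, pin, struct pin_entry,
demo_stale_pin) is hypothetical, and the "physical page" is just a counter
standing in for whatever get_user_pages() would have pinned. It only shows
why a cache keyed on the virtual address alone gives a false hit.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: which "physical page" currently backs one virtual address.
 * A real kernel uses page tables; this is only an illustration. */
struct mapping {
    uintptr_t vaddr;
    int phys_page;
    int mapped;
};

/* Hypothetical pin-cache entry, keyed by virtual address only. */
struct pin_entry {
    uintptr_t vaddr;
    int pinned_phys_page;
    int valid;
};

static int next_phys_page = 100;   /* each new mapping gets a fresh page */

/* Stand-in for mmap() or sbrk(+n): back vaddr with a new physical page. */
static void toy_map(struct mapping *m, uintptr_t vaddr)
{
    m->vaddr = vaddr;
    m->phys_page = next_phys_page++;
    m->mapped = 1;
}

/* Stand-in for munmap() or sbrk(-n). The pin cache is NOT told about it. */
static void toy_unmap(struct mapping *m)
{
    m->mapped = 0;
}

/* Record which physical page the hardware would be allowed to access. */
static void pin(struct pin_entry *e, const struct mapping *m)
{
    assert(m->mapped);
    e->vaddr = m->vaddr;
    e->pinned_phys_page = m->phys_page;
    e->valid = 1;
}

/* Cache lookup as described in the thread: virtual address is the key. */
static int cache_hit(const struct pin_entry *e, uintptr_t vaddr)
{
    return e->valid && e->vaddr == vaddr;
}

/* A hit is stale when the address matches but the backing page changed. */
static int pin_is_stale(const struct pin_entry *e, const struct mapping *m)
{
    return cache_hit(e, m->vaddr) && e->pinned_phys_page != m->phys_page;
}

/* Walks through steps 1-4; returns 1 if the stale hit occurs. */
int demo_stale_pin(void)
{
    struct mapping m;
    struct pin_entry e;
    const uintptr_t addr = 0x40000000u;   /* arbitrary example address */

    toy_map(&m, addr);
    pin(&e, &m);          /* 1. pin a buffer                        */
    toy_unmap(&m);        /* 2. free() ends up in munmap()/sbrk(-n) */
    toy_map(&m, addr);    /* 3. + 4. new malloc(), same address     */
    return pin_is_stale(&e, &m);
}
```

In this model demo_stale_pin() reports a stale entry: the cache says "already
pinned" while the hardware would still be pointed at the old page.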
 
>
> > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > buffers every time they're used.
>
> Umm, pinning memory all the time means you get a bunch of nice DoS
> attacks due to the huge amount of memory.
>

These are HPC clusters. DoS is a non-issue. These are not normal
multi-user systems. In fact you run one active process per CPU.

> > 7. The only way to cache buffers (to see if they're used before and
> > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > to a device file is prohibitive expensive under point 1.
>
> That's a horribly b0rked approach..
>

It's *FAST*.

> Again, where's your driver source so we can help you to find a better
> approach out of that mess?
>

The trace module we made to trace munmap() and sbrk() could be opened,
but you'll be disappointed, since all the pinning (get_user_pages() and
friends), send(), recv(), etc. are in the drivers for the various hardware,
most of which are not our property.

The module works as follows. It catches sbrk(-arg) and munmap() and lays
out the trace info in a memory area that is mmap()'able through the device
file. So when processes need the trace info, they already have it in memory
and avoid doing an ioctl().

That's all we need to know whether a given virtual address needs to be
(re)pinned.
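
A rough sketch of how the user side of such a scheme could look, written as a
plain userspace simulation. The layout (struct trace_area, a ring of
start/length events with a head counter) and all function names are
assumptions made for illustration, not the actual module's format. A real
shared area would also need volatile accesses or memory barriers, which are
left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 64

/* Hypothetical record of one munmap() or sbrk(-n) call. */
struct trace_event {
    uintptr_t start;
    size_t len;
};

/* Hypothetical layout of the area a process would mmap() from the device. */
struct trace_area {
    unsigned long head;                  /* total events written so far */
    struct trace_event ev[TRACE_SLOTS];  /* ring buffer of recent events */
};

/* Hypothetical user-side cache entry for one pinned buffer. */
struct pinned_buf {
    uintptr_t start;
    size_t len;
    int pinned;
};

/* Stand-in for what the kernel side would do when it sees an unmap. */
static void trace_record(struct trace_area *t, uintptr_t start, size_t len)
{
    assert(len > 0);
    t->ev[t->head % TRACE_SLOTS].start = start;
    t->ev[t->head % TRACE_SLOTS].len = len;
    t->head++;
}

/* Half-open ranges [a, a+alen) and [b, b+blen) share at least one byte. */
static int overlaps(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* User side: scan events recorded since *seen using memory reads only,
 * no syscall. Returns 1 if buf must be (re)pinned before use. */
static int needs_repin(const struct trace_area *t, unsigned long *seen,
                       struct pinned_buf *buf)
{
    unsigned long head = t->head;

    if (head - *seen > TRACE_SLOTS) {
        buf->pinned = 0;  /* ring overran: events were lost, assume the worst */
    } else {
        for (unsigned long i = *seen; i != head; i++) {
            const struct trace_event *e = &t->ev[i % TRACE_SLOTS];
            if (overlaps(buf->start, buf->len, e->start, e->len))
                buf->pinned = 0;
        }
    }
    *seen = head;
    return !buf->pinned;
}

/* Returns 1 if the three checks behave as described in the comments. */
int demo_trace(void)
{
    struct trace_area t = {0};
    struct pinned_buf buf = { 0x40000000u, 8192, 1 };
    unsigned long seen = 0;

    int before = needs_repin(&t, &seen, &buf);     /* no events yet      */
    trace_record(&t, 0x50000000u, 4096);           /* unrelated munmap() */
    int unrelated = needs_repin(&t, &seen, &buf);
    trace_record(&t, 0x40001000u, 4096);           /* hits our buffer    */
    int after = needs_repin(&t, &seen, &buf);

    return before == 0 && unrelated == 0 && after == 1;
}
```

The sketch keeps one cursor per buffer to stay short; a real cache would more
likely keep one cursor per process and invalidate every overlapping entry in
a single pass.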

Let's deal: I'll GPL the trace module if you get me an
EXPORT_SYMBOL_GPL(sys_call_table);

TJ

-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY
_________________________________________________________________________



