Memory Management idea to help move things to userspace

From: john moser
Date: Sun Dec 07 2003 - 01:21:45 EST

I've been told there are issues with transferring huge amounts of data between kernelspace and userspace. I think I may have a solution, if this problem does indeed exist. This would make things such as userspace netfiltering and userspace network filesystem drivers extremely feasible by effectively eliminating all data transfer overhead to and from the kernel in a secure and efficient manner.

PLEASE read and review this. If it is complete and utter braindamage and simply not possible, please explain why, and discard. If it IS good, then please plan to implement in 2.7/2.8 (I realize I'm too late for 2.6). If it's close but not quite there, FIX IT. It's just an idea.

--Bluefox Icy

user-kernel shared memory

THERE IS AN ISSUE with userspace programs communicating with the kernel, from
what I hear. This issue is that the transfer of data to and from the kernel
incurs a large amount of overhead. I have attempted to come up with a
halfway decent way to get around this, and I will state my results here.

FIRST LET ME ASSERT that I do not have any expert understanding of the slab
allocator OR of the TLB; the slab code confuses the hell out of me, and I don't
even remember what TLB stands for. Nonetheless, I ask that those of you who
do understand these things give this idea a fair review, in whatever form
(post, document, or otherwise) you happen to be reading it. If those of you
who know how these things work wish to assert that it won't work, please
supply logical reasons why it won't. If you believe it may, please attempt to
extend these ideas into some workable form.

THE SLAB ALLOCATOR has several types of memory it can allocate, including DMA
allocations--allocations used as targets for direct data transfer from
hardware such as hard disks to memory--and atomic allocations--allocations
that must be satisfied without the calling function sleeping. There is also
kernel and user memory.

THE VIRTUAL MEMORY system, as I understand it, depends on CPU microcode,
cache, and something in the CPU called TLB. As I gather, this TLB microcode
and cache is handled automatically by the CPU. When a process crosses a page
boundary--usually, out of a page loaded into CPU cache--the TLB microcode
reads the page_table cache inside the CPU cache, finds the next mapping, and
loads it from ram. The process sees this as a single, contiguous block of ram,
even though the process' virtual ramspace may be scattered across ram, or not
in ram at all but on disk in swap. In that case the lookup would NOT find the
entry in the TLB cache and would request the operating system to give it that
page_table entry, which would cause the OS to swap the page in.
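
AS A ROUGH ILLUSTRATION of the lookup described above, here is a toy address
translation in C. It assumes 4K pages and a single flat table; the names
(toy_pte, toy_translate) are made up for this sketch and are not kernel
structures. A real CPU walks multi-level tables in hardware and caches the
results in the TLB, so treat this only as a picture of the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u                     /* assumption: 4K pages */
#define TOY_PAGE_SIZE  (1u << TOY_PAGE_SHIFT)

/* One entry per virtual page: which physical frame backs it, if any. */
struct toy_pte {
    uint32_t frame;    /* physical frame number */
    int      present;  /* 0 means not in ram (for example, swapped out) */
};

/*
 * Translate a virtual address through a flat toy page table.
 * Returns 1 and stores the physical address on success, or 0 when the
 * page is not mapped, which is where the OS would have to step in.
 */
static int toy_translate(const struct toy_pte *table, uint32_t npages,
                         uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> TOY_PAGE_SHIFT;       /* virtual page number */
    uint32_t offset = vaddr & (TOY_PAGE_SIZE - 1u);  /* offset inside page  */

    assert(table != NULL && paddr != NULL);
    if (vpn >= npages || !table[vpn].present)
        return 0;
    *paddr = (table[vpn].frame << TOY_PAGE_SHIFT) | offset;
    return 1;
}
```

Two adjacent virtual pages can point at unrelated frames here, which is why
the process sees contiguous ram even when the physical pages are scattered.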

WHAT I AM SUGGESTING is the addition of a new type of slab memory called
User-Kernel Shared Memory, or SLAB_UKSHM (GFP_UKSHM). This memory would ONLY
be created when a specialized hook in the kernel is used, and would be mapped
into the process' ram space with page_table mappings, but also would be a part
of the kernel's private ramspace. In this way, if the process "walked" past
the end of the chunk of kernelspace ram allocated for its purpose, it would
wind up back in userspace, with no kernel-level checking needed to keep the
process from screeching across kernel ram and destroying the kernel. The
major issue here is that this kind of page_table mapping must be noted down so
that it is NOT SWAPPED.
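
AS A USERSPACE ANALOGY ONLY, the sketch below shows two parties working on
the same pinned pages with no copy between them. Nothing here is the proposed
UKSHM: it uses the existing mmap(2), mlock(2) and fork(2) calls, with the
parent standing in for the kernel and the child for the userspace program.
The helper names (shared_region_create, child_writes_message) are
hypothetical.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a region that stays shared across fork() and try to pin it. */
static void *shared_region_create(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* Best effort: mlock() can fail under RLIMIT_MEMLOCK; the mapping
     * is still shared, it just is not guaranteed to stay out of swap. */
    (void)mlock(p, len);
    return p;
}

/* Fork a child that writes msg into the region in place; the parent
 * waits for it. Returns 0 if the child exited cleanly, -1 otherwise. */
static int child_writes_message(char *region, size_t len, const char *msg)
{
    pid_t pid;
    int status;

    assert(region != NULL && msg != NULL && len > 0);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        strncpy(region, msg, len - 1);
        region[len - 1] = '\0';
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

After the child exits, the parent reads the child's bytes straight out of the
region; no read(), write() or pipe carried the data.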

THE TYPICAL USAGE of this would be for immediate and direct communication of
data between the kernel and specialized userspace applications without copying
ram or using syscalls.

I WILL EXEMPLIFY THIS by discussing a possible reimplementation of
netfilter/firewalling in userspace with UKSHM. I am not suggesting here that
iptables needs to be moved out of the kernel; I am suggesting that it is
possible to do so with the only performance hit and complications being that
of moving CODE from kernel space to userspace.

CURRENTLY WITH IPTABLES there are hooks in the kernel to allow a process with
proper access privileges (root owned?) to call certain system functions to
adjust how the kernel handles network connections and data. This requires
minimal data transfer between the kernel and the userspace program for
configuration only. After that, all processing and decision making is done
inside the kernel.

ANOTHER APPROACH TO THIS would be to move the decision making and processing
code out of the kernel, and the targets such as NAT's MASQUERADE, DROP,
ACCEPT, DENY, and FORWARD inside the kernel, possibly as modules. The targets
would exist exactly as they are now, in kernelspace; however, all examination
and decision would be done outside the kernel, and user-defined chains and
targets would exist outside the kernel.

THIS WOULD REQUIRE that the kernel transfer the network data to the process;
the process examine the data and make its decisions and alterations; and
finally that the process pass the altered data and the decisions back to the
kernel. As I understand it, this would currently require a lot of work. This
is where UKSHM comes into play.

WITH THE SUGGESTED ADDITION to the kernel's memory type and mm code, the
overhead of transferring the data back and forth could be evaded. The kernel
would simply negotiate an API with the program, and when needed it would map
out a page or pages of kernel ram to a free page or set of pages of the
userspace program's virtual ramspace (the program typically has 3G. This is
NOT a problem). Then it would call a function in the userspace program to
tell it where this ram is and what it wants done with it. I don't know if
the concept of a kernel calling a function in a program has been thought of
before, or if it is used now; if not, I will term this a "reverse syscall" for
simplicity's sake, because that's what it is.
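
TO MAKE THE "REVERSE SYSCALL" a little more concrete, here is a sketch in
plain C where an ordinary function pointer stands in for the kernel calling
into the program. Everything named ukshm_* is invented for this example, as
is the toy packet layout (byte 0 is a type, byte 1 is a hop count). The point
is only that the handler gets a pointer and a length, works on the data in
place, and hands back a decision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ukshm_verdict { UKSHM_ACCEPT, UKSHM_DROP };

/* Hypothetical entry point the program would register with the kernel. */
typedef enum ukshm_verdict (*ukshm_handler_t)(uint8_t *data, size_t len);

static ukshm_handler_t ukshm_registered_handler;

/* "Negotiate the API": the program tells the kernel which function to call. */
static void ukshm_register(ukshm_handler_t handler)
{
    ukshm_registered_handler = handler;
}

/* Kernel side: pass the buffer itself, not a copy, to the handler. */
static enum ukshm_verdict ukshm_dispatch(uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    if (ukshm_registered_handler == NULL)
        return UKSHM_DROP;   /* nobody to ask */
    return ukshm_registered_handler(pkt, len);
}

/* Program side: examine and alter the data in place, then decide. */
static enum ukshm_verdict example_filter(uint8_t *data, size_t len)
{
    if (len < 2 || data[0] == 0xFF || data[1] == 0)
        return UKSHM_DROP;
    data[1]--;               /* alteration is visible to the caller */
    return UKSHM_ACCEPT;
}
```

In the real proposal the pointer would refer to UKSHM pages mapped into the
program, and the call would cross the kernel/user boundary rather than being
a direct function call.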

UPON RECEIVING the reverse syscall, the program would perform its operations
on the data, then initiate a syscall to the kernel to acknowledge its finished
state. This should be nonblocking, as the program *may* crash (so may the
current implementation), and we do NOT want that to take down the kernel (the
current implementation will). The kernel will then process the passed
decision and unmap the UKSHM from the program, converting it directly to

IN THE EVENT that the connected process is KILLED, the kernel will in this
exercise simply deny all networking (your firewall is dead, what the heck do
you want it to do?) until further notice.
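
THE ACKNOWLEDGE-OR-DENY BEHAVIOR can be pictured as a tiny state machine. The
sketch below is a single-process model, not kernel code, and every name in it
(filter_link, link_submit, link_ack, link_filter_died) is hypothetical. A
packet waits as pending until the program acknowledges with a decision; if
the program dies, anything pending and anything submitted afterwards is
dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict { V_PENDING, V_ACCEPT, V_DROP };

/* State the kernel would keep for one attached filter process. */
struct filter_link {
    bool         alive;  /* is the filter process still attached? */
    enum verdict slot;   /* state of the packet currently handed out */
};

static void link_init(struct filter_link *l)
{
    assert(l != NULL);
    l->alive = true;
    l->slot  = V_DROP;
}

/* Kernel hands a packet to the filter; with no filter it is dropped. */
static enum verdict link_submit(struct filter_link *l)
{
    l->slot = l->alive ? V_PENDING : V_DROP;
    return l->slot;
}

/* The program's acknowledging syscall. It never waits: it either
 * records the decision or reports that there was nothing to decide. */
static bool link_ack(struct filter_link *l, enum verdict v)
{
    assert(v == V_ACCEPT || v == V_DROP);
    if (!l->alive || l->slot != V_PENDING)
        return false;
    l->slot = v;
    return true;
}

/* Called when the filter process is killed: fail closed. */
static void link_filter_died(struct filter_link *l)
{
    l->alive = false;
    if (l->slot == V_PENDING)
        l->slot = V_DROP;
}
```

Because the kernel side never blocks on the program, a crashed filter costs
dropped packets rather than a hung kernel.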

THIS IS ONLY a suggestion, and a sketchy idea at that. However, I believe
that if I have NOT hit this directly on the head in a secure and highly
efficient manner, then I have at least come close, and those of you who know
better can nudge the idea the right way and come up with a viable solution.