Re: elevator code

From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 12 2000 - 17:09:11 EST


Rik van Riel wrote:
>
> Hi Jeff,
>
> since I vaguely remember an email from you describing how
> you spent <large amount of time> tweaking and changing
> the disk IO elevator in Netware, and since we might want
> to improve the Linux elevator sort a bit, could you give
> us some hints on what to do and where not to waste our
> time? ;)
>
> thanks,
>
> Rik

Rik,

Here's a complete description of the internal architecture of the
NetWare elevator and async disk I/O subsystem.

ARCHITECTURE OF THE NETWARE ELEVATOR AND DISK SUBSYSTEM
-------------------------------------------------------

The NetWare Disk subsystem is called the "NetWare Media Manager (MM)"
and supports all elevator, async I/O, and mirroring/hotfixing for the
NetWare operating system. It is logically organized as four separate
layers between the server file cache and the disk drivers, as follows:

Layer 1

   NetWare File Cache/LRU

Layer 2

   MM Mirroring/Hotfixing/Segmentation Layer

Layer 3

   MM Elevator/Disk Heads

Layer 4

   NetWare Disk Drivers

LAYER 1
-------

In NetWare, all free memory in the entire system is owned by the file
cache, and the NetWare memory manager and VM agent actually sit on top
of the file cache. Although Linux is not implemented the same way, this
would be akin to Linux using the buffer cache, set to a block size of
4K, as the "page pool" for the VM, kmalloc(), etc. The advantage of
this approach for NetWare is clear since it's a file server. All the
available memory is used as LRU file cache, and applications in NetWare
alloc and free from the file cache if they need memory. All memory
allocation requests actually get free pages from the file cache itself,
which removes the need to balance LRU memory against app memory in the
server when one starves the other, which is a problem Linux has to deal
with today. In NetWare, the problem does not exist -- the LRU policy
decides how much memory apps can get. The file cache is hard coded to
4K blocks on IA32, and other architectures NetWare has been ported to,
like SPARC, MIPS, etc., emulate this 4K block model. This is why NWFS
on Linux uses an IO_BLOCK_SIZE define set to 4K. It's also why the
NWFS LRU is hard coded to 4K blocks to emulate this architecture.
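
Roughly, the allocation path looks like this. This is only a minimal
sketch in C with made-up names, not the actual NetWare internals:
allocating a page is just stealing the least-recently-used 4K block
from the file cache LRU.

#include <stddef.h>

#define CACHE_BLOCK_SIZE 4096           /* IA32 file-cache block size */

struct cache_block {
    struct cache_block *prev, *next;    /* LRU list linkage           */
    unsigned char data[CACHE_BLOCK_SIZE];
};

static struct cache_block *lru_head;    /* most recently used         */
static struct cache_block *lru_tail;    /* least recently used        */

/* Allocate one 4K page by taking the LRU block out of the file cache
 * (a dirty block would be flushed to disk before being stolen).      */
static void *cache_alloc_page(void)
{
    struct cache_block *victim = lru_tail;

    if (!victim)
        return NULL;                    /* cache exhausted            */

    /* unlink the victim; its 4K page now belongs to the caller       */
    lru_tail = victim->prev;
    if (lru_tail)
        lru_tail->next = NULL;
    else
        lru_head = NULL;

    return victim->data;
}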

The file cache is divided into four pools: memory below 1MB (for DOS,
20-bit DMA buffers, and RPC call structure support), memory below 16MB
(for 24-bit DMA disk drivers that need buffers below 16MB), non-movable
cache memory (memory pinned for page tables, GDT tables, OS data, and
OS code), and movable cache memory (for app code and data, file data,
VM and stack memory, etc.). Movable cache memory is used to cache file
and data blocks for the system.
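
In rough terms (the pool names here are mine, not Novell's):

/* Illustrative names for the four file-cache pools described above. */
enum cache_pool {
    POOL_BELOW_1MB,     /* DOS, 20-bit DMA buffers, RPC call structs  */
    POOL_BELOW_16MB,    /* 24-bit DMA disk drivers                    */
    POOL_NON_MOVABLE,   /* page tables, GDT, OS code and data (pinned)*/
    POOL_MOVABLE        /* app code/data, file data, VM, stacks       */
};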

When the NetWare file cache submits an I/O request to the mirroring
agent, however, the request is always translated from 4K blocks into
sector-relative offsets: the file cache views a volume as a logical
entity of sectors 0....n and converts 4K blocks into sector-relative
LOGICAL volume offsets. For example, if a file mapped to block 2 on a
volume, this would translate into 2 * sectors_per_block (8), or a
logical LBA offset of 16, which is identical to how NT and W2K map
NTFS volume I/O. It was done this way to allow the file cache to
perform I/O at granularities from 1 sector up to the size of the disk
(though native NetWare typically uses a cluster size of 64K as the
maximum).
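
The translation itself is simple arithmetic. A sketch, assuming
512-byte sectors (the field names are illustrative, not the real
structures):

/* Translating a 4K file-cache block number on a logical volume into a
 * sector-relative (LBA) request.  With 512-byte sectors there are 8
 * sectors per 4K block, so block 2 becomes LBA 16, length 8.         */
#define SECTOR_SIZE        512
#define CACHE_BLOCK_SIZE   4096
#define SECTORS_PER_BLOCK  (CACHE_BLOCK_SIZE / SECTOR_SIZE)    /* 8 */

struct volume_io {
    unsigned long lba_start;      /* logical volume sector offset     */
    unsigned long sector_count;   /* any length up to the disk size   */
    void         *buffer;
};

static void block_to_volume_io(unsigned long block, unsigned long blocks,
                               void *buf, struct volume_io *io)
{
    io->lba_start    = block  * SECTORS_PER_BLOCK;    /* 2 -> 16      */
    io->sector_count = blocks * SECTORS_PER_BLOCK;    /* 1 -> 8       */
    io->buffer       = buf;
}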

LAYER 2
-------

The MM Mirroring Agent is responsible for dupping I/O requests and
translating logical volume LBAs into physical disk LBAs. NetWare
striping, mirroring, and segmentation are performed at this layer, and
it is at this layer that remapping of I/O LBA offsets to a particular
device occurs. In NetWare, disk I/O requests are generated by the File
Cache as logical volume LBAs and passed to this layer. NetWare I/O
does not enforce fixed block sizes as is done by Linux. In NetWare, an
MM disk I/O request is very simple and consists of <disk #, LBA start,
number of sectors, buffer pointer, callback function>. There is no
block-alignment enforcement as exists in Linux; the MM interface allows
disk requests to start anywhere on the device and be any length up to
the size of the disk. At this layer, NetWare maintains a map of
mirrored and segmented devices, and if it detects that a device has one
or more mirrored partners, the I/O request gets dupped and remapped for
each mirror that exists. The same applies to volume segments: this
layer performs the remapping of I/O requests for volume segments that
may span disks. This layer should be viewed as a mapping agent and
does little else. The exceptions to this statement relate to hotfixing
and remirroring. This layer owns the remirror daemon, and if it
detects a mirrored partition is out of sync, it spawns the remirror
process, which re-syncs partitions that are not in sync, such as a disk
that failed or a new member being added to a previously existing mirror
group. NetWare allows up to 8 partitions to be mirrored with each
other.
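
A sketch of what that request tuple and the mirror fan-out might look
like (the structure layout and names are my own illustration, not the
actual MM code):

/* The MM disk I/O request described above: <disk #, LBA start,
 * number of sectors, buffer pointer, callback function>.             */
struct mm_request {
    int            disk;          /* physical disk number             */
    unsigned long  lba_start;     /* physical LBA on that disk        */
    unsigned long  sector_count;  /* no block-size/alignment rules    */
    void          *buffer;
    void         (*callback)(struct mm_request *req, int status);
};

#define MAX_MIRRORS 8             /* up to 8 mirrored partitions      */

struct mirror_group {
    int           members;                /* in-sync mirror members   */
    int           disk[MAX_MIRRORS];      /* disk holding each mirror */
    unsigned long base_lba[MAX_MIRRORS];  /* partition start on disk  */
};

/* Fan a logical-volume write out into one request per mirror member,
 * remapping the volume LBA to each member's physical offset.  The
 * caller provides the request array and hands each entry to layer 3. */
static int mirror_remap(const struct mirror_group *grp,
                        unsigned long vol_lba, unsigned long sectors,
                        void *buf, void (*done)(struct mm_request *, int),
                        struct mm_request out[MAX_MIRRORS])
{
    int i;

    for (i = 0; i < grp->members; i++) {
        out[i].disk         = grp->disk[i];
        out[i].lba_start    = grp->base_lba[i] + vol_lba;
        out[i].sector_count = sectors;
        out[i].buffer       = buf;
        out[i].callback     = done;
    }
    return grp->members;          /* number of duplicated requests    */
}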

LAYER 3
--------

The layer beneath the mirroring agent is the MM elevator. The elevator
in NetWare is structured as a pair of queues, an A (queue) list and a
B (de-queue) list, for each disk that resides in the server. At this
point things get a little complicated, however. Two queues per disk
were chosen to solve the problem of elevator starvation. In Linux,
merging of I/O requests occurs at this layer. In NetWare, oddly
enough, the original implementation in 286 NetWare was very similar;
however, a newer model was adopted based on lessons learned in live
customer accounts. In NetWare, requests are merged at A) the boundary
between the File Cache and the I/O subsystem, and B) in the drivers
themselves, and NOT THE ELEVATOR.

Disk drivers in NetWare use two functions to service requests,
GetRequest() and PutRequest(). When an I/O request is submitted via
PutRequest(), it is first placed on the disk's A queue; NetWare drivers
provide a poll() type function that gets called by PutRequest() when an
I/O request is first posted. If the driver is already busy, it returns
a BUSY status, and at this point the thread that inserted the request
returns to the caller. If the driver returns an idle status from
poll(), the driver's poll() function will call GetRequest().
GetRequest() will first check the B queue for a chain of disk requests,
and if the B queue is empty, it will take the entire list on the A
queue, move it to the B queue, zero the heads on the A queue, and allow
the driver to take one or more of the I/O requests on the B queue.
Disk completion interrupts cause the disk drivers to call the callback
post routine in each I/O request the disk has completed; the driver
then calls GetRequest() until the B queue is completely empty. Once
the B queue is empty of requests, GetRequest() will again check the A
queue for new requests, move that list to the B queue, and start
processing the I/O request chains all over again. If an incoming I/O
is posted with PutRequest() while the driver poll() returns BUSY and is
servicing a chain of I/O's on the B list, PutRequest() just drops the
request on the A queue, elevator-indexes it in the chain, and returns.
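
Roughly, the handshake looks like this. This is a simplified C sketch
with hypothetical names, and GetRequest() here hands back one request
at a time; the real interfaces let the driver take whole chains.

#include <stddef.h>

struct io_req {
    struct io_req *next;
    unsigned long  lba;           /* used for elevator ordering        */
    /* buffer, length, callback, etc. omitted in this sketch           */
};

struct disk_queue {
    struct io_req *a_head;        /* A queue: incoming, elevator-sorted */
    struct io_req *b_head;        /* B queue: owned by the disk driver  */
    /* driver poll(): returns BUSY, or when idle calls get_request()    */
    int          (*poll)(struct disk_queue *q);
};

/* PutRequest(): drop the request on the A queue in elevator (LBA)
 * order, poke the driver's poll(), and return to the caller.           */
void put_request(struct disk_queue *q, struct io_req *req)
{
    struct io_req **pp = &q->a_head;

    while (*pp && (*pp)->lba < req->lba)   /* elevator-index the insert */
        pp = &(*pp)->next;
    req->next = *pp;
    *pp = req;

    if (q->poll)
        q->poll(q);               /* a BUSY driver just leaves it queued */
}

/* GetRequest(): hand the driver the next request from the B queue,
 * refilling B with the entire A list only when B is completely empty,
 * so requests posted during a sweep cannot starve the ones in flight.  */
struct io_req *get_request(struct disk_queue *q)
{
    struct io_req *req;

    if (!q->b_head) {             /* B drained: take the whole A list   */
        q->b_head = q->a_head;
        q->a_head = NULL;         /* zero the heads on the A queue      */
    }
    req = q->b_head;
    if (req)
        q->b_head = req->next;
    return req;
}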

This mechanism completely avoids the problem of elevator starvation by
using an alternating A and B list. There are some optimizations here
that allow the disk drivers to merge requests. In NetWare, the disk
drivers are designed as "smart" drivers, and it is inside the drivers
themselves that merging occurs. This is different from Linux, where
the disk drivers are fairly "dumb" by comparison. The design decision
to do it this way was based on a study of different vendor cards: many
of the boards themselves were found to have "hints" about how the
intelligent merging of I/O requests should happen, based on the
interface designs and subtle features each hardware vendor could
instrument to make the process more efficient and provide higher I/O
bandwidth.

The elevator maintains a special set of linkages in the I/O request
chain, and if requests were found to be contiguous, a special set of
links was set up to give the drivers "hints" that a chain of one or
more I/O requests could be merged. The B queue was always assumed to
be owned exclusively by the disk driver, and the driver's calls to
GetRequest() were the actual mechanism that moved and manipulated the
I/O chains on the A and B queues. NetWare disk drivers that used this
optimization would typically take the entire B queue in one
GetRequest() call if the requests were linked as contiguous, and issue
a single I/O operation to the hardware underneath. We tried doing what
Linux does and merging the requests above the drivers, but found that
just giving the driver the chain and letting the driver decide produced
higher performance numbers for certain boards and reduced processing
overhead and memory usage. The NetWare File Cache merged requests as
well, since MM I/O requests are not fixed blocks like Linux's, but
variable-length sector requests. If a 64K cluster needed to be
written, it was always sent as a single 64K I/O request.
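
A sketch of the hint linkage, again with illustrative names; it extends
the request structure from the sketch above with a contiguity link that
a "smart" driver can follow to grab the whole run and issue it as one
hardware operation.

struct io_req {
    struct io_req *next;           /* elevator chain                   */
    struct io_req *contig_next;    /* hint: next request is contiguous */
    unsigned long  lba;
    unsigned long  sectors;
    void          *buffer;
};

/* A driver that understands the hint takes the entire contiguous run
 * in one grab and issues a single transfer covering all of it.        */
static unsigned long take_contiguous_run(struct io_req *first,
                                         struct io_req **run_end)
{
    struct io_req *req = first;
    unsigned long total = first->sectors;

    while (req->contig_next &&
           req->lba + req->sectors == req->contig_next->lba) {
        req    = req->contig_next;
        total += req->sectors;     /* extend the single hardware I/O   */
    }
    *run_end = req;                /* last request covered by the I/O  */
    return total;                  /* sectors to transfer in one shot  */
}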

LAYER 4
--------

Disk drivers in NetWare are described above. One important difference
is that in Linux, disk I/O must be explicitly initiated with a call to
run the tq_disk task queue. In NetWare, submission of a request kicks
the A and B queues into the driver automatically, without needing this
extra step. I understand why Linux did it this way -- to allow the
elevator to fill up. The NetWare model's use of an A and B list
circumvented the need for an external kicker.

I hope this explains some of it. Feel free to ask for more info.
People at times think I don't get it in Linux, but the fact is I do; I
just know it's pointless to argue or debate about stuff with folks since
there's a lot of passion involved, and insulting people's code will just
get a shotgun blast directed this way. I see Linux hitting a lot of the
same obstacles and problems I saw a decade ago working on NetWare, and
it's a temptation to jump in and add my two cents, but I know that
unless folks find out for themselves, they will mostly ignore me. I
remember going to Netscape in 1995 with Ty Mattingly (Ray Noorda's
right-hand man and a personal friend of Bill Gates), and sitting around
a table with a bunch of 26-year-old millionaires while Ty tried to
explain to them how Microsoft was going to crush them into oblivion,
and they discounted every word he said. Netscape doesn't exist today.

:-)

Jeff Merkey
CEO, TRG

> --
> "What you're running that piece of shit Gnome?!?!"
> -- Miguel de Icaza, UKUUG 2000
>
> http://www.conectiva.com/ http://www.surriel.com/