Re: I/O request ordering

Molnar Ingo (ingo@ws6200.hil.siemens.at)
Wed, 7 Aug 96 20:20:09 METDST


On Tue, 6 Aug 1996, Stephen C. Tweedie wrote:

> On Tue, 6 Aug 1996 11:55:46 +0300 (EET DST), Linus Torvalds
> <torvalds@cs.helsinki.fi> said:
>
> > On Mon, 5 Aug 1996, Ray Van Tassle-CRV004 wrote:
>
> >> For some time, I have had doubts about the I/O request ordering algorithm in
> >> drivers/block/ll_blk_rw. ...
>
> > If you _really_ want to improve performance, you should probably not use an
> > elevator. Instead, you should use something like "saw-tooth with a window",
> > where the "window" part is that the ordering algorithm will actually accept a
> > "backwards" request if it is within an acceptance window (of say 5% of the
> > disk or something like that). That can help a lot due to drive caching
> > (generally track caches).
>
> 5% seems an *awful* lot. There is a major problem with this sort of
> window, which is that fairness is sacrificed. It's not uncommon to
> have a lot of IO occurring at one part of the disk. The sawtooth
> guarantees that the request pointer will eventually move on to other
> requests. For example, take the case where there's a lot of syslog
> activity; you'll have a file being constantly written to disk via
> fsync, which can generate a lot of local write activity. You may
> starve the rest of the system if you don't limit the amount of
> backtracking allowed.

As an addition, and as a possible solution: a small theoretical analysis
and a proposed pseudocode algorithm.

Let's consider the following model for the disk-IO problem:

We have a statistical system, where 'requests' get added, and we
have to keep 'head movement' low. Requests can be reordered. Each
request has a 'position' and a 'size'. We have a boundary condition:
we have to process each request within a predefined amount of time.

Our task is to create dynamics for this system (an algorithm) which
satisfy the boundary condition and keep the 'head movement' function
at a minimum.

Even this simple model shows many interesting things:

The best would be to wait forever, to get the lowest value for the
'head movement' function. In our case this is not possible: we have
limited resources. A 'good method' is to 'simulate' an 'infinite
request queue'. This can be achieved by grouping requests into
intervals, and processing each interval as if it were independent
of the others. An independent interval should be processed by using
the best route method, which is simply the elevator algorithm over a
static request queue.
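The elevator over a static queue can be sketched like this: sort the
pending requests by position, then service them in one upward sweep
from the current head position, wrapping around to the lowest request.
The flat array and the function names are illustrative, not kernel
code, and n > 0 is assumed:

```c
#include <stdlib.h>

static int cmp_pos(const void *a, const void *b)
{
    long pa = *(const long *)a, pb = *(const long *)b;
    return (pa > pb) - (pa < pb);
}

/* Reorders positions[0..n-1] into service order for a head at 'head';
 * returns the total head movement for the sweep. */
long elevator_pass(long *positions, int n, long head)
{
    long moved = 0;
    int i, start = 0;

    qsort(positions, n, sizeof(long), cmp_pos);

    /* first request at or beyond the current head position */
    while (start < n && positions[start] < head)
        start++;

    /* upward sweep, then wrap to the lowest remaining request */
    for (i = 0; i < n; i++) {
        long next = positions[(start + i) % n];
        moved += labs(next - head);
        head = next;
    }
    return moved;
}
```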

The 'interval cutter' can be implemented by using two queues, one
getting filled up and one being processed. Or it can be implemented
by using a threshold value for filling up the request queue, and doing
low water / high water mark hysteresis. The elevator method is partly
an 'interval creating' method. All these algorithms are equivalent
because they mix the best route solution with interval cutting.
Intervals can also be created by using a timer, and cleaning the queue
regularly (without accepting further requests).
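The low water / high water variant of the interval cutter can be
sketched as a small state machine. The two threshold constants and
the struct fields are assumptions for illustration:

```c
#include <stdbool.h>

#define LOW_WATER   4
#define HIGH_WATER 16

struct cutter {
    int  queued;    /* requests currently in the active queue    */
    bool draining;  /* true while we process the frozen interval */
};

/* Returns true if a new request may enter the active queue now. */
bool cutter_may_accept(struct cutter *c)
{
    if (!c->draining && c->queued >= HIGH_WATER)
        c->draining = true;      /* interval full: freeze and drain it */
    else if (c->draining && c->queued <= LOW_WATER)
        c->draining = false;     /* drained: open a new interval       */
    return !c->draining;
}
```

The hysteresis gap between the two marks is what makes the intervals:
without it the queue would flip between accepting and draining on
every request.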

The sawtooth method breaks the boundary condition: it's not
guaranteed that all requests get processed. A request can get
'pushed forwards to infinity'. We need some kind of 'interval cutting'
to satisfy the boundary condition (to force requests into fixed-size
time intervals).

If we extend the model by using a 'request timestamp' attribute,
'smooth' dynamics can be created which satisfy the boundary condition.
Based on the 'age' of a request, a 'request priority' can be calculated.
The route would be weighted by this priority. Basically, a high priority
request behaves just like many requests on the same place: they fill up
the limited queue, and the route >must< pass them after some time.
The sum of all priorities has to be kept constant. Sometimes this means
moving low priority requests to another 'waiting' queue. The basic
algorithm would stay unchanged: simply take the closest request.
The sum of priorities >has< to be constant; if we give it up, then we
break the boundary condition again.
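One plausible way to weight the route by priority is to divide the
seek distance by the request's priority, so an old (high-priority)
request pulls the head toward it just as a cluster of requests at one
spot would. The weighting formula and the struct are my assumptions,
not part of the original proposal:

```c
struct req {
    long pos;   /* disk position            */
    long prio;  /* grows with age; >= 1     */
};

/* Index of the request with the smallest priority-weighted distance
 * from the current head position (n > 0 assumed). */
int pick_request(const struct req *q, int n, long head)
{
    int best = 0, i;
    long best_cost = 0;

    for (i = 0; i < n; i++) {
        long d = q[i].pos > head ? q[i].pos - head : head - q[i].pos;
        long cost = d / q[i].prio;   /* high priority shrinks distance */
        if (i == 0 || cost < best_cost) {
            best = i;
            best_cost = cost;
        }
    }
    return best;
}
```

With all priorities equal to 1 this degenerates to the plain 'closest
request' rule, which is the intended smooth mixing of the two dynamics.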

So the key is to effectively mix the two dynamics: closest requests
algorithm which provides 'highest throughput', and the boundary
condition which provides 'guaranteed latency'.

If the priorities are calculated based on other things (like RT processes
or swapping priorities), then 'low latency' can be achieved too. 'Highest
throughput' leads to 'infinite latency', so we can't have maximum
throughput. The tradeoff is controlled by the priority function.

Here is the pseudocode for the 'smooth dynamics':

------------------
Data structures:

the request structure: position, size, priority
a request_tree of request pointers, ordered by 'position'.
a waiting_queue of request pointers
priority_tree of request pointers

each active request is either in the request_tree or in the
waiting_queue, and is ordered by priority (priority ordering is
separate for request_tree and waiting_queue).

------------------
Operations:

add_request():
IF request_tree is full: throw out lower priority requests, or
put the request into the waiting queue
IF request_tree is not full: put the request into the ordered
request_tree. Cluster adjacent
requests if possible.
IF the device is idle, then trigger next_request().

update_priorities(): iterate the tree and update priorities,
throw out low priority requests to get
a constant sum of priorities.

next_request(): [driven by some "I'm ready" disk interrupt]
IF waiting_queue is not empty, then put the highest priority
request(s) into the request tree, and process the next
request. (the old request gets removed).

------------------
Events:

the system calls add_request()
an interrupt calls (or triggers) next_request()
a timer triggers update_priorities()
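The update_priorities() operation above can be sketched in C. The flat
array, the +1-per-tick aging step, and the PRIO_SUM constant are all
assumptions for illustration; a real implementation would walk the
priority_tree:

```c
#define PRIO_SUM 64   /* target constant sum of active priorities */

struct ureq {
    long prio;     /* request priority            */
    int  waiting;  /* 1 = demoted to waiting_queue */
};

void update_priorities(struct ureq *q, int n)
{
    long sum = 0;
    int i;

    /* age every active request; old requests gain priority */
    for (i = 0; i < n; i++)
        if (!q[i].waiting) {
            q[i].prio++;
            sum += q[i].prio;
        }

    /* demote lowest-priority requests until the sum is constant again */
    while (sum > PRIO_SUM) {
        int low = -1;
        for (i = 0; i < n; i++)
            if (!q[i].waiting && (low < 0 || q[i].prio < q[low].prio))
                low = i;
        if (low < 0)
            break;
        q[low].waiting = 1;    /* move to the waiting_queue */
        sum -= q[low].prio;
    }
}
```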

Comments more than welcome,

-- mingo

ps. note that in the special case of no requests, it is useful
to predict the next request. The simplest way is to put the head
to some average track value.

ps2. the update_priorities() function can be made quite heuristic:
swapping requests should get higher priorities, metadata should
get higher priority, RT requests should get higher priorities.
And old requests get higher priority with time, automatically.
Most of the process-based IO request priorities could be
implemented by adding the process' priority to the IO priority,
if there is a process waiting for an IO request. If more processes
wait for an IO request, then their priorities add up.
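The summing rule in ps2 is a one-liner over the list of sleepers; the
struct and field names here are hypothetical, just to make the rule
concrete:

```c
/* A waiting process, as a minimal hypothetical list node. */
struct proc {
    long prio;
    struct proc *next;
};

/* IO priority = base value plus the priorities of all processes
 * sleeping on this request; more waiters means higher priority. */
long io_priority(long base, const struct proc *waiters)
{
    long p = base;
    for (; waiters; waiters = waiters->next)
        p += waiters->prio;
    return p;
}
```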

ps3. the 'interval cutting' dynamics is equivalent to 'priority
based dynamics which has only two priorities: high and low'.

ps4. increasing the size of the request_tree makes reordering
better, but increases latency.

ps5. the waiting_queue is >only< for requests which got thrown
out of the request_tree. The device should be declared as full,
and a global OS waiting queue should be used for new blocking
requests. The waiting_queue is actively maintained by the
update_priorities() operation; its maximum size has to be kept
limited.