Re: [patch] get_request starvation fix

From: Andrew Morton (akpm@zip.com.au)
Date: Sat Feb 16 2002 - 02:32:25 EST

Next message: Vojtech Pavlik: "Re: small IDE cleanup: void * should not be used unless neccessary"
Previous message: Eric S. Raymond: "Re: Disgusted with kbuild developers"
Next in thread: Daniel Phillips: "Re: [patch] get_request starvation fix"
Reply: Daniel Phillips: "Re: [patch] get_request starvation fix"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Marcelo Tosatti wrote:
>
> ...
> It seems the real gain (in latency) is caused by the FIFO behaviour.

Well there is no net gain in latency. What we've gained is *fairness*,
so the worst-case latency is improved. And yes, it's the FIFO behaviour
which provides that. And it is the exclusive wait which reduces the context
switch rate by 50%.

> That is, removing this hunk (against __get_request_wait())
>
> - if (q->rq[rw].count < batch_requests)
> + if (q->rq[rw].count == 0)
> schedule();
>
> Would not make _much_ difference latency-wise. I'm I right or missing
> something ?

Well the code which is presently in -rc1 is quite pointless. We
perform the above test a couple of microseconds after determining
that there are zero free requests. So the test you've illustrated
will *always* be true. The test for zero is the right thing to
do - it's just normal waitqueue handling. It's just a little
obfuscated by the implicit test-for-zero in get_request().

However, contrary to my earlier guess, the request batching does
make a measurable difference. Changing the code so that we wake up
a sleeper as soon as any request is freed costs maybe 30%
on `dbench 64'.

> Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
> the instrumentation stuff _and_ make it clear on the documentation that
> READA requests are not being used in practice).

READA is used in a few filesystems for directory readahead. (And since
dir-in-pagecache, ext2 is no longer performing directory readhead. Bad.)

I came >this< close to killing READA altogether. I believe it's a
misfeature. If we're under intense seek pressure, the very, very _last_
thing we want to do is to create more seeks by not performing readahead.
But given that the misfeature only applies to creation of new requests,
and that block-contiguous readahead will still work OK, it's not too
serious. Plus it's a stable kernel :)

--- linux-2.4.18-rc1/include/linux/blkdev.h Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h Fri Feb 15 23:06:04 2002
@@ -119,9 +119,9 @@ struct request_queue
         spinlock_t queue_lock;

         /*
- * Tasks wait here for free request
+ * Tasks wait here for free read and write requests
          */
- wait_queue_head_t wait_for_request;
+ wait_queue_head_t wait_for_requests[2];
};

struct blk_dev_struct {
--- linux-2.4.18-rc1/drivers/block/ll_rw_blk.c Wed Feb 13 12:59:09 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Fri Feb 15 23:09:17 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
int * max_sectors[MAX_BLKDEV];

/*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
  */
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;

static inline int get_max_sectors(kdev_t dev)
{
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
                 q->rq[i&1].count++;
         }

- init_waitqueue_head(&q->wait_for_request);
+ init_waitqueue_head(&q->wait_for_requests[0]);
+ init_waitqueue_head(&q->wait_for_requests[1]);
         spin_lock_init(&q->queue_lock);
}

@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
  * Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in. Returns NULL if there are no free requests.
  */
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
{
         struct request *rq = NULL;
         struct request_list *rl = q->rq + rw;
@@ -438,40 +443,84 @@ static inline struct request *get_reques
}

/*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ *
+ * 2: We want to be `fair' to all requesters. We must avoid starvation, and
+ * attempt to ensure that all requesters sleep for a similar duration. Hence
+ * no stealing requests when there are other processes waiting.
+ *
+ * 3: We also wish to support `batching' of requests. So when a process is
+ * woken, we want to allow it to allocate a decent number of requests
+ * before it blocks again, so they can be nicely merged (this only really
+ * matters if the process happens to be adding requests near the head of
+ * the queue).
+ *
+ * 4: We want to avoid scheduling storms. This isn't really important, because
+ * the system will be I/O bound anyway. But it's easy.
+ *
+ * There is tension between requirements 2 and 3. Once a task has woken,
+ * we don't want to allow it to sleep as soon as it takes its second request.
+ * But we don't want currently-running tasks to steal all the requests
+ * from the sleepers. We handle this with wakeup hysteresis around
+ * 0 .. batch_requests and with the assumption that request taking is much,
+ * much faster than request freeing.
+ *
+ * So here's what we do:
+ *
+ * a) A READA requester fails if free_requests < batch_requests
+ *
+ * We don't want READA requests to prevent sleepers from ever
+ * waking. Note that READA is used extremely rarely - a few
+ * filesystems use it for directory readahead.
+ *
+ * When a process wants a new request:
+ *
+ * b) If free_requests == 0, the requester sleeps in FIFO manner.
+ *
+ * b) If 0 < free_requests < batch_requests and there are waiters,
+ * we still take a request non-blockingly. This provides batching.
+ *
+ * c) If free_requests >= batch_requests, the caller is immediately
+ * granted a new request.
+ *
+ * When a request is released:
+ *
+ * d) If free_requests < batch_requests, do nothing.
+ *
+ * f) If free_requests >= batch_requests, wake up a single waiter.
+ *
+ * The net effect is that when a process is woken at the batch_requests level,
+ * it will be able to take approximately (batch_requests) requests before
+ * blocking again (at the tail of the queue).
+ *
+ * This all assumes that the rate of taking requests is much, much higher
+ * than the rate of releasing them. Which is very true.
+ *
+ * -akpm, Feb 2002.
  */
+
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
         register struct request *rq;
         DECLARE_WAITQUEUE(wait, current);

         generic_unplug_device(q);
- add_wait_queue(&q->wait_for_request, &wait);
+ add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
         do {
                 set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count == 0)
                         schedule();
                 spin_lock_irq(&io_request_lock);
                 rq = get_request(q,rw);
                 spin_unlock_irq(&io_request_lock);
         } while (rq == NULL);
- remove_wait_queue(&q->wait_for_request, &wait);
+ remove_wait_queue(&q->wait_for_requests[rw], &wait);
         current->state = TASK_RUNNING;
         return rq;
}

-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
- register struct request *rq;
-
- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- return rq;
- return __get_request_wait(q, rw);
-}
-
/* RO fail safe mechanism */

static long ro_bits[MAX_BLKDEV][8];
@@ -546,7 +595,7 @@ static inline void add_request(request_q
/*
  * Must be called with io_request_lock held and interrupts disabled
  */
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
{
         request_queue_t *q = req->q;
         int rw = req->cmd;
@@ -560,8 +609,9 @@ inline void blkdev_release_request(struc
          */
         if (q) {
                 list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
- wake_up(&q->wait_for_request);
+ if (++q->rq[rw].count >= batch_requests &&
+ waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
         }
}

@@ -742,22 +792,30 @@ again:
                         BUG();
         }

- /*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
- */
get_rq:
         if (freereq) {
                 req = freereq;
                 freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(&io_request_lock);
- if (rw_ahead)
- goto end_io;
-
- freereq = __get_request_wait(q, rw);
- goto again;
+ } else {
+ /*
+ * See description above __get_request_wait()
+ */
+ if (rw_ahead) {
+ if (q->rq[rw].count < batch_requests) {
+ spin_unlock_irq(&io_request_lock);
+ goto end_io;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ req = get_request(q, rw);
+ if (req == NULL) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ }
         }

/* fill up the request-info, and add it to the queue */

-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Vojtech Pavlik: "Re: small IDE cleanup: void * should not be used unless neccessary"
Previous message: Eric S. Raymond: "Re: Disgusted with kbuild developers"
Next in thread: Daniel Phillips: "Re: [patch] get_request starvation fix"
Reply: Daniel Phillips: "Re: [patch] get_request starvation fix"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sat Feb 23 2002 - 21:00:10 EST