Re: [patch, v2] aio: allocate kiocbs in batches

From: Andrew Morton
Date: Wed Sep 21 2011 - 17:40:33 EST


On Wed, 21 Sep 2011 13:16:00 -0400
Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:

> Hi,
>
> In testing aio on a fast storage device, I found that the context lock
> takes up a fair amount of cpu time in the I/O submission path. The
> reason is that we take it for every I/O submitted (see __aio_get_req).
> Since we know how many I/Os are passed to io_submit, we can preallocate
> the kiocbs in batches, reducing the number of times we take and release
> the lock. In my testing, I was able to reduce the amount of time spent
> in _raw_spin_lock_irq by .56% (average of 3 runs). The command I used
> to test this was:
> aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>
>
> I also tested the patch with various numbers of events passed to
> io_submit, and I ran the xfstests aio group of tests to ensure I didn't
> break anything.
>
> ...
>
> +/*
> + * struct kiocb's are allocated in batches to reduce the number of
> + * times the ctx lock is acquired and released.
> + */
> +#define KIOCB_BATCH_SIZE 32
> +struct kiocb_batch {
> + struct list_head head;
> + long total; /* number of requests passed to sys_io_submit */
> + long allocated; /* number of requests allocated so far */
> +};

I don't see a reason why `total' and `allocated' need to be 64-bit.
Making them 32-bit results in smaller code, smaller storage, smaller
d-cache footprint, etc.

Also, they should logically be unsigned types.

>
> ...
>
> +static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
> +{
> + int i;
> + int to_alloc, avail;
> + bool called_fput = false;
> + struct kiocb *req, *n;
> + struct aio_ring *ring;
> +
> + to_alloc = min(batch->total - batch->allocated, KIOCB_BATCH_SIZE);

And this generates a compile-time warning due to the long/int mismatch.
Did your compiler not warn here? (And why did `to_alloc' and `i' get
to be `int'? The type choices are chaotic in there!)

I'd suggest going with "unsigned" for `total' and `allocated', and make
KIOCB_BATCH_SIZE 32U. Then have a think about the appropriate types
for the derived locals such as `i', `to_alloc' and `avail'.

> + for (i = 0; i < to_alloc; i++) {
> + req = __aio_get_req(ctx);
> + if (!req)
> + /* allocation failed, go with what we've got */
> + break;
> + list_add(&req->ki_batch, &batch->head);
> + }
> +
> + if (i == 0)
> + goto out;
> +
> +retry:
> spin_lock_irq(&ctx->ctx_lock);
> - ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
> - if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
> + ring = kmap_atomic(ctx->ring_info.ring_pages[0]);
> +
> + avail = aio_ring_avail(&ctx->ring_info, ring) - ctx->reqs_active;
> + BUG_ON(avail < 0);
> + if (avail == 0 && !called_fput) {
> + /*
> + * Handle a potential starvation case. It is possible that
> + * we hold the last reference on a struct file, causing us
> + * to delay the final fput to non-irq context. In this case,
> + * ctx->reqs_active is artificially high. Calling the fput
> + * routine here may free up a slot in the event completion
> + * ring, allowing this allocation to succeed.
> + */
> + spin_unlock_irq(&ctx->ctx_lock);
> + kunmap_atomic(ring);

And there's a bug. We need to maintain the thread's atomic state
across the kunmap_atomic(). This should have caused a might_sleep()
runtime warning from kunmap_atomic()'s smp_processor_id() (at least).
That's assuming you tested on a 32-bit highmem box and were able to
exercise this codepath, neither of which seems likely ;)

> + aio_fput_routine(NULL);
> + called_fput = true;
> + goto retry;
> + }
> +
> + if (avail < i) {
> + /* Trim back the number of requests. */
> + list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
> + list_del(&req->ki_batch);
> + kmem_cache_free(kiocb_cachep, req);
> + if (--i <= avail)
> + break;
> + }
> + }
> +
> + batch->allocated += i;
> + list_for_each_entry(req, &batch->head, ki_batch) {
> list_add(&req->ki_list, &ctx->active_reqs);
> ctx->reqs_active++;
> - okay = 1;
> }
> - kunmap_atomic(ring, KM_USER0);
> - spin_unlock_irq(&ctx->ctx_lock);
>
> - if (!okay) {
> - kmem_cache_free(kiocb_cachep, req);
> - req = NULL;
> - }
> + kunmap_atomic(ring);
> + spin_unlock_irq(&ctx->ctx_lock);

Like that.

> - return req;
> +out:
> + return i;
> }
>
> ...
>

I wouldn't want to do the long->unsigned conversion without runtime
testing it so can you please do a v3?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/