Re: IO queueing and complete affinity w/ threads: Some results

From: Alan D. Brunelle
Date: Thu Feb 14 2008 - 10:37:10 EST


Taking a step back, I went to a very simple test environment:

o 4-way IA64
o 2 disks (on separate RAID controllers, handled by separate ports on the same FC HBA - generating different IRQs).
o Write-cached tests - all IOs are kept inside the RAID controller's cache, so there are no perturbations due to platter accesses.

Basically:

o CPU 0 handled IRQs for /dev/sds
o CPU 2 handled IRQs for /dev/sdaa

We placed an IO generator on CPU 1 (for /dev/sds) and CPU 3 (for /dev/sdaa). The IO generator performed 4KiB sequential direct AIOs in a very small range (2MB - well within the controller cache on the external storage device). We have found this is a simple way to maximize throughput, and thus to watch the system for effects without worrying about seeks & other platter-induced issues. Each test took about 6 minutes to run (each ran a fixed amount of IO, so we could compare & contrast system measurements).

First: overall performance

2.6.24 (no patches) : 106.90 MB/sec

2.6.24 + original patches + rq=0 : 103.09 MB/sec
rq=1 : 98.81 MB/sec

2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
rq=1 : 107.16 MB/sec

So the kthreads patches work much better here - on par with or better than straight 2.6.24. I also ran Caliper (akin to Oprofile, but proprietary and IA64-specific, sorry) and looked at the cycles used. On IA64, back-end bubbles are deadly; they can be caused by cache misses &c. Looking at the gross data:

Kernel                               CPU_CYCLES          BACK END BUBBLES   100.0 * (BEB/CC)
--------------------------------    -----------------    -----------------  ----------------
2.6.24 (no patches)              :  2,357,215,454,852      231,547,237,267              9.8%

2.6.24 + original patches + rq=0 :  2,444,895,579,790      242,719,920,828              9.9%
                            rq=1 :  2,551,175,203,455      148,586,145,513              5.8%

2.6.24 + kthreads patches + rq=0 :  2,359,376,156,043      255,563,975,526             10.8%
                            rq=1 :  2,350,539,631,362      208,888,961,094              8.9%

For both the original & kthreads patches we see a /significant/ drop in bubbles when setting rq=1 over rq=0. This shows up as extra CPU cycles available (not spent in %system) - a graph is provided at http://free.linux.hp.com/~adb/jens/cached_mps.png, built from mpstat data collected in conjunction with the IO runs.

Combining %sys & %soft IRQ, we see:

Kernel                               % user     % sys  % iowait    % idle
--------------------------------    -------   -------  --------  --------
2.6.24 (no patches)              :   0.141%   10.088%   43.949%   45.819%

2.6.24 + original patches + rq=0 :   0.123%   11.361%   43.507%   45.008%
                            rq=1 :   0.156%    6.030%   44.021%   49.794%

2.6.24 + kthreads patches + rq=0 :   0.163%   10.402%   43.744%   45.686%
                            rq=1 :   0.156%    8.160%   41.880%   49.804%

The good news (I think) is that even with rq=0 the kthreads patches give on-par performance w/ 2.6.24, so the default case should be OK...

I've only done a few runs by hand - these results are from one representative run out of the bunch - but I believe this shows what this patch stream is intended to do: optimize placement of IO completion handling to minimize cache & TLB disruptions. Freeing up cycles in the kernel is always helpful! :-)

I'm going to try similar runs on an AMD64 w/ Oprofile and see what results I get there... (BTW: I'll be dropping testing of the original patch sequence - the kthreads patches look better in general, both in terms of code & results. Coincidence?)

Alan
