Quite honestly, the main place I have found O_DIRECT useful is in keeping programs that do large amounts of i/o from blowing out the buffer cache and making the other applications run like crap. If your application is running alone, then unless you are very short of CPU or memory, the saving from avoiding the copy to an o/s buffer will be down in the measurement noise.
I had a news (usenet) server which normally did 120 articles/sec (~480 tps), which dropped to about 50 tps when doing large file copies, even at low priority. By using O_DIRECT for the copies the impact essentially vanished, at the cost of the copy running about 10-15% slower. Changing various programs to use O_DIRECT only helped when really large blocks of data were involved, and only when the i/o could be done in a way that satisfied the alignment and size requirements of O_DIRECT.
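To make the alignment and size requirements concrete, here is a rough sketch of a copy routine that uses O_DIRECT on both files. It is not the code I ran on the news server, just an illustration. The names (dio_alloc, dio_round_down, dio_copy) are made up for this example, and the 4096-byte alignment is an assumption: the real requirement depends on the filesystem and device, and some filesystems refuse O_DIRECT entirely.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for O_DIRECT in <fcntl.h> on Linux */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment for buffer address, file offset and transfer size. */
#define DIO_ALIGN ((size_t)4096)

/* Round len down to a multiple of align (align must be a power of two). */
size_t dio_round_down(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return len & ~(align - 1);
}

/* Return nonzero if the pointer is a multiple of align. */
int dio_is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Allocate a buffer whose address satisfies DIO_ALIGN; release with free(). */
void *dio_alloc(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, DIO_ALIGN, size) != 0)
        return NULL;
    return p;
}

/* Copy src to dst in 1 MiB chunks, bypassing the page cache.
 * Returns 0 on success, -1 on any error (including a short write). */
int dio_copy(const char *src, const char *dst)
{
    const size_t chunk = (size_t)1 << 20;   /* a multiple of DIO_ALIGN */
    int rc = -1;
    void *buf = dio_alloc(chunk);
    if (buf == NULL)
        return -1;

    int in = open(src, O_RDONLY | O_DIRECT);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (in < 0 || out < 0)
        goto done;

    for (;;) {
        ssize_t n = read(in, buf, chunk);
        if (n < 0)
            goto done;
        if (n == 0)
            break;                           /* end of file */

        int tail = dio_round_down((size_t)n, DIO_ALIGN) != (size_t)n;
        if (tail) {
            /* The last piece is not a multiple of the alignment, so
             * turn O_DIRECT off on the output before writing it. */
            int fl = fcntl(out, F_GETFL);
            if (fl < 0 || fcntl(out, F_SETFL, fl & ~O_DIRECT) < 0)
                goto done;
        }
        if (write(out, buf, (size_t)n) != n)
            goto done;
        if (tail)
            break;   /* an unaligned read length is taken to mean end of file */
    }
    rc = 0;

done:
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    free(buf);
    return rc;
}
```

Calling dio_copy("/path/to/big.src", "/path/to/big.dst") returns 0 on success. The fiddly part is the tail of the file, which is exactly why converting existing programs was only worth it when they already moved data in large, regular blocks.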
If you upgrade to a newer kernel you can also try the other i/o scheduler options; the default cfq, or even deadline, might be helpful.