Nonblocking buffered AIO from userspace

From: Milosz Tanski
Date: Thu May 23 2013 - 16:50:33 EST


Hi gang,

I need some advice on the best way to accomplish non-blocking buffered
disk IO from my user space application. Unlike some other database
systems, I'm trying to outsource as much work to the kernel as
possible. I would prefer to avoid having to resort to O_DIRECT and
io_submit to fetch the data, and then having to reimplement the page /
buffer cache and readahead myself.
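
For context, here is roughly what the io_submit route would mean for
me (just a sketch, error handling omitted): O_DIRECT, aligned buffers,
and then my own cache and readahead layered on top of it.

#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>

static void direct_read_sketch(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);

    void *buf;
    posix_memalign(&buf, 4096, 1 << 20);       /* O_DIRECT wants aligned buffers */

    io_context_t ctx = 0;
    io_setup(64, &ctx);                        /* queue of in-flight requests */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 1 << 20, 0);   /* 1MB read at offset 0 */
    io_submit(ctx, 1, cbs);

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);        /* done; caching it is now my problem */
}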

The application is read-heavy with occasional long-running write jobs.
Since I'm not too concerned about performance on the write path, I can
run that work in threads and let it block.

Currently I'm mmapping the files, which makes the read path quite
simple and works great for disk scans when my data set is cached in
memory. When the data is not cached the performance becomes much more
unpredictable, especially when I'm doing an indexed read (giant bitmap
indexes). Here's what my IO path looks like:

application <--> fscache (SSD) <--> cephfs <--> ceph cluster
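
Simplified, the read path today is basically the following (a rough
sketch; the file name is made up and error handling is omitted):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

static unsigned long indexed_read_sketch(const char *path, size_t offset)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    const unsigned char *base = mmap(NULL, st.st_size, PROT_READ,
                                     MAP_SHARED, fd, 0);
    madvise((void *)base, st.st_size, MADV_RANDOM);  /* bitmap index: random access */

    /* If this page isn't resident, the fault blocks the whole thread
       at an unpredictable point in the middle of a query. */
    return *(const unsigned long *)(base + offset);
}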

Ultimately what I'd like is a way to do non-blocking scatter/gather IO
from disk or the page cache into my application. I'd like it to be
non-blocking because it often happens that I can do something useful
while waiting on IO, like uncompressing indexes for another request
that is waiting, processing network IO, etc.
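
To make it concrete, the interface I'm wishing for would look
something like the following. This is purely hypothetical, nothing
like it exists as far as I know; it's just the shape of thing I mean:

#include <sys/types.h>

/* Hypothetical: submit a vector of (offset, length) ranges, never
   block the submitting thread, and get notified via some completion
   fd once the data is in the page cache / copied into my buffers. */
struct page_range {
    off_t  offset;
    size_t length;
};

int buffered_readv_async(int fd, const struct page_range *ranges,
                         int nranges, void *cookie /* returned on completion */);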

With mmap the blocking is unpredictable, and mlock() itself blocks and
only lets me lock a single range at a time, not a vector of page
ranges.
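
So for the scattered ranges that one index lookup touches, the best I
can do today is something like this, where every iteration can block
while its pages are faulted in (sketch only):

#include <sys/mman.h>
#include <stddef.h>

static void pin_ranges_sketch(const unsigned char *base,
                              const size_t *offsets, const size_t *lengths,
                              int nranges)
{
    for (int i = 0; i < nranges; i++)
        mlock(base + offsets[i], lengths[i]);  /* one blocking call per range */
}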

If I were doing this in the kernel, life would be simple; there are all
sorts of APIs for doing async IO even when my VFS is stacked as in the
diagram above. Is there any way for me to take advantage of that in
user space... even if it's in units of pages?

It's entirely possible that I'm missing something and there's a good
way of doing this that I haven't thought of.

Thanks,
- Milosz