Re: [RFC][PATCH] Make io_submit non-blocking

From: Theodore Ts'o
Date: Tue Jul 24 2012 - 16:27:40 EST


On Tue, Jul 24, 2012 at 06:04:23PM +0530, Rajat Sharma wrote:
> >
> > Currently, io_submit tries to execute the io requests on the
> > same thread, which could block because of various reasons (e.g.
> > allocation of disk blocks). So, essentially, io_submit ends
> > up being a blocking call.
>
> Ideally the filesystem should take care of it, e.g. by deferring such
> time-consuming allocations and returning -EIOCBQUEUED immediately. But have
> you seen such cases?

Oh, it happens all the time if you are using AIO. If the file system
needs to read or write any metadata block, AIO can become distinctly
non-"A". The workaround that I've chosen is to create a way to cache
the information needed for the bmap() operation, triggered via an
ioctl() issued at open time, so that this is not an issue, but that
only works if the file is pre-allocated, and there is no need to do
any block allocations.
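
A minimal user-space sketch of the preallocation precondition only,
assuming a POSIX system. It does not show the mapping cache or the
ioctl(), which come from unpublished patches. The helper name
open_preallocated() is hypothetical:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open (creating if needed) a file and reserve
 * `size` bytes of blocks up front, so that later writes inside that
 * range do not need new block allocations.
 * Returns an open file descriptor, or -1 on failure.
 */
static int open_preallocated(const char *path, off_t size)
{
    int fd;

    if (size <= 0)
        return -1;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns 0 on success or an error number. */
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

An application would call this once at setup time, before submitting
any AIO against the file.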

It's all very well and good to say, "the file system should handle
it", but that just pushes the problem onto the file system. And since
you need to potentially issue block I/O requests, which you can't do
from an interrupt context (i.e., a block I/O completion handler), you
really do need to create a workqueue in order to make things work.

If you do it in the fs/direct-io.c layer, at least that way you can
solve the problem once for all file systems....

> With lots of application threads firing continuous IOs, workqueue
> threads might become bottleneck and you might have to eventually
> develop a priority scheduling. This workqueue was originally designed
> for IO retries which is an error path, now consumers of workqueue
> might easily increase by 100x.

Yes, you definitely need to throttle how many AIOs are allowed to be
outstanding, either globally or on a per-superblock/process/user/cgroup
basis, and return EAGAIN if there are too many outstanding requests.
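
A minimal user-space sketch of that kind of throttle, assuming a single
global limit. The names aio_throttle, aio_throttle_get() and
aio_throttle_put() are hypothetical and are not kernel APIs; a real
implementation would probably keep one such counter per superblock,
process, user, or cgroup:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical throttle state: one counter and one fixed limit. */
struct aio_throttle {
    atomic_int outstanding; /* requests submitted but not yet completed */
    int limit;              /* maximum number of outstanding requests */
};

static void aio_throttle_init(struct aio_throttle *t, int limit)
{
    assert(limit > 0);
    atomic_init(&t->outstanding, 0);
    t->limit = limit;
}

/* Reserve a slot. Returns 0 on success, -EAGAIN if the limit is reached. */
static int aio_throttle_get(struct aio_throttle *t)
{
    int cur = atomic_load(&t->outstanding);

    while (cur < t->limit) {
        /* On failure, cur is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&t->outstanding, &cur, cur + 1))
            return 0;
    }
    return -EAGAIN;
}

/* Release a slot when a request completes. */
static void aio_throttle_put(struct aio_throttle *t)
{
    atomic_fetch_sub(&t->outstanding, 1);
}
```

The submit path would call aio_throttle_get() before queueing a request
and hand -EAGAIN back to the caller on failure; the completion path
would call aio_throttle_put().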

Speaking of cgroups, one of the other challenges with running the AIO
out of a workqueue is trying to respect cgroup restrictions. This
applies in particular to the io-throttle cgroup (which is needed to
provide proportional I/O support), but also to the memory cgroup.

All of these complications are why I decided to simply go with the "pin
metadata" approach, since I didn't need to worry (at least initially)
about the allocating write case. (These patches to ext4 haven't yet
been published upstream, mainly because they need a lot of cleanup
work and I haven't had time to do that cleanup; my intention is to get
the "big extents" patchset upstream, though.)

- Ted