Re: Questions about RAID and I/O scheduler

From: Corrado Zoccolo
Date: Fri Apr 02 2010 - 04:02:38 EST

On Wed, Mar 31, 2010 at 10:24 PM, Yuehai Xu <yuehaixu@xxxxxxxxx> wrote:
> Hi,
>
> I noticed that someone said NOOP is usually the default I/O scheduler
> for hardware RAID. Why not CFQ? Suppose there are just several
> sequential read processes. As far as I know, CFQ will keep all the
> disk heads of the hardware RAID serving one process for a while (a
> time slice); in that case, CFQ should be the best of all I/O
> schedulers. Am I right?
> Generally, a hardware RAID should have its own I/O scheduler in its
> firmware. In that case, the I/O scheduler of the OS should do nothing
> except dispatch the requests as soon as possible; it is up to the
> hardware RAID itself to decide how to schedule them. From this point
> of view, NOOP should be the default one. I am really confused here.
Even single NCQ disks have their own I/O scheduler nowadays.
To clear up the confusion, we should distinguish between
work-conserving and non-work-conserving I/O schedulers.
* A work-conserving scheduler (e.g. deadline, noop) is idle only
  when there is no request pending.
* A non-work-conserving scheduler (e.g. CFQ, AS) may idle even
  when requests are pending, in an effort to improve request
  pattern locality or to provide fairness.
A non-work-conserving scheduler on the host will generally
perform badly if the RAID also has a non-work-conserving scheduler,
because the idling decisions taken by the two schedulers can
conflict, causing disk utilization to drop needlessly. In that case,
NOOP or, even better, deadline could perform much better.
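
As a rough illustration of the work-conserving vs. non-work-conserving
distinction, here is a toy cost model in Python for two sequential
readers that each issue their next request only after the previous one
completes. It is only a sketch: the function names, the `service`,
`seek`, `think` and `slice_len` parameters, and all the numbers are
made up for illustration and are not measurements of any real
scheduler or disk.

```python
import math


def work_conserving_time(n_requests, service, seek):
    """Total time when the scheduler never idles.

    Assumption: the stream that was just served has no request pending
    yet, while the other stream does, so the disk switches streams on
    every request and each request pays a seek.
    """
    return n_requests * (seek + service)


def idling_time(n_requests, service, seek, think, slice_len):
    """Total time when the scheduler idles for the current stream.

    The disk waits `think` for the same stream's next request and
    seeks to the other stream only once every `slice_len` requests.
    """
    switches = math.ceil(n_requests / slice_len)
    return n_requests * (think + service) + switches * seek


if __name__ == "__main__":
    # Hypothetical values in milliseconds, chosen only for illustration.
    params = dict(n_requests=1000, service=1.0, seek=8.0)
    print("work-conserving:", work_conserving_time(**params))
    print("idling:", idling_time(think=0.5, slice_len=50, **params))
```

In this model idling wins whenever the wait for the next request is
cheaper than a seek. It does not cover the stacked case, where a second
idling scheduler below the first one adds its own waits.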

If the RAID controller instead has a work-conserving I/O scheduler
(single NCQ disks and cheap RAID cards typically have this kind of
scheduler), CFQ can effectively control the access pattern by
queuing only the requests that belong to the pattern and delaying the
others. Where possible, it will also take advantage of the lower-level
scheduler's better knowledge of disk geometry for those patterns
that allow multiple requests to be submitted in parallel, namely
random access patterns.
In this case, we suggest trying CFQ and reporting any regressions
w.r.t. NOOP or deadline on some workloads, so we can tune it better.
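
To compare schedulers on a given device, the active one can be read
from and changed through /sys/block/<dev>/queue/scheduler, where the
active name is shown in square brackets. Below is a minimal Python
sketch of that. The helper names and the device name `sda` are
placeholders of mine, and writing to the file requires root.

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse sysfs scheduler text such as 'noop deadline [cfq]'.

    Returns (active, available); active is None if nothing is bracketed.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def scheduler_path(device):
    """Path of the scheduler attribute for a block device name."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def current_scheduler(device):
    """Read and parse the scheduler attribute of `device`."""
    return parse_schedulers(scheduler_path(device).read_text())


def set_scheduler(device, name):
    """Select scheduler `name` for `device` (needs root privileges)."""
    scheduler_path(device).write_text(name)


if __name__ == "__main__":
    # "sda" is a hypothetical device name; adjust it for your system.
    print(current_scheduler("sda"))
```

Running the same workload once per scheduler and comparing the numbers
is then enough to spot a regression worth reporting.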
>
> The next question is about the maximum number of disks in a disk
> array. Fault tolerance should be one limitation, because the more
> disks there are, the higher the chance of failure. However, might
> throughput also be a limitation? Do you know of anyone using a disk
> array that contains a large number of disks to handle small requests,
> such as 256 disks handling 4K requests?
You can use multiple disks to handle many parallel random requests.
You should check the queue depth of your RAID card, though, since it
limits the number of requests actually issued in parallel.
If it is lower than the number of disks (e.g. it is 31 on SATA), then
the additional disks are wasted for random access patterns.
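
The arithmetic behind the queue depth limit can be sketched in Python
as follows. The function names are mine, and the second function
assumes each in-flight request targets a disk chosen uniformly and
independently at random, which is a simplification of real striping.

```python
def max_busy_disks(n_disks, queue_depth):
    """Upper bound on the number of disks working at the same time."""
    return min(n_disks, queue_depth)


def expected_busy_disks(n_disks, queue_depth):
    """Expected number of distinct disks hit by `queue_depth` requests.

    Assumes uniformly random, independent targets, so some requests
    collide on the same disk and the result stays below the upper bound.
    """
    miss = (1.0 - 1.0 / n_disks) ** queue_depth
    return n_disks * (1.0 - miss)


if __name__ == "__main__":
    # The 256-disk array from the question with a queue depth of 31.
    print(max_busy_disks(256, 31))
    print(expected_busy_disks(256, 31))
```

For the 256-disk example with a queue depth of 31, no more than 31
disks can be busy at once, and under the random-target assumption the
expected number is slightly lower still.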

Thanks,
Corrado
>
> Thanks!
>
> Yuehai
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo@xxxxxxxxx
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda
