Re: [PATCH] xfs: introduce object readahead to log recovery

From: Dave Chinner
Date: Fri Jul 26 2013 - 07:35:34 EST


On Fri, Jul 26, 2013 at 02:36:15PM +0800, Zhi Yong Wu wrote:
> Dave,
>
> All comments are good to me, and will be applied to next version, thanks a lot.
>
> On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@xxxxxxxxx wrote:
> >> From: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
> >>
> >> It can take a long time to run log recovery operation because it is
> >> single threaded and is bound by read latency. We can find that it took
> >> most of the time to wait for the read IO to occur, so if one object
> >> readahead is introduced to log recovery, it will obviously reduce the
> >> log recovery time.
> >>
> >> In dirty log case as below:
> >> data device: 0xfd10
> >> log device: 0xfd10 daddr: 20480032 length: 20480
> >>
> >> log tail: 7941 head: 11077 state: <DIRTY>
> >
> > That's only a small log (10MB). As I've said on irc, readahead won't
> Yeah, it is one 10MB log, but how do you calculate it based on the above info?

length = 20480 blocks. 20480 * 512 = 10MB....
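
To spell that arithmetic out, here is a minimal sketch, assuming (as the calculation above does) that the reported log length is counted in 512-byte blocks. The helper names are made up for illustration and are not XFS functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the "length" field is counted in 512-byte blocks. */
#define BASIC_BLOCK_SIZE 512ULL

/* Convert a log length in blocks to bytes (hypothetical helper). */
static uint64_t log_length_bytes(uint64_t blocks)
{
	/* Guard against overflow of the multiplication below. */
	assert(blocks <= UINT64_MAX / BASIC_BLOCK_SIZE);
	return blocks * BASIC_BLOCK_SIZE;
}

/* Convert a log length in blocks to whole MiB, rounding down. */
static uint64_t log_length_mib(uint64_t blocks)
{
	return log_length_bytes(blocks) >> 20;
}
```

With these, log_length_mib(20480) gives 10 for the small log, and log_length_mib(4173824) gives 2038 for the larger one quoted below, i.e. almost 2GB.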

> > And the recovery time from this is between 15-17s:
> >
> > ....
> > log device: 0xfd20 daddr: 107374182032 length: 4173824
> >                                                ^^^^^^^ almost 2GB
> > log tail: 19288 head: 264809 state: <DIRTY>
> > ....
> > real 0m17.913s
> > user 0m0.000s
> > sys 0m2.381s
> >
> > And runs at 3-4000 read IOPs for most of that time. It's largely IO
> > bound, even on SSDs.
> >
> > With your patch:
> >
> > log tail: 35871 head: 308393 state: <DIRTY>
> > real 0m12.715s
> > user 0m0.000s
> > sys 0m2.247s
> >
> > And it peaked at ~5000 read IOPS.
> How do you know its READ IOPS is ~5000?

Other monitoring. iostat can tell you this, though I use PCP...

> > Ok, so you've based the readahead on the transaction item list
> > having a next pointer. What I think you should do is turn this into
> > a readahead queue by moving objects to a new list. i.e.
> >
> >     list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
> >
> >         case XLOG_RECOVER_PASS2:
> >             if (ra_qdepth++ >= MAX_QDEPTH) {
> >                 recover_items(log, trans, &buffer_list, &ra_item_list);
> >                 ra_qdepth = 0;
> >             } else {
> >                 xlog_recover_item_readahead(log, item);
> >                 list_move_tail(&item->ri_list, &ra_item_list);
> >             }
> >             break;
> >         ...
> >         }
> >     }
> >     if (!list_empty(&ra_item_list))
> >         recover_items(log, trans, &buffer_list, &ra_item_list);
> >
> > I'd suggest that a queue depth somewhere between 10 and 100 will
> > be necessary to keep enough IO in flight to keep the pipeline full
> > and prevent recovery from having to wait on IO...
> Good suggestion, will apply it to next version, thanks.

FWIW, I hacked a quick test of this into your patch here and a depth
of 100 brought the recovery time down to under 8s. For other
workloads which have nothing but dirty inodes (like fsmark) a depth
of 100 drops the recovery time from ~100s to ~25s, and the iop rate
is peaking at well over 15,000 IOPS. So we definitely want to queue
up more than a single readahead...
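
For illustration only, the batching idea can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel code: struct item, struct stats, issue_readahead() and recover_batch() are hypothetical stand-ins, and MAX_QDEPTH is simply set to the depth of 100 tested above. The sketch queues each item before checking the depth, so that every item has readahead issued and is recovered exactly once, in log order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QDEPTH 100	/* assumed value; 10-100 was the suggested range */

/* Hypothetical stand-in for a recovery item. */
struct item {
	int readahead_issued;	/* nonzero once readahead was started */
	int recovered_seq;	/* order of recovery, -1 if not yet recovered */
};

/* Bookkeeping so the behaviour of the sketch can be observed. */
struct stats {
	int next_seq;		/* sequence number for the next recovered item */
	size_t batches;		/* number of batches recovered */
	size_t max_in_flight;	/* deepest the readahead queue ever got */
};

/* Stand-in for starting asynchronous readahead on an item's object. */
static void issue_readahead(struct item *it)
{
	it->readahead_issued = 1;
}

/* Stand-in for replaying a batch whose readahead is already in flight. */
static void recover_batch(struct item **queue, size_t n, struct stats *st)
{
	for (size_t i = 0; i < n; i++)
		queue[i]->recovered_seq = st->next_seq++;
	st->batches++;
}

/* Issue readahead for up to MAX_QDEPTH items, then recover them as a batch. */
static void recover_with_readahead(struct item *items, size_t n,
				   struct stats *st)
{
	struct item *queue[MAX_QDEPTH];
	size_t depth = 0;

	for (size_t i = 0; i < n; i++) {
		issue_readahead(&items[i]);
		queue[depth++] = &items[i];
		assert(depth <= MAX_QDEPTH);
		if (depth > st->max_in_flight)
			st->max_in_flight = depth;
		if (depth == MAX_QDEPTH) {
			recover_batch(queue, depth, st);
			depth = 0;
		}
	}
	/* Flush whatever is left in the queue after the last full batch. */
	if (depth > 0)
		recover_batch(queue, depth, st);
}
```

Running this over 250 items with a zeroed struct stats yields three batches (100, 100 and 50 items) and never more than MAX_QDEPTH readaheads outstanding. The real gain depends on the IO actually being asynchronous, which this model does not attempt to show.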

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx