Re: [PATCH] xfs: Wake CIL push waiters more reliably

From: Dave Chinner
Date: Thu Dec 31 2020 - 17:01:36 EST


On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> On 30.12.20 23:16, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > Threads, which committed items to the CIL, wait in the
> > > xc_push_wait waitqueue when used_space in the push context
> > > goes over a limit. These threads need to be woken when the CIL
> > > is pushed.
> > >
> > > The CIL push worker tries to avoid the overhead of calling
> > > wake_all() when there are no waiters waiting. It does so by
> > > checking the same condition which caused the waits to happen.
> > > This, however, is unreliable, because ctx->space_used can
> > > actually decrease when items are recommitted.
> >
> > When does this happen?
> >
> > Do you have tracing showing the operation where the relogged
> > item has actually gotten smaller? By definition, relogging in
> > the CIL should only grow the size of the object in the CIL
> > because it must relog all the existing changes on top of the new
> > changed being made to the object. Hence the CIL reservation
> > should only ever grow.
>
> I have (very ugly printk based) log (see below), but it only
> shows, that it happened (space_used decreasing), not what caused
> it.
>
> I only browsed the ( xfs_*_item.c ) code and got the impression
> that the size of a log item is rather dynamic (e.g. number of
> extends in an inode, extended attributes in an inode, continuity
> of chunks in a buffer) and wasn't surprised that a relogged item
> might need less space from time to time.
>
> > IOWs, returning negative lengths from the formatting code is
> > unexpected and probably a bug and requires further
> > investigation, not papering over the occurrence with broadcast
> > wakeups...
>
> One could argue that the code is more robust after the change,
> because it wakes up every thread which is waiting on the next push
> to happen when the next push is happening without making
> assumption of why these threads are waiting by duplicating code
> from that waiters side. The proposed waitqueue_active() is inlined
> to two instructions and avoids the call overhead if there are no
> waiters as well.

One could argue that, but one should also understand the design
constraints for a particular algorithm are before suggesting that
their solution is "robust". :)

>
> # seq 29
>
> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650 ctx->space_used=33554656 , push_seq=29, ctx->sequence=29

So, at 20:08:15 we get a push trigger and the work is queued. But...

.....
> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake cil 00000000e374c6f1 ctx 000000004967d650 ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29

It takes the best part of *50 seconds* before the push work actually
runs?

That's .... well and truly screwed up - the work should run on that
CPU on the very next time it yeilds the CPU. If we're holding the
CPU without yeilding it for that long, hangcheck and RCU warnings
should be going off...

> # seq 30
>
> 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=33554480 , push_seq=30, ctx->sequence=30

20:09:39 for the next trigger,

> 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108908 >= 67108864, push_seq=30, ctx->sequence=30

So it dropped by 16 bytes (seems to be common) which is unexpected.
I wonder if it filled a hole in a buffer and so needed one less
xlog_op_header()? But then the size would have gone up by at least
128 bytes for the hole that was filled, so it still shouldn't go
down in size.

I think you need to instrument xlog_cil_insert_items() and catch
a negative length here:

/* account for space used by new iovec headers */
iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
len += iovhdr_res;
ctx->nvecs += diff_iovecs;

(diff_iovecs will be negative if the number of xlog_op_header
structures goes down)

And if this happens, then dump the transaction ticket via
xlog_print_trans(tp) so we can see all the log items types and
vectors that the transaction has formatted...

> 2020-12-29T20:10:22.507085+01:00 deadbird kernel: [ 1180.715657] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108892 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.507094+01:00 deadbird kernel: [ 1180.731757] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108876 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.659070+01:00 deadbird kernel: [ 1180.867812] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.771081+01:00 deadbird kernel: [ 1180.980187] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.791116+01:00 deadbird kernel: [ 1180.996535] XXX pushw cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:32.512085+01:00 deadbird kernel: [ 1190.725044] XXX no wake cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108856 < 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:32.528119+01:00 deadbird kernel: [ 1190.753321] XXX xc_push_wait ACTIVE!

Also, another 50s hold-off from push work being queued to the work
actually running. That also needs to be understood, because that's
clearly contributing to hitting the hard limit regularly and that
should mostly never happen....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx