Can you help diagnose a weird failed wake?

From: David Howells
Date: Thu Nov 19 2020 - 10:54:22 EST


Hi Peter, Paul,

Jeff Layton is seeing a weird failed wake in the fscache rewrite. Can you
have a look see if I've mucked up somewhere? For reference, Jeff's branch is
here:

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git
ceph-fscache-iter-experimental

The situation is that the fscache_cookie struct has a member called 'stage'
that indicates the state of a cookie representing a data file. As a cache
object is set up, it goes through the LOOKING_UP stage (3) whilst the disk is
being queried to the ACTIVE stage (5) when the backing file has been opened -
at which point we can do I/O on it. Each time the stage is changed, we call
wake_up_var() to ping any waiters on the variable. Waiters use
wait_var_event() and co. to wait on it.

We've added some tracepoints. There's a function which begins an operation on
the cache. It waits for the cache to get into the appropriate stage as part
of that. Around the bit doing the wait in we have:

int __fscache_begin_operation(...)
{
...
wait_and_validate:
spin_unlock(&cookie->lock);
cookie->trace = true;
trace_fscache_stage_wait(cookie, stage, fscache_cookie_stage_wait_begin_op);
timeo = wait_var_event_timeout(&cookie->stage, cookie_stage_changed(cookie, stage), 10*HZ);
if (timeo <= 1) {
pr_warn("%s: cookie stage change wait timed out: cookie->stage=%u stage=%u",
__func__, READ_ONCE(cookie->stage), stage);
fscache_print_cookie(cookie, 'O');
trace_fscache_stage_wait(cookie, stage, fscache_cookie_stage_wait_fail);
} else {
trace_fscache_stage_wait(cookie, stage, fscache_cookie_stage_waited);
}
cookie->trace = false;
goto again;
...
}

and cookie_stage_changed() looks like:

static bool cookie_stage_changed(struct fscache_cookie *cookie, enum fscache_cookie_stage stage)
{
bool changed = (READ_ONCE(cookie->stage) != stage);

if (!changed)
trace_fscache_stage_wait(cookie, stage, fscache_cookie_stage_waiting);

return changed;
}

At the point the wakeup happens, we have:

void fscache_set_cookie_stage(struct fscache_cookie *cookie,
enum fscache_cookie_stage stage)
{
bool changed = false, trace;

spin_lock(&cookie->lock);
switch (cookie->stage) {
case FSCACHE_COOKIE_STAGE_INDEX:
case FSCACHE_COOKIE_STAGE_WITHDRAWING:
case FSCACHE_COOKIE_STAGE_RELINQUISHING:
break;
default:
trace = cookie->trace;
if (trace)
trace_fscache_stage_set(cookie, stage, fscache_cookie_stage_set_general);
cookie->stage = stage;
changed = true;
break;
}
spin_unlock(&cookie->lock);
if (changed) {
wake_up_var(&cookie->stage);
if (trace)
trace_fscache_stage_set(cookie, stage, fscache_cookie_stage_woken);
}
}

In dmesg, we see:

[ 2977.269392] FS-Cache: __fscache_begin_operation: cookie stage change wait timed out: cookie->stage=5 stage=3
[ 2977.269398] FS-Cache: O-cookie c=000dd7d9 [p=000dce05 fl=0 nc=0 na=2]
[ 2977.276673] FS-Cache: O-cookie d=CEPH.in
[ 2977.277796] FS-Cache: O-cookie o=201635
[ 2977.278552] FS-Cache: O-key=[16] '352b930000010000feffffffffffffff'

In the trace log, we see:

<...>-425461 [015] .... 2967.107716: fscache_stage_wait: c=000dd7d9 WAIT beginop now=3 arg=3
kfsc/7-896 [005] .... 2967.107717: fscache_stage_set: c=000dd7d9 SET general now=3 arg=5
<...>-425461 [015] .... 2967.107718: fscache_stage_wait: c=000dd7d9 WAITING now=3 arg=3
<...>-425461 [015] .... 2967.107718: fscache_stage_wait: c=000dd7d9 WAITING now=5 arg=3
kfsc/7-896 [005] .... 2967.107719: fscache_stage_set: c=000dd7d9 WOKEN now=5 arg=5
<...>-425461 [015] .N.. 2977.452925: fscache_stage_wait: c=000dd7d9 WAIT failed now=5 arg=3

So we can see the wait beginning, with the stage at LOOKING_UP (3). Note that
we're actually waiting for stage!=3.

Next, on cpu 5, we're about to set the stage ("SET" line), and then we've
completed the wakeup 2uS later ("WOKEN" line).

Whilst the stage-change is happening, the wait condition check happens twice
in quick succession (two "WAITING" lines). The first is presumably right at
the beginning of wait_var_event() and the second is inside the wait loop. In
the first check, the stage is still 3, but in the second call of the condition
function, it reads it as 3 (still LOOKING_UP), and so meets the condition to
do the tracepoint - which then sees it as 5 (ACTIVE), but this doesn't affect
the result of the condition function.

[*] Note that in the traceline, cookie->stage as read inside the tracepoint is
the "now" number and the second is fed in by the caller.

wait_var_event()/wait_var_event_timeout() then goes to sleep - which it
shouldn't, because either the task state should've been cleared, or the
condition function should prevent it.

Any thoughts on how to debug this further? Or if I've just done something
silly?

Also, do I need to use WRITE_ONCE() or smp_store_release() when setting
cookie->store, since I'm reading it outside of the spinlock, albeit through
READ_ONCE().

Thanks,
David