Re: [RFC][PATCH v2 00/31] timers: Use del_timer_shutdown() before freeing timers

From: Steven Rostedt
Date: Fri Oct 28 2022 - 14:50:00 EST


Trond,

I'm looking at a commit from 2005:

0f9dc2b16884b ("RPC: Clean up socket autodisconnect")

Cancel autodisconnect requests inside xprt_transmit() in order to avoid
races.
Use more efficient del_singleshot_timer_sync()


I'm working on adding a "shutdown" state to timers, making it required for
freeing the timer. This is to address the numerous bugs we hit where timers
get rearmed just before freeing and then cause a crash in the timer code,
without knowing what timer it was that caused it.

Having a specific shutdown state for timers will remove this problem
because if something tries to rearm a shutdown timer, it will fail and a
WARN_ON_ONCE() is triggered. See below in the "reply" part for a
description of this effort.

The reason for this email, is because that WARN_ON_ONCE() triggered on the
mod_timer() from:

static void
xprt_schedule_autodisconnect(struct rpc_xprt *xprt)
__must_hold(&xprt->transport_lock)
{
xprt->last_used = jiffies;
if (RB_EMPTY_ROOT(&xprt->recv_queue) && xprt_has_timer(xprt))
mod_timer(&xprt->timer, xprt->last_used + xprt->idle_timeout);
}

That's because xptr->timer was shutdown due to:

int
xprt_request_enqueue_receive(struct rpc_task *task)
{
[..]
/* Turn off autodisconnect */
del_singleshot_timer_sync(&xprt->timer);
return 0;
}

Now singleshot means just that. It's a single shot and calling
del_singleshot_timer_sync() will shut it down so that it can be freed. That
also means that it can no longer be re-armed.

I'm not sure what you meant by "Use more efficient del_singleshot_timer_sync()"
but I'm guessing since that was written in 2005, it is no longer relevant,
and del_timer_sync() should now be used.

After replacing that with del_timer_sync(), the warning goes away.

I just want to confirm that's OK with you.

Thanks,

-- Steve


On Thu, 27 Oct 2022 11:05:25 -0400
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> Back in April, I posted an RFC patch set to help mitigate a common issue
> where a timer gets armed just before it is freed, and when the timer
> goes off, it crashes in the timer code without any evidence of who the
> culprit was. I got side tracked and never finished up on that patch set.
> Since this type of crash is still our #1 crash we are seeing in the field,
> it has become a priority again to finish it.
>
> This is v2 of that patch set. Thomas Gleixner posted an untested version
> that makes timer->function NULL as the flag that it is shutdown. I took that
> code, tested it (fixed it up), added more comments, and changed the
> name to del_timer_shutdown() as Linus had asked. I also converted it to use
> WARN_ON_ONCE() instead of just WARN_ON() as Linus asked for that too.
>
> (Thomas, you never added a SoB, so I only added a link to your email
> in that commit. But as this will likely go through your tree anyway,
> I'm sure you'll have your SoB on all these).
>
> I then created a trivial coccinelle script to find where del_timer*()
> is called before being freed, and converted them all to del_timer_shutdown()
> (There was a couple that still used del_timer() instead of del_timer_sync()).
>
> I also updated DEBUG_OBJECTS_TIMERS to check from where the timer is ever
> armed, to calling of del_timer_shutdown(), and it will trigger if a timer
> is freed in between. The current way is to only check if the timer is armed,
> but that means it only triggers if the race condition is hit, and with
> experience, it's not run on enough machines to catch all of them. By triggering
> it from the time the timer is armed to the time it is shutdown, it catches
> all potential cases even if the race condition is not hit.
>
> I went though the result of the cocinelle script, and updated the locations.
> Some locations were caught by DEBUG_OBJECTS_TIMERS as the coccinelle script
> only checked for timers being freed in the same function as the del_timer*().
>
> V1 is found here: https://lore.kernel.org/all/20220407161745.7d6754b3@xxxxxxxxxxxxxxxxxx/
>
> Here's the original text of that version:
>
> [
> This is an RFC patch. As we hit a few bugs were del_timer() is called
> instead of del_timer_sync() before the timer is freed, and there could
> be bugs where even del_timer_sync() is used, but the timer gets rearmed,
> I decided to introduce a "del_timer_free()" function that can be used
> instead. This will at least educate developers on what to call before they
> free a structure that holds a timer.
>
> In this RFC, I modified hci_qca.c as a use case, even though that change
> needs some work, because the workqueue could still rearm it (I'm looking
> to see if I can trigger the warning).
>
> If this approach is acceptable, then I will remove the hci_qca.c portion
> from this patch, and create a series of patches to use the
> del_timer_free() in all the locations in the kernel that remove the timer
> before freeing.
> ]
>
> We are hitting a common bug were a timer is being triggered after it is
> freed. This causes a corruption in the timer link list and crashes the
> kernel. Unfortunately it is not easy to know what timer it was that was
> freed. Looking at the code, it appears that there are several cases that
> del_timer() is used when del_timer_sync() should have been.
>
> Add a del_timer_free() that not only does a del_timer_sync() but will mark
> the timer as freed in case it gets rearmed, it will trigger a WARN_ON. The
> del_timer_free() is more likely to be used by developers that are about to
> free a timer, then using del_timer_sync() as the latter is not as obvious
> to being needed for freeing. Having the word "free" in the name of the
> function will hopefully help developers know that that function needs to
> be called before freeing.
>
> The added bonus is the marking of the timer as being freed such that it
> will trigger a warning if it gets rearmed. At least that way if the system
> crashes on a freed timer, at least we may see which timer it was that was
> freed.
>