Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release
From: Luis Claudio R. Goncalves
Date: Thu Aug 22 2024 - 07:21:09 EST
On Wed, Aug 21, 2024 at 04:03:16PM -0400, Steven Rostedt wrote:
> On Tue, 20 Aug 2024 15:00:01 +0200
> tglozar@xxxxxxxxxx wrote:
>
> > From: Tomas Glozar <tglozar@xxxxxxxxxx>
> >
> > When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD),
> > NULL pointer dereference can be triggered by sending consequent SIGINT
> > and SIGTERM signals to the workload process. That then causes
> > timerlat_fd_release to be called twice in a row, and the second time,
> > hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL
> > dereference.
> >
> > This can be reproduced using rtla:
> > ```
> > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
> > kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
> > [1] 1675
> > [1]+ Aborted (SIGTERM) rtla timerlat top -u -q
> > [1] 1688
> > client_loop: send disconnect: Broken pipe
> > ```
> > triggering the bug:
>
> I'm able to reproduce this with the above. Unfortunately, I can still
> reproduce it after applying this patch :-(
We were able to mitigate the problem (triggered by that command line) simply
by handling SIGTERM in the userspace tool the same way it handles SIGINT. That
was one of the factors that helped the "closing the file descriptor twice"
theory.
Would you mind sharing the backtrace you got? That would also help us
investigating.
> Looking at the code, the logic for handling the kthread seems off. I'll
> spend a little time to see if I can figure it out.
You mean the
+ if (!tlat_var->kthread) {
+ /* the fd has been closed already */
bit or the kthread handling in rtla itself?
As Tomas already said, thank you for testing and reviewing the suggested fix!
Luis