Re: [PATCH v1] kernel: add a simple timer based software watchpoint

From: Feng Tang

Date: Thu Jun 25 2026 - 21:58:56 EST

On Thu, Jun 25, 2026 at 11:30:55PM +0200, Thomas Gleixner wrote:
> On Wed, Jun 24 2026 at 19:12, Feng Tang wrote:
> > On Wed, Jun 24, 2026 at 11:04:26AM +0200, Thomas Gleixner wrote:
> >> On Tue, Jun 23 2026 at 16:26, Feng Tang wrote:
> >> > On Mon, Jun 22, 2026 at 04:13:37PM +0200, David Hildenbrand (Arm) wrote:
> >> > As discussed in RFC patch review, this debug feature is similar to
> >> > soft/hard lockup detector and task-hung detector, should I make the control
> >>
> >> How is this very specialized ad hoc debug magic in any way similar to
> >> generally useful and just working debug mechanism like the lockup or
> >> hung detector? Those are just turned on, do not need a boatload of
> >> command line parameters and are generally useful.
> >
> > That's right. They are very useful and easy to use, as a big part
> > of my time is dealing with all kinds of lockup/task-hung bugs :)
> >
> >
> >> Your debug magic is a workaround for a dysfunctional hardware debugger,
> >> which means it's going to be used by three people twice a year if at
> >> all. Seriously?
> >
> > The HW debugger is a Lauterbach TRACE32 one, and may not be dysfunctional,
> > as it works well for watchpoints on virtual addresses and other daily jobs. My
> > own guess is that it can only see virtual addresses and doesn't have the
>
> Guessing is the worst engineering principle as I told you before.

Aha, indeed. I will avoid guessing like that. Thanks!

> > ability to do the virtual to physical address translation instantly to
> > watch a _physical_ address. So I guess, not able to watchpoint a physical
> > address may be common for HW debuggers (I could be very wrong).
>
> If the hardware debugger and the underlying CPU facility (ETM on ARM64
> IIRC) does not support triggers on physical addresses and you already
> concluded from other information that the problem is in the BIOS, then
> tracing the kernel with its virt/phys translation is not going to
> work. You obviously have to use the BIOS translation which might be very
> different, no?

I didn't explain the issue clearly. The order of events was: we first used
this method to halt the system (a while (1) dead loop) as soon as the memory
corruption was detected, then the silicon engineers gathered hardware traces
and root caused it. Before that, we didn't know it was a BIOS issue, as the
initial symptom was random user space "segmentation fault" errors.
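
A minimal user-space sketch of that detection step, for illustration only.
It is not the code from the patch: the struct and function names below are
hypothetical, and the timer is modelled as plain calls to soft_wp_poll().
Each "tick" compares the watched word with its last known-good value and
records the first poll at which they differ. A kernel version would do
the check from a timer callback and then halt or dump state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; none of these come from the patch under review. */
struct soft_watchpoint {
	const volatile uint64_t *addr;	/* location being watched */
	uint64_t expected;		/* value sampled at init time */
	unsigned long polls;		/* number of checks done so far */
	bool tripped;			/* set once a mismatch is seen */
	unsigned long tripped_at;	/* poll count at first mismatch */
};

/* Sample the current value as the known-good reference. */
static void soft_wp_init(struct soft_watchpoint *wp,
			 const volatile uint64_t *addr)
{
	assert(wp != NULL && addr != NULL);
	wp->addr = addr;
	wp->expected = *addr;
	wp->polls = 0;
	wp->tripped = false;
	wp->tripped_at = 0;
}

/*
 * One simulated timer tick. Returns true once the watched value has
 * been seen to differ from the reference; the caller decides what to
 * do then (halt, dump logs, ...).
 */
static bool soft_wp_poll(struct soft_watchpoint *wp)
{
	assert(wp != NULL);
	wp->polls++;
	if (!wp->tripped && *wp->addr != wp->expected) {
		wp->tripped = true;
		wp->tripped_at = wp->polls;
	}
	return wp->tripped;
}
```

The corruption is only known to have happened somewhere between poll
tripped_at - 1 and poll tripped_at, so the polling period bounds how close
to the offending write the system stops.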

> > As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
> > we also used this method to solve an issue where a BIOS runtime service
> > was corrupting the ACPI_ENABLE register.
>
> Again, if the BIOS runtime service changes virt/phys translation then you
> have to trace the BIOS not the kernel. It's pretty obvious, no?

Similarly, I didn't make it clear that the issue was not about address
translation.

The bug report I got from the test engineers was that the ACPI_ENABLE
register had the right value in the BIOS boot messages, and after booting
to the OS it had changed to a strange value. So initially the suspects were
us OS guys :). We used the "approaching" policy of the method, checked the
kernel logs (we had added many debug ones) from before the corruption was
detected, and found a record of an RTC runtime service call right before
the corruption. We asked a BIOS engineer to check, who root caused it.

So the idea was to find the activities before the "corruption" happened,
and check if there were any clues.

> > Then I tried to recall some old memory corruption issues I've met before,
> > and think about if there is some that could be captured by this method,
> > one example was a static global array overflow issue, which corrupted
> > some other global variables which were next to it in kernel bss segment.
>
> No. This is just all catching the problem after the fact with no trace
> and conclusive information about the root cause. The tools are there,
> you just have to use them correctly. But sure creating magic hacks which
> by chance give you the same information is way better...

This issue was interesting. It showed up as a NULL pointer panic, and I
found that a global variable (in the bss segment) was being corrupted (which
logically shouldn't happen). As it didn't happen on normal platforms, only on
one platform with a special config, we thought it could be silicon related
and sent it to the silicon team. By gathering and analyzing silicon traces,
they root caused it to an array overflow, as the special config makes that
array much longer.

My thought was that if I had used this method, I could have found that the
corruption happened right after the initialization of the module which has
that array.

Thanks,
Feng

> > But yes, as you pointed out, the frequency is low (all of the 3 happened
> > in the past 6 months) for myself. And my wild guess is there could be
> > other developers that meet similar issues :)
>
> Can you for once have an informed opinion instead of wild guesses?
>
> Thanks,
>
> tglx