Re: [PATCH v1] kernel: add a simple timer based software watchpoint

From: Thomas Gleixner

Date: Thu Jun 25 2026 - 17:31:19 EST


On Wed, Jun 24 2026 at 19:12, Feng Tang wrote:
> On Wed, Jun 24, 2026 at 11:04:26AM +0200, Thomas Gleixner wrote:
>> On Tue, Jun 23 2026 at 16:26, Feng Tang wrote:
>> > On Mon, Jun 22, 2026 at 04:13:37PM +0200, David Hildenbrand (Arm) wrote:
>> > As discussed in RFC patch review, this debug feature is similar to
>> > soft/hard lockup detector and task-hung detector, should I make the control
>>
>> How is this very specialized ad hoc debug magic in any way similar to
>> generally useful and just working debug mechanisms like the lockup or
>> hung detector? Those are just turned on, do not need a boatload of
>> command line parameters and are generally useful.
>
> That's right. They are very useful and easy to use, as a big part
> of my time is dealing with all kinds of lockup/task-hung bugs :)
>
>
>> Your debug magic is a workaround for a dysfunctional hardware debugger,
>> which means it's going to be used by three people twice a year if at
>> all. Seriously?
>
> The HW debugger is a Lauterbach TRACE32 one, and may not be dysfunctional,
> as it works well for virtual address watchpoints and other daily jobs. My
> own guess is that it can only see virtual address and doesn't have the

Guessing is the worst engineering principle as I told you before.

> ability to do the virtual to physical address translation instantly to
> watch a _physical_ address. So I guess, not able to watchpoint a physical
> address may be common for HW debuggers (I could be very wrong).

If the hardware debugger and the underlying CPU facility (ETM on ARM64
IIRC) do not support triggers on physical addresses and you already
concluded from other information that the problem is in the BIOS, then
tracing the kernel with its virt/phys translation is not going to
work. You obviously have to use the BIOS translation which might be very
different, no?

> As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
> we also used this method to solve an issue where a BIOS runtime service
> was corrupting the ACPI_ENABLE register.

Again, if the BIOS runtime service changes virt/phys translation then you
have to trace the BIOS not the kernel. It's pretty obvious, no?

> Then I tried to recall some old memory corruption issues I've met before,
> and think about if there is some that could be captured by this method,
> one example was a static global array overflow issue, which corrupted
> some other global variables which were next to it in kernel bss segment.

No. This is just all catching the problem after the fact with no trace
and conclusive information about the root cause. The tools are there,
you just have to use them correctly. But sure, creating magic hacks which
by chance give you the same information is way better...

> But yes, as you pointed out, the frequency is low (all of the 3 happened
> in the past 6 months) for myself. And my wild guess is there could be
> other developers that meet similar issues :)

Can you for once have an informed opinion instead of wild guesses?

Thanks,

tglx