Re: [PATCH v1] kernel: add a simple timer based software watchpoint
From: Thomas Gleixner
Date: Fri Jun 26 2026 - 05:21:10 EST

On Fri, Jun 26 2026 at 09:56, Feng Tang wrote:
> On Thu, Jun 25, 2026 at 11:30:55PM +0200, Thomas Gleixner wrote:
>> > ability to do the virtual to physical address translation instantly to
>> > watch a _physical_ address. So I guess not being able to set a watchpoint
>> > on a physical address may be common for HW debuggers (I could be very wrong).
>>
>> If the hardware debugger and the underlying CPU facility (ETM on ARM64
>> IIRC) do not support triggers on physical addresses and you already
>> concluded from other information that the problem is in the BIOS, then
>> tracing the kernel with its virt/phys translation is not going to
>> work. You obviously have to use the BIOS translation which might be very
>> different, no?
>
> I didn't explain the issue clearly. The order of solving this issue was:
> we first used this method to halt the system (a while (1) dead loop) when
> the memory corruption was detected, silicon engineers gathered hardware
> traces, and then root-caused it. Before that, we didn't know it was a BIOS issue,
> as the initial symptom was random user space "segmentation fault"

Sure, the initial symptom was a user space fault and you could not
explain it. But you really don't need your magic hack to figure out that
it tripped over a corrupted byte in the zero page or wherever.

Once you have that figured out and established that it's reproducible
then you add a watchpoint on that address in the kernel which won't
trigger. So that excludes the kernel and points to the BIOS, which in
turn makes you put a watchpoint on the BIOS translation.

If you need that hack to decode it, then you should rethink your
approach to structured problem analysis and deduction.

>> > As in https://lore.kernel.org/lkml/ajkuf08Cj0Se4P_0@U-2FWC9VHC-2323.local/,
>> > we also used this method to solve an issue where a BIOS runtime service
>> > corrupted the ACPI_ENABLE register.
>>
>> Again, if the BIOS runtime service changes virt/phys translation then you
>> have to trace the BIOS not the kernel. It's pretty obvious, no?
>
> Similarly, I didn't make it clear that the issue was not about address
> translation.
>
> The bug report I got from test engineers was that the ACPI_ENABLE register
> had the right value in the BIOS boot message, and after booting to the OS, it
> was changed to a strange value. So initially the suspects were us OS guys :).
> And we used the "approaching" policy of the method, checked the kernel
> logs (we added many debug ones) before the corruption was detected, and
> found that right before the corruption, there was a record of an RTC runtime
> service call, and asked the BIOS engineer to check, which root-caused it.
>
> So the idea was to find the activities before the "corruption" happened,
> and check if there were any clues.

Again. You failed to structure the problem and use the tools correctly.

>> > Then I tried to recall some old memory corruption issues I've met before,
>> > and thought about whether some of them could be captured by this method.
>> > One example was a static global array overflow issue, which corrupted
>> > some other global variables which were next to it in the kernel bss segment.
>>
>> No. This is just all catching the problem after the fact with no trace
>> and conclusive information about the root cause. The tools are there,
>> you just have to use them correctly. But sure creating magic hacks which
>> by chance give you the same information is way better...
>
> This issue was interesting. It showed up as a NULL pointer panic, and I
> found it was a global variable (in the bss segment) being corrupted (which
> shouldn't happen logically). As it didn't happen on normal platforms, but on
> one platform with a special config, we thought it could be silicon related,
> and sent it to the silicon team, who root-caused it by gathering/analyzing
> silicon traces to be an array overflow issue, as the special config makes
> that array much longer.

Your debug war stories are amazing, but in the wrong way and do not
justify shoving a completely ill-defined, barely usable hack into the
kernel to be maintained forever.

Thanks,
tglx