Re: [PATCH v1] kernel: add a simple timer based software watchpoint

From: Feng Tang

Date: Fri Jun 26 2026 - 10:35:26 EST


On Fri, Jun 26, 2026 at 11:16:08AM +0200, Thomas Gleixner wrote:
> > Similarly, I didn't make it clear that the issue was not about address
> > translation.
> >
> > The bug report I got from test engineers was that the ACPI_ENABLE register
> > had the right value in the BIOS boot messages, and after booting to the OS
> > it was changed to a strange value. So initially the suspects were us OS
> > guys :). We used the "approaching" policy of the method: we checked the
> > kernel logs (we added many debug ones) from before the corruption was
> > detected, and found that right before the corruption there was a record
> > of an RTC runtime service call. We asked a BIOS engineer to check, who
> > root-caused it.
> >
> > So the idea was to find the activities before the "corruption" happened,
> > and check whether there were any clues.
>
> Again. You failed to structure the problem and use the tools correctly.

Yes! For this one, we also concluded that it was not the OS changing it.
As you said, using a HW JTAG debugger to set a watchpoint on that address
for the BIOS (which usually uses identity mapping, so the virtual address
is the same as the physical address) should capture the culprit BIOS code
precisely.

>
> >> > Then I tried to recall some old memory corruption issues I've met before,
> >> > and thought about whether any of them could be captured by this method.
> >> > One example was a static global array overflow, which corrupted some
> >> > other global variables that were next to it in the kernel bss segment.
> >>
> >> No. This is just all catching the problem after the fact with no trace
> >> and conclusive information about the root cause. The tools are there,
> >> you just have to use them correctly. But sure creating magic hacks which
> >> by chance give you the same information is way better...
> >
> > This issue was interesting. It showed up as a NULL pointer panic, and I
> > found that a global variable (in the bss segment) was being corrupted
> > (which shouldn't happen logically). As it didn't happen on normal
> > platforms, only on one platform with a special config, we thought it could
> > be silicon related and sent it to the silicon team, who root-caused it by
> > gathering and analyzing silicon traces: it was an array overflow, as the
> > special config makes that array much longer.
>
> Your debug war stories are amazing, but in the wrong way and do not
> justify to shove a completely ill defined barely usable hack into the
> kernel to be maintained forever.

Yes, I agree with you and David that this implementation is hacky and
difficult to maintain.

As HW JTAG debuggers are expensive and not generally available to
software developers, and much production hardware does not have its
JTAG interface open for security reasons, I think monitoring abnormal
changes to DRAM/MMIO is convenient for debugging memory corruption
issues. Could you give some suggestions on how to redesign this?
Many thanks!

- Feng

> Thanks,
>
> tglx