Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support

From: Limonciello, Mario
Date: Mon Aug 21 2023 - 14:00:37 EST

Next message: Rafael J. Wysocki: "Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support"
Previous message: Mathieu Poirier: "Re: [PATCH v3] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings"
In reply to: Rafael J. Wysocki: "Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support"
Next in thread: Rafael J. Wysocki: "Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support"

On 8/21/2023 12:52 PM, Rafael J. Wysocki wrote:

On Mon, Aug 21, 2023 at 7:35 PM Limonciello, Mario
<mario.limonciello@xxxxxxx> wrote:

On 8/21/2023 12:29 PM, Rafael J. Wysocki wrote:

On Mon, Aug 21, 2023 at 7:17 PM Limonciello, Mario
<mario.limonciello@xxxxxxx> wrote:

On 8/21/2023 12:12 PM, Rafael J. Wysocki wrote:
<snip>

I was just talking to some colleagues about PHAT recently as well.

The use case that jumps out is "system randomly rebooted while I was
doing XYZ". You don't know what happened, but you keep using your
system. Then it happens again.

If the reason for the random reboot is captured to dmesg you can cross
reference your journal from the next boot after any random reboot and
get the reason for it. If a user reports this to a Gitlab issue tracker
or Bugzilla it can be helpful in establishing a pattern.

The below location may be appropriate in that case:
/sys/firmware/acpi/

Yes, it may.

We already have FPDT and BGRT being exported from there.

In fact, all of the ACPI tables can be retrieved verbatim from
/sys/firmware/acpi/tables/ already, so why exactly do you want the
kernel to parse PHAT in particular?

It's not to say that /sys/firmware/acpi/PHAT isn't useful, but having
something internal to the kernel "automatically" parsing it and saving
information to a place like the kernel log that is already captured by
existing userspace tools I think is "more" useful.

What existing user space tools do you mean? Is there anything already
making use of the kernel's PHAT output?

I meant that things like systemd already capture the kernel log
ring buffer. If you save stuff like this into the kernel log, it's going
to be indexed and easier to grep for boots that had it.

And why can't user space simply parse PHAT by itself?

There are multiple ACPI tables that could be dumped into the kernel
log, but they aren't. Guess why.

Right; there's no reason it can't be done by userspace directly.
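
To make the userspace option concrete, here is a minimal sketch of a
tool that walks the raw table. It assumes the table is exposed at
/sys/firmware/acpi/tables/PHAT (reading it normally requires root), that
the table starts with the standard 36-byte ACPI header, and that each
record begins with a 2-byte type, a 2-byte length covering the whole
record, and a 1-byte revision. The record layout and the type names are
my reading of the ACPI specification and should be checked against it;
parse_phat() and its output format are hypothetical, not an existing
tool.

```python
import struct
from pathlib import Path

# Standard ACPI table header: signature, length, revision, checksum,
# OEM ID, OEM table ID, OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")  # 36 bytes
# Assumed common PHAT record header: type, length, revision.
RECORD_HEADER = struct.Struct("<HHB")

# Assumed record type names; verify against the ACPI specification.
RECORD_TYPES = {0: "firmware version data", 1: "firmware health data"}


def parse_phat(blob):
    """Return a list of records found in a raw PHAT table."""
    if len(blob) < ACPI_HEADER.size:
        raise ValueError("table is shorter than an ACPI header")
    signature, length = ACPI_HEADER.unpack_from(blob, 0)[:2]
    if signature != b"PHAT":
        raise ValueError("not a PHAT table")

    end = min(length, len(blob))
    offset = ACPI_HEADER.size
    records = []
    while offset + RECORD_HEADER.size <= end:
        rtype, rlen, revision = RECORD_HEADER.unpack_from(blob, offset)
        # Stop on a malformed record instead of looping forever.
        if rlen < RECORD_HEADER.size or offset + rlen > end:
            break
        records.append({
            "type": rtype,
            "name": RECORD_TYPES.get(rtype, "unknown"),
            "revision": revision,
            "payload": blob[offset + RECORD_HEADER.size:offset + rlen],
        })
        offset += rlen
    return records


if __name__ == "__main__":
    path = Path("/sys/firmware/acpi/tables/PHAT")
    try:
        for rec in parse_phat(path.read_bytes()):
            print(rec["type"], rec["name"], len(rec["payload"]), "bytes")
    except (OSError, ValueError) as err:
        print(f"could not read PHAT: {err}")
```

The payloads are left undecoded on purpose; interpreting them depends
on the record type and on vendor-specific data.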

Another way to approach this problem could be to modify tools that
excavate records from a reboot to also get PHAT. For example
systemd-pstore will get any kernel panics from the previous boot from
the EFI pstore and put them into /var/lib/systemd/pstore.

No reason that couldn't be done automatically for PHAT too.

I'm not sure about the connection between the PHAT dump in the kernel
log and pstore.

The PHAT dump would be from the time before the failure, so it is
unclear to me how useful it can be for diagnosing it. However, after
a reboot one should be able to retrieve PHAT data from the table
directly and that may include some information regarding the failure.

Right so the thought is that at bootup you get the last entry from PHAT
and save that into the log.

Let's say you have 3 boots:
X - Triggered a random reboot
Y - Cleanly shut down
Z - Boot after a clean shut down

So on boot Y you would have in your logs the reason that boot X rebooted.

Yes, and the same can be retrieved from the PHAT directly from user
space at that time, can't it?

Yes it can.

On boot Z you would see something about boot Y's reason.

With pstore, the assumption is that there will be some information
relevant for diagnosing the failure in the kernel buffer, but I'm not
sure how the PHAT dump from before the failure can help here?

Alone it's not useful.
I had figured if you can put it together with other data it's useful.
For example if you had some thermal data in the logs showing which
component overheated or if you looked at pstore and found a NULL pointer
dereference.

IIUC, the current PHAT content can be useful. The PHAT content from
boot X (before the failure) which is what will be there in pstore
after the random reboot, is of limited value AFAICS.

Right, you would need to have the pstore logs from your bad boot and
then the dmesg from your current (good) boot to get the info. And you're
right, at that point you could just run a userspace tool that gets the
info instead.

I don't think any of this is necessary in the kernel, I am just
describing the use case.

FWIW, on the patch series, I think that the boot reasons that aren't
unexpected (power button, cold boot, warm boot, cold reset) shouldn't be
INFO either. These should default to debug, and just the unexpected ones
should show up.
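
The log-level split suggested above could be sketched roughly like
this. It is only an illustration in Python of the policy, not code from
the patch: the reason strings, the "watchdog" example, and the function
names are all hypothetical.

```python
import logging

# Reasons treated as expected, taken from the list in this mail.
# The exact strings a platform reports are an assumption.
EXPECTED_REASONS = {"power button", "cold boot", "warm boot", "cold reset"}


def level_for_reason(reason):
    """Expected reasons log at DEBUG; anything else logs at INFO."""
    if reason.strip().lower() in EXPECTED_REASONS:
        return logging.DEBUG
    return logging.INFO


def report_reason(reason, logger):
    """Emit the reset reason at the level chosen by level_for_reason()."""
    logger.log(level_for_reason(reason), "PHAT: reset reason: %s", reason)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("phat")
    report_reason("cold boot", log)  # suppressed at the INFO threshold
    report_reason("watchdog", log)   # hypothetical unexpected reason
```

With the default threshold at INFO, only the unexpected reasons reach
the log, which keeps them easy to spot.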
