RE: [PATCH v2 0/5] ACPI / EC: Add reference counting for requests and cleans up the grace periods support.

From: Zheng, Lv
Date: Mon Feb 16 2015 - 01:42:19 EST


Hi,

> From: Wysocki, Rafael J
> Sent: Thursday, February 12, 2015 9:18 AM
>
> On 2/9/2015 3:23 AM, Zheng, Lv wrote:
> > Hi, Rafael
> >
> >> From: Rafael J. Wysocki [mailto:rjw@xxxxxxxxxxxxx]
> >> Sent: Friday, February 06, 2015 9:27 AM
> >>
> >> On Friday, February 06, 2015 08:57:37 AM Lv Zheng wrote:
> >>> This patchset contains 3 cleanups related to the EC driver:
> >>> 1. Command flushing (command grace period)
> >>> This patchset flushes EC commands before suspending/resuming, so that
> >>> there won't be timeout for the incomplete commands after resuming.
> >>> 2. Query flushing (query grace period)
> >>> This patchset flushes EC event queries before suspending/resuming, so
> >>> that there won't be broken events remained in the firmware queue.
> >>> 3. Command storming prevention
> >>> This patchset corrects command storming prevention logic because of
> >>> the GPE raw handler mode.
> >>> The request reference count debugging messages can be used to detect broken
> >>> EC transactions. It should always drop to 1 when the driver is idle during
> >>> the runtime.
> >>>
> >>> Note that after flushing before suspending, EC GPE is still enabled to keep
> >>> the old behavior.
> >>>
> >>> Lv Zheng (5):
> >>> ACPI/EC: Introduce STARTED/STOPPED flags to replace BLOCKED flag.
> >>> ACPI/EC: Add command flushing support.
> >>> ACPI/EC: Refine command storm prevention support.
> >>> ACPI/EC: Add query flushing support.
> >>> ACPI/EC: Add GPE reference counting debugging messages.
> >>>
> >>> drivers/acpi/ec.c | 295 ++++++++++++++++++++++++++++++++++++++++-------
> >>> drivers/acpi/internal.h | 1 +
> >>> 2 files changed, 254 insertions(+), 42 deletions(-)
> >> So this is on top of the EC patches you sent previously, right?
> > Yes.
> >
> > The sequence is:
> > ACPICA 20150204 release: http://www.spinics.net/lists/linux-acpi/msg55623.html
> > ACPI EC GPE race fixes: http://www.spinics.net/lists/linux-acpi/msg55633.html
> > And this series.
>
> Unfortunately, patch [4/5] from this series broke system suspend on Acer
> Aspire S5 for me.
>
> Apparently, the box hangs hard during the last stage of suspend (after
> offlining nonboot CPUs).
>
> The issue is 100% reproducible and reverting the patch (along with [5/5]
> that depends on it) fixes the problem, so I'm going to push the reverts
> to Linus on Friday.

GPE _Lxx need EC transactions to complete the GPE handling.
While during runtime, EC transactions need the EC GPE to be enabled to run in the interrupt mode.
During suspending, EC transactions will be running in polling mode.

Note that in the polling mode, originally we didn't check EC event (SCI_EVT).
This becomes an issue for Samsung platforms, so we check it even in the polling mode in advance_transaction():
if (status & ACPI_EC_FLAG_SCI)
acpi_ec_submit_detection(ec);
We need to take care of the old hidden logics by doing this improvement.

We originally have the following sequence for suspending:
acpi_disable_all_gpes()
EC GPE disabled
acpi_os_wait_events_complete()
Flush GPE work items (CPU0, including _Lxx)
_Lxx invokes EC transactions in the EC polling mode
Flush Notify work items (CPU0, including old QR_EC because QR_EC is also queued up using acpi_os_execute(), _Qxx, Notify)
---
we may still see EC transactions (excluding new QR_EC because SCI_EVT is not checked in polling mode) handled in polling mode submitted here.
---
acpi_ec_block_transactions()
Flush EC transactions (excluding QR_EC)
prevent further EC transactions (causing AE_BAD_PARAMETER for EC region accesses).

After applying all necessary improvements (excluding PATCH 4) we got the following suspend sequence:
acpi_disable_all_gpes()
EC GPE disabled
acpi_os_wait_events_complete()
Flush GPE work items (CPU0, including _Lxx)
_Lxx invokes EC transactions in the EC polling mode
Flush Notify work items (CPU0, including _Qxx, Notify, excluding old QR_EC because QR_EC is queued up in a separate workqueue)
---
we may still see EC transactions (including new QR_EC because SCI_EVT is checked in polling mode) handled in polling mode submitted here.
---
acpi_ec_block_transactions()
Flush EC transactions (including QR_EC)
prevent further EC transactions (causing AE_BAD_PARAMETER for EC region accesses).
The above sequence is very similar to old behavior. The only differences are caused by the QR_EC flushing order:
1. After acpi_os_wait_events_complete(), originally no new QR_EC or _Qxx evaluations while now there may be new QR_EC or _Qxx evaluations queued up.
2. Originally AE_BAD_PARAMETER can only be seen to non _Qxx issued EC transactions. While now we may also see this to be returned for _Qxx issued EC transactions.
So IMO, before root causing the issue, it's OK to revert just [PATCH 4].
And we can even add acpi_os_wait_events_complete() after acpi_ec_block_transactions(). So that it's really no difference.
I doubt this is useful, because _Qxx will run very quickly to an exceptional end and this can even be freezed.

After applying all necessary improvements (including PATCH 4) we got the following suspend sequence:
acpi_disable_all_gpes()
EC GPE disabled
acpi_os_wait_events_complete()
Flush GPE work items (CPU0, including _Lxx)
_Lxx invokes EC transactions in the GPE polling mode
Flush Notify work items (CPU0, including _Qxx, Notify, excluding old QR_EC)
---
we may still see EC transactions (including new QR_EC) handled in polling mode submitted here.
---
acpi_ec_block_transactions()
Flush EC transactions (including QR_EC and ensure QR_EC to be issued for SCI_EVT)
prevent further EC transactions.
The only difference is QR_EC is ensured to be issued after detecting SCI_EVT during suspending.

I didn't see issues in doing so unless the patch itself has issues.
If we see hang in suspend/resume test, in most cases, it should happen in acpi_ec_block_transactions().
Debugging this would require EC refcnt messages in the suspend kernel log to catch the breakage.
So [PATCH 5] will be helpful in debugging this issue.

The patch 4 might be lacking of test cases. Let me check it again.

> I can send you an acpidump output from the affected machine, but I won't
> be able to do any testing on it before Feb 23, so reverting is the only
> viable option for the time being.

No problem.

> Please let me know if you want the acpidump output.

Let me first check the patch again, in most of the cases, the bug could be reviewed out.
I can also try to reproduce this locally using other Acer modules, so please send me the .config and acpidump output.

Thanks and best regards
-Lv