Re: S4 resume broken since 2.6.39 (3.1, too)

From: Takashi Iwai
Date: Wed Sep 28 2011 - 14:05:58 EST


At Wed, 28 Sep 2011 18:19:34 +0200,
Takashi Iwai wrote:
>
> At Wed, 28 Sep 2011 16:45:45 +0200,
> Takashi Iwai wrote:
> >
> > At Wed, 28 Sep 2011 16:45:48 +0200,
> > Rafael J. Wysocki wrote:
> > >
> > > On Wednesday, September 28, 2011, Takashi Iwai wrote:
> > > > At Wed, 28 Sep 2011 15:28:12 +0200,
> > > > Takashi Iwai wrote:
> > > > >
> > > > > At Wed, 28 Sep 2011 15:26:19 +0200,
> > > > > Rafael J. Wysocki wrote:
> > > > > >
> > > > > > On Wednesday, September 28, 2011, Takashi Iwai wrote:
> > > > > > > At Tue, 27 Sep 2011 18:56:44 +0200,
> > > > > > > Rafael J. Wysocki wrote:
> > > > > > > >
> > > > > > > > On Tuesday, September 27, 2011, Takashi Iwai wrote:
> > > > > > > > > At Mon, 26 Sep 2011 19:26:44 -0700,
> > > > > > > > > Yinghai Lu wrote:
> > > > > > > > > >
> > > > > > > > > > On 09/22/2011 11:11 AM, Takashi Iwai wrote:
> > > > > > > > > >
> > > > > > > > > > > At Thu, 22 Sep 2011 07:33:17 -0700,
> > > > > > > > > > > Yinghai Lu wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On Wed, Sep 21, 2011 at 11:48 AM, Rafael J. Wysocki <rjw@xxxxxxx> wrote:
> > > > > > > > > > >>> It looks like init_memory_mapping() is sometimes called with "end"
> > > > > > > > > > >>> beyond the last mapped PFN and it explodes when we try to write stuff to
> > > > > > > > > > >>> that address during image restoration.
> > > > > > > > > > >>>
> > > > > > > > > > >>> IOW, the Yinghai's assumption that init_memory_mapping() would always be
> > > > > > > > > > >>> called with a "good end" on x86_64 was overomptimistic.
> > > > > > > > > > >>
> > > > > > > > > > >> for 64bit x86, kernel_physical_mapping_init() will use
> > > > > > > > > > >> map_low_page()/call early_memmap() to access ram for page_table that is above
> > > > > > > > > > >> rather last mapped PFN.
> > > > > > > > > > >>
> > > > > > > > > > >> the point is:
> > > > > > > > > > >> on system with 64g, usable ram will be [0,2048m), [4g, 64g)
> > > > > > > > > > >> init_memory_mapping will be called two times for them.
> > > > > > > > > > >> before putting page_table high,
> > > > > > > > > > >> page table will be two parts: one is just below 512M, and one below 2048m.
> > > > > > > > > > >> after putting page_table high,
> > > > > > > > > > >> page table will be two parts: one is just below 2048M, and one below 64G.
> > > > > > > > > > >
> > > > > > > > > > > So, how can this change break S4 resume?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > not sure.
> > > > > > > > > >
> > > > > > > > > > seems resume has it's own page table during transition...
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Any hint for further debugging?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > you may try to insert dead loop in arch/x86/power/hibernate_asm_64.S::restore_image or core_restore_code
> > > > > > > > > >
> > > > > > > > > > to see which part cause reset.
> > > > > > > > >
> > > > > > > > > That's the answer I was afraid of :)
> > > > > > > >
> > > > > > > > I wonder. Does hibernation work on the machine in question with
> > > > > > > > acpi_sleep=nonvs and if so, is the problem reproducible with this
> > > > > > > > setting?
> > > > > > >
> > > > > > > It still shows the same problem. Reboots after a few S4 cycles.
> > > > > > > I'll test kernel S4.
> > > > > >
> > > > > > This most likely means that NVS is not involved.
> > > > >
> > > > > Indeed, kernel S4 also failed both with and without acpi_sleep=nonvs.
> > > > >
> > > > > > Have you tried to revert commit d1ee433 (x86, trampoline: Use the unified
> > > > > > trampoline setup for ACPI wakeup)?
> > > > >
> > > > > Not yet. I'll check it later.
> > > >
> > > > Unfortunately the commit can't be reverted cleanly.
> > > > The commits around it seem mixed up together.
> > > >
> > > > If you have a patch applicable to 3.0 or 3.1 kernel, let me know.
> > >
> > > No, I don't, but can you simply go back to that commit and see
> > > if the issue is reproducible with HEAD = d1ee433, and if it is,
> > > see if it still will be reproducible after going one commit back from there?
> >
> > Ah, that should be easy. I'll give it a try.
>
> It turned out to be non-trivial. At the commit d1ee4335
> [x86, trampoline: Use the unified trampoline setup for ACPI wakeup],
> I can build by cherry-picking the binutils fix commit 2ae9d293
> [x86: Fix binutils-2.21 symbol related build failures],
> then merge Yinghai's patch series, the commit d2137d5af.
> The test failed, as expected; S4 resume reboots.
>
> Then going back to the commit 4822b7fc
> [x86, trampoline: Common infrastructure for low memory trampolines]
> it's no longer buildable. This commit alone breaks too many things
> and requires the stuff in the commit above d1ee4335.
>
> Backing off one would go to 2.6.38-rc5. Hmm.
>
> Anyway I try to check whether 2.6.38-rc5 + merge d2137d5af would cause
> any issue.

Surprisingly this causes also the same problem.

If my previous test -- 2.6.37+Yinghai's patches didn't show the
problem -- is correct, it means that some change in 2.6.38 reacted
badly with Yinghai's patches, not about 2.6.39. I'll check tomorrow
again whether this observation is really correct.


Takashi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/