Re: [dm-devel] [4.4, 4.5, 4.6] Regression: encrypted swap (dm-crypt) freezes system while under memory pressure and swapping

From: Ondrej Kozina
Date: Thu May 05 2016 - 11:54:35 EST


On 04/21/2016 09:48 AM, Matthias Dahl wrote:
Hello @all,

first of all, I sent this exact msg also to the lkml a few days ago but
since I received no reaction, I thought this list might be a better
place for this problem -- or I might at least reach the right persons to
get this fixed/debugged/... . :-)

Recently I started seeing freezes while compiling bigger packages that
do require lots of memory (I use Gentoo).

The freezes where in the form that while in Xorg, the system would just
completely hang -- no magic sysrq keys, no mouse movement, nothing.
While in a terminal, one could still issue a magic sysrq command but it
would only echo the command itself but not execute it -- except for the
reboot command. So there was no way to get a backtrace or states or
anything alike.

After debugging this further, it became clear that the system always
froze when it started hitting the encrypted swap. It worked absolutely
fine as soon as you took the encryption out of the picture.

My setup then was: A 8 GiB swap on S/W-RAID5 for my 8 GiB physical ram
that was encrypted with dm-crypt and AES256-CBC-ESSIV.

I debugged this further and changed my setup to several swap partitions
on the physical disks w/o a RAID in-between to isolate the culprit. This
made no difference -- neither did switching ciphers and so forth.

Since this setup had worked for ages, I started looking into what had
changed the weeks before and noticed I had done several kernel upgrades.

To make a long story short, here my findings:

4.3.0, 4.4.0-final, 4.5-rc1 to 4.5-rc2:
No problems, except for the usual sluggishness with encrypted swap that
has been there since forever (it is like the encryption has the highest
priority and takes over the system, e.g. no terminal input is accepted
on a different terminal while high memory pressure is going on which is
in contrast with the encrypted swap, where this still works fine).

4.4.x, >= 4.5-rc3 (incl. 4.6-rcX and master):
The system freezes under memory pressure as soon as it starts swapping
out. 4.6 master is an exception here, it still responds to magic sysrq
commands properly but after some time though completely freezes hard.

I hadn't had the time to test all 4.3.x and 4.4.x releases, I am afraid.
What I can say though is that 4.4.6 is affected as well.

A git bisect between 4.5-rc2 and 4.5-rc3, lead me to the following commit:

564e81a57f9788b1475127012e0fd44e9049e342 is the first bad commit
commit 564e81a57f9788b1475127012e0fd44e9049e342
Author: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Fri Feb 5 15:36:30 2016 -0800

mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any
progress

This is obviously not the real culprit in my opinion but a trigger.
Reverting that commit on 4.5.1 for example, makes the encrypted swap
work flawlessly again (except for the usual system sluggishness).

Reverting it on 4.6 master@c3b46c73264b03000d1e18b22f5caf63332547c9,
does show a different picture though: The system freezes while the sysrq
keys do still work and usually recovers after some while if the
corresponding task that triggered the swapping in the first place, gets
killed. It sometimes does a bit of swapping, and sometimes don't while
it hangs there -- while usually with the other kernels in the "frozen"
state, the swapping stops completely.

I managed to get a bit more information out of 4.6 master though since
it sometimes recovers after quite some time and I can copy backtraces
and such to the disk, which I have attached.

I hope this helps in finding the real issue behind this. I am sorry I
could not provide more information but this has been a rather time
consuming task thus far. :-)

If there is anything else I can do to help or test, please let me know
and I will gladly do so.

Thanks in advance.

So long,
Matthias


Hello,

I second the observation that something is wrong and it doesn't seem to be related to dm-crypt target. My test setup is as follows:

2 CPUs
system ram: 1 GB
swap on top of dm-crypt: 2 GB

Whenever I start workload that consumes more memory than system ram but much less than total memory including the swap I end with following OOM message that I found to be premature and unexpected:
- https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg

the important snippet in-before the oom:

active_anon:4096kB inactive_anon:4636kB, writeback:4636kB

and also:

Free swap = 2039832kB
Total swap = 2097148kB

you can find more details in sample-* directories located in:
https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/

each sample directory contains stats collection taken approximately each second after I started the workload (it's a script from http://linux-mm.org/OOM site).

For me OOM killer message can be observed starting with this commit:
commit f9054c70d28bc214b2857cf8db8269f4f45a5e23
Author: David Rientjes <rientjes@xxxxxxxxxx>
Date: Thu Mar 17 14:19:19 2016 -0700

mm, mempool: only set __GFP_NOMEMALLOC if there are free elements

(...)

Kind regards
Ondrej