Re: [BUG] Page allocation failures with newest kernels

From: Robin Murphy
Date: Tue May 31 2016 - 06:17:15 EST

On 31/05/16 04:02, Marcin Wojtas wrote:

After rebasing platform support for two different ARMv8 SoCs from a v4.1
baseline to v4.4, it turned out that stressed systems tend to hit page
allocation failures related to creating new slabs.

Steps to reproduce:
- use a SATA drive (on-board or over PCIe) with two 50G btrfs partitions
- run a couple of loops of the following script:
# $1: drive letter suffix (as in /dev/sd${1}), $2: number of loops
i=0
mount /dev/sd${1}1 /mnt
mount /dev/sd${1}2 /mnt2
while [[ $i -lt ${2} ]]
do
    echo -e "i = ${i}\n"
    dd if=/dev/zero of=/mnt/3g bs=3M count=1024 &
    dd if=/dev/zero of=/mnt/2g bs=2M count=1024 &
    dd if=/dev/zero of=/mnt/1g bs=1M count=1024 &
    dd if=/dev/zero of=/mnt2/2g bs=2M count=1024 &
    dd if=/dev/zero of=/mnt2/1g bs=1M count=1024 &
    dd if=/dev/zero of=/mnt2/3g bs=3M count=1024
    let "i++"
done

The issue also reproduces on v4.6. Problems usually occur within the
first iteration; the remaining iterations complete without errors, and
the kernel remains stable. I was also told that page allocation
problems were observed on a Marvell ARMv7 platform (Armada 38x) as well.

I remember there were some issues around 4.2 with the revision of the arm64 atomic implementations affecting the cmpxchg_double() in SLUB, but those should all be fixed (and the symptoms tended to be considerably more fatal). A stronger candidate would be 97303480753e (which landed in 4.4), which has various knock-on effects on the layout of SLUB internals - does fiddling with L1_CACHE_SHIFT make any difference?
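
To make the L1_CACHE_SHIFT question a bit more concrete, here is a rough userspace sketch (Python, not kernel code). It assumes the commit in question raises L1_CACHE_SHIFT from 6 to 7 on arm64, assumes 4 KiB pages, and assumes cache-line-aligned objects are simply padded to a full line. It ignores all real SLUB metadata, and the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def round_up(size, align):
    """Round size up to the next multiple of align (a power of two)."""
    return (size + align - 1) & ~(align - 1)


def objects_per_page(size, l1_cache_shift):
    """Objects fitting in one page if each is padded to the cache line size."""
    line = 1 << l1_cache_shift
    return PAGE_SIZE // round_up(size, line)


if __name__ == "__main__":
    print("size  shift=6  shift=7")
    for size in (8, 64, 96, 192):
        print(f"{size:4d}  {objects_per_page(size, 6):7d}  "
              f"{objects_per_page(size, 7):7d}")
```

Under those assumptions, small line-aligned objects take up to twice the memory per object with the larger shift, which is the kind of layout change that could plausibly shift allocation pressure.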


About the debugging itself: after adding the simplest possible trace
event in trace/events/kmem.h (a single u64 argument for a counter or
any other number), both v4.1 and v4.4 showed that the following
'unlikely' branch in __alloc_pages_nodemask() is taken a huge number of
times during the test (~250k times in v4.1 and ~570k in v4.4 per script
loop):
	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
	if (unlikely(!page)) {
		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
	}

A further difference shows up in __alloc_pages_slowpath().
warn_alloc_failed() (the routine responsible for printing the page
allocation failure message) is reached via the following condition:
	if (!can_direct_reclaim) {
		goto nopage;
	}
This happens ~5 times in v4.1 and ~40 times in v4.4 per script loop.
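
For reference, the approximate counts quoted above can be put side by side. This is only a back-of-the-envelope sketch in Python using the rounded figures from this mail; the dictionary layout and function names are illustrative.

```python
# Approximate per-loop counts quoted in the report (rounded figures).
counts = {
    "v4.1": {"slowpath": 250_000, "nopage": 5},
    "v4.4": {"slowpath": 570_000, "nopage": 40},
}


def ratio(key):
    """Growth factor of a counter from v4.1 to v4.4."""
    return counts["v4.4"][key] / counts["v4.1"][key]


def nopage_fraction(version):
    """Share of slowpath entries that end up at the nopage label."""
    c = counts[version]
    return c["nopage"] / c["slowpath"]


if __name__ == "__main__":
    print(f"slowpath entries grew x{ratio('slowpath'):.2f}")
    print(f"nopage hits grew      x{ratio('nopage'):.1f}")
    for version in counts:
        print(f"{version}: {nopage_fraction(version):.6%} of slowpath entries")
```

With these figures the nopage hits grow faster than the slowpath entries themselves, so the increase is not explained by the slowpath simply being entered more often.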

Printing the message can, however, be suppressed by the following
condition:
	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
	    debug_guardpage_minorder() > 0)
Only the first two checks are relevant. Since the ratelimit is derived
directly from CONFIG_HZ, and this parameter differs between v4.1 and
v4.4 (100 vs 250; CONFIG_SCHED_HRTICK is also enabled only in v4.4), I
swapped the configs, but the behavior did not change.
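
One possible reading of the HZ experiment is sketched below as a toy Python model, not the kernel implementation. It assumes nopage_rs uses the default ratelimit parameters (an interval of 5 * HZ jiffies and a burst of 10). The class and function names are hypothetical, and the window handling is deliberately simplified.

```python
DEFAULT_BURST = 10  # assumption: default ratelimit burst


class RateLimit:
    """Toy burst-per-interval limiter; all times are in jiffies."""

    def __init__(self, hz, burst=DEFAULT_BURST):
        self.interval = 5 * hz  # assumption: default interval is 5 * HZ
        self.burst = burst
        self.begin = None
        self.printed = 0

    def allow(self, now):
        """Return True if a message at jiffy 'now' would be printed."""
        if self.begin is None or now - self.begin >= self.interval:
            self.begin = now  # start a new window
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        return False


def messages_printed(hz, event_times_s):
    """Count messages that pass the limiter for events given in seconds."""
    rl = RateLimit(hz)
    return sum(rl.allow(int(t * hz)) for t in event_times_s)


if __name__ == "__main__":
    burst_of_failures = [i * 0.025 for i in range(40)]  # 40 events in ~1 s
    for hz in (100, 250):
        print(hz, messages_printed(hz, burst_of_failures))
```

Because the interval scales with HZ, the window is 5 seconds of wall-clock time for either setting, so in this model swapping CONFIG_HZ would not be expected to change how many messages get through.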

Within the 'faulty' revision there is also a difference depending on
the root filesystem used: with buildroot the dumps occur, but with the
same test under Ubuntu the failure output never appears (and it's not a
question of dmesg level :)). Comparing the contents of /proc/sys/vm
didn't show anything meaningful.

I tried to analyze the changes under mm/ between v4.1 and v4.4 that
might cause such a difference, but wasn't able to find what is causing
the issue. Has anyone encountered such problems in recent revisions? I
would be very grateful for any hint or comment. Also, if any other data
can be captured, please let me know.

Best regards,
Marcin Wojtas

