[Resend] Possible bug in __fragmentation_index()
From: Robert Harris
Date: Fri Feb 02 2018 - 09:17:03 EST
I was planning to annotate the opaque calculation in
__fragmentation_index() but on closer inspection I think there may be a
bug. I could use some feedback.
Firstly, for the case of fragmentation and ignoring the scaling,
__fragmentation_index() purports to return a value in the range 0 to 1.
Generally, however, the lower bound is actually 0.5. Here's an
illustration using a zone that I fragmented with selective calls to
__alloc_pages() and __free_pages() --- the fragmentation for order-1 could
not be minimised further yet is reported as 0.5:
# head -1 /proc/buddyinfo
Node 0, zone DMA 1983 0 0 0 0 0 0 0 0 0 0
# head -1 /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 0.500 0.750 0.875 0.937 0.969 0.984 0.992 0.996 0.998 0.999
#
This is significant because 0.5 is the default value of
sysctl_extfrag_threshold, meaning that compaction will not be suppressed
for larger blocks when memory is scarce rather than fragmented. Of
course, sysctl_extfrag_threshold is a tuneable so the first question is:
does this even matter?
The calculation in __fragmentation_index() isn't documented but the
apparent error in the lower bound may be explained by showing that the
index is approximated by
F ~ 1 - 1/N
where N is (conceptually) the number of free blocks into which each
potential requested-size block has been split. I.e. if all free space
were compacted then there would be B free blocks of the requested size
where
B = info->free_pages/requested
and thus
N = info->free_blocks_total/B
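
As a rough illustration of the approximation above, the following C sketch computes B, N and F ~ 1 - 1/N in floating point. The function name approx_frag_index() and its parameters are hypothetical and exist only for this illustration; it is not the kernel code, and it assumes free_pages is non-zero.

```c
/*
 * Hypothetical helper, not kernel code: evaluates the approximation
 * F ~ 1 - 1/N from the definitions of B and N given in the text.
 * Assumes free_pages > 0.
 */
double approx_frag_index(unsigned long free_pages,
                         unsigned long free_blocks_total,
                         unsigned int order)
{
    /* Size of the requested block, in pages. */
    double requested = (double)(1UL << order);

    /* B: blocks of the requested size if all free space were compacted. */
    double b = (double)free_pages / requested;

    /* N: free blocks per potential requested-size block. */
    double n = (double)free_blocks_total / b;

    return 1.0 - 1.0 / n;
}
```

With the zone shown earlier (1983 free pages, all in order-0 blocks), an order-1 request gives B = 991.5 and N = 2, so the sketch returns 0.5; an order-2 request gives N = 4 and 0.75.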
The case of least fragmentation must be when all of the requested-size
blocks have been split just once to form twice as many blocks in the
next lowest free list. Thus the lowest value of N is 2 and the lowest
value of F is 0.5. I readied a patch that, in essence, defined
F = 1 - 2/N and thereby set the bounds of __fragmentation_index() as
0 <= F < 1. Before sending it, I realised that, during testing, I *had* seen
the occasional instance of F < 0.5, e.g. F = 0.499. Revisiting the
calculation, I see that the actual implementation is
F = 1 - [1/N + 1/info->free_blocks_total]
meaning that a very severe shortage of free memory *could* tip the
balance in favour of "low fragmentation". Although this seems highly
unlikely to occur outside testing, it does reflect the directive in the
comment above the function, i.e. favour page reclaim when fragmentation
is low. My second question: is the current implementation of F
intentional and, if not, what is the actual intent?
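
To make the effect of the extra 1/info->free_blocks_total term concrete, here is a minimal userspace sketch of F = 1 - [1/N + 1/info->free_blocks_total] in integer arithmetic. It is an assumption-laden model, not the kernel function: struct frag_info and frag_index_sketch() are hypothetical names, the scaling by 1000 is inferred from the three decimal places in extfrag_index, and the -1000 return for a request that could already be satisfied is modelled on the -1.000 shown for order 0.

```c
#include <stdint.h>

/* Hypothetical stand-in for the fields the calculation reads. */
struct frag_info {
    uint64_t free_pages;           /* total free pages in the zone */
    uint64_t free_blocks_total;    /* free blocks of any order */
    uint64_t free_blocks_suitable; /* free blocks >= requested order */
};

/*
 * Sketch of F = 1 - [1/N + 1/free_blocks_total], scaled by 1000
 * (assumed scaling). Rearranged to avoid fractions:
 *   F = 1 - (1 + free_pages/requested) / free_blocks_total
 */
int frag_index_sketch(unsigned int order, const struct frag_info *info)
{
    uint64_t requested = UINT64_C(1) << order;

    /* No free memory at all: nothing to measure. */
    if (!info->free_blocks_total)
        return 0;

    /* Assumed: the index is only meaningful when the request would fail. */
    if (info->free_blocks_suitable)
        return -1000;

    return 1000 - (int)((1000 + info->free_pages * 1000 / requested)
                        / info->free_blocks_total);
}
```

Under these assumptions, 1983 free order-0 pages give 500, 750, 875 and 937 for orders 1 to 4, in line with the extfrag_index output above. With 999 free order-0 pages the order-1 result drops to 499, and with only 3 free pages it is 167, which shows how a shortage of free blocks pulls the value below 0.5.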
The comments in compaction_suitable() suggest that the compaction/page
reclaim decision is one of cost but, as compaction is linear, this isn't
what __fragmentation_index() is calculating. A more reasonable argument
is that there's generally some lower limit on the fragmentation
achievable through compaction, given the inevitable presence of
non-migratable pages. Is there anything else going on?
Robert Harris