On Wed, Dec 14, 2022 at 04:17:35PM -0800, Doug Berger wrote:
On 11/18/2022 9:05 AM, Mel Gorman wrote:
On Wed, Nov 02, 2022 at 03:33:53PM -0700, Doug Berger wrote:
I was not familiar with page_alloc.shuffle, but it may very well have a role
to play here.
It almost certainly does because unlike zones or CMA, it affects how
free lists are arranged. IIRC, the original purpose was about improving
performance of high-speed direct-mapped cache but it also serves a
purpose in this case -- randomising allocations between two channels.
It's still not perfect interleaving but better than none.
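
For reference, the mechanism is small: each time a page is returned to a
buddy free list, a bit of cached entropy decides whether it is queued at the
head or the tail, so the allocation order stops tracking the order pages were
freed (or laid out) in. A rough userspace-style sketch of that decision,
illustrative only (the real code is shuffle_pick_tail() in mm/shuffle.c):

#include <stdbool.h>
#include <stdlib.h>

/* Illustrative sketch only; see mm/shuffle.c for the real thing. */
static bool pick_tail(void)
{
	static unsigned long entropy;
	static int bits_left;

	if (bits_left == 0) {
		entropy = random();	/* the kernel caches get_random_u64() */
		bits_left = 31;
	}
	bits_left--;
	return (entropy >> bits_left) & 1;	/* true: queue at tail, false: head */
}

Combined with the boot-time shuffling of the top-order free lists, that is
what spreads a stream of allocations across the physical span (and hence the
channels) instead of draining one end first.
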
A major limitation of ZONE_MOVABLE is that there is no way of controlling
access from userspace to restrict the high-speed memory to a designated
application, only to all applications in general. The primary interface
to control access to memory with different characteristics is mempolicies
which is NUMA orientated, not zone orientated. So, if there is a special
application that requires exclusive access, it's very difficult to configure
based on zones. Furthermore, page table pages mapping data located in the
high-speed region are stored in the slower memory which potentially impacts
the performance if the working set of the application exceeds TLB reach.
Finally, while there is mention that Broadcom may have some special
interface to determine what applications can use the high-speed region,
it's hardware-specific as opposed to something that belongs in the core mm.
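
To make the mempolicy point concrete: the userspace interfaces all speak in
terms of nodes. Something like the following (illustrative only; link with
-lnuma) can restrict a process to node 1, but there is nothing equivalent
that can name a zone such as ZONE_MOVABLE or "the fast region":

#include <numaif.h>	/* set_mempolicy(), MPOL_BIND */
#include <stdio.h>

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* allow node 1 only */

	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
		perror("set_mempolicy");

	/* ... new allocations touched from here on come from node 1 ... */
	return 0;
}
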
I agree that keeping the high-speed memory in a local node and using "sticky"
pageblocks or CMA has limitations of its own but in itself, that does not
justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
multiple controllers with equal distance and bandwidth (if I'm reading it
correctly) yet place them in different zones... that's just a bit weird if
there are no other addressing limitations. It's not obvious why ARM would do
that, but it also does not matter because it shouldn't be a core mm concern.
There appears to be some confusion regarding my explanation of multiple
memory controllers on a device like the BCM7278. There is no inherent
performance difference between the two memory controllers and their attached
DRAM. They merely provide the opportunity to perform memory accesses in
parallel for different physical address ranges. The physical address ranges
were selected by the SoC designers for reasons only known to them, but I'm
sure they had no consideration of zones in their decision making. The
selection of zones remains an artifact of the design of Linux.
Ok, so the channels are equal but not interleaved in hardware, so basically
you are trying to implement software-based memory channel interleaving?
I suppose that could be a fair characterization of the objective, though the
approach taken here is very much a "poor man's" one that attempts to improve
things without the "heavy lifting" required for a more complete solution.
It's still unfortunate that this changes the concept of zones being
primarily about addressing or capability limitations.
It's also difficult to use as
any user of it has to be very aware of the memory channel configuration of
the machine and know how to match addresses to channels. Information from
zoneinfo on start_pfns, spanned ranges and the like becomes less useful. It's
relatively minor but splitting the zones also means there is a performance
hit during compaction because pageblock_pfn_to_page is more expensive.
What is of interest to Broadcom customers is to better distribute user space
accesses across the memory controllers to improve the bandwidth available to
user space dominated workflows. With no ZONE_MOVABLE, a BCM7278 SoC with 1GB
of memory on each memory controller ends up with the 1GB on the low address
memory controller in ZONE_DMA and the 1GB on the high address memory
controller in ZONE_NORMAL. With this layout, movable allocation requests will
only fall back to ZONE_DMA (low memory controller) once ZONE_NORMAL (high
memory controller) is sufficiently depleted of free memory.
Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current movablecore
behavior does not improve this situation other than forcing more kernel
allocations off of the high memory controller. User space allocations are
even more likely to be on the high memory controller.
But it's a weak promise that interleaving will happen. If only a portion
of ZONE_MOVABLE is used, it might still be all on the same channel. This
might improve over time if enough memory was used and the system was up
for long enough.
The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be
located on the low memory controller to make it easier for user space
allocations to land on the low memory controller. If ZONE_MOVABLE is only
placed on the low memory controller then user space allocations can land in
ZONE_NORMAL on the high memory controller, but only through fallback after
ZONE_MOVABLE is sufficiently depleted of free memory, which is just the
reverse of the existing situation. The Designated Movable Block mechanism
allows ZONE_MOVABLE memory to be located on each memory controller so that
user space allocations have equal access to each memory controller until the
ZONE_MOVABLE memory is depleted and fallback to other zones occurs.
To my knowledge Broadcom customers that are currently using the Designated
Movable Block mechanism are relying on the somewhat random starting and
stopping of parallel user space processes to produce a more random
distribution of ZONE_MOVABLE allocations across multiple memory controllers,
but the page_alloc.shuffle mechanism seems like it would be a good addition
to promote this randomness. Even better, it appears that page_alloc.shuffle
is already enabled in the GKI configuration.
The "random starting and stopping of parallel user space processes" is
required for the mechanism to work. It's arbitrary and unknown whether the
interleaving happens, whereas shuffle has an immediate, if random, impact.
You are of course correct that the access patterns make all of the
difference and it is almost certain that one memory controller or the other
will be saturated at any given time, but the intent is to increase the
opportunity to use more of the total bandwidth made available by the
multiple memory controllers.
And shuffle should also provide that opportunity except it's trivial
to configure and only requires the user to know the memory channels are
not interleaved.
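
For completeness, "trivial to configure" really is just a build option plus a
boot parameter, with no knowledge of the physical layout required beyond
knowing the channels are not interleaved:

  CONFIG_SHUFFLE_PAGE_ALLOCATOR=y   (build time)
  page_alloc.shuffle=1              (kernel command line)
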
I experimented with a Broadcom BCM7278 system with 1GB on each memory
controller (i.e. 2GB total memory), using a simple multi-threaded test
program (/tmp/thread_test in the results below). The buffers were made large
to render data caching meaningless and to require several pages to be
allocated to populate each buffer.
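
(thread_test itself was not posted; purely as an illustration, a minimal test
of this kind, with arbitrary thread count, buffer size and pass count, might
look like the following.)

/* Illustrative only: several threads each repeatedly writing a large buffer.
 * Build with: gcc -O2 -pthread thread_test_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define BUF_SIZE (256UL << 20)	/* large enough to defeat data caching */
#define PASSES   16

static void *worker(void *arg)
{
	char *buf = malloc(BUF_SIZE);

	(void)arg;
	if (!buf)
		return (void *)1;
	/* Each pass touches every page of the buffer. */
	for (int pass = 0; pass < PASSES; pass++)
		memset(buf, pass, BUF_SIZE);
	free(buf);
	return (void *)0;
}

int main(void)
{
	pthread_t t[NTHREADS];
	void *ret;

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++) {
		pthread_join(t[i], &ret);
		printf("Thread %d returns: %ld\n", i, (long)ret);
	}
	return 0;
}
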
With V3 of this patch set applied to a 6.1-rc1 kernel I observed these
results:
With no movablecore kernel parameter specified:
# time /tmp/thread_test
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0
Thread 4 returns: 0
real 0m4.047s
user 0m14.183s
sys 0m1.215s
With this additional kernel parameter "movablecore=600M":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0
real 0m4.068s
user 0m14.402s
sys 0m1.117s
With this additional kernel parameter "movablecore=600M@0x50000000":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0
real 0m4.010s
user 0m13.979s
sys 0m1.070s
However, with these additional kernel parameters
"movablecore=300M@0x60000000,300M@0x320000000 page_alloc.shuffle=1":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0
real 0m3.173s
user 0m11.175s
sys 0m1.067s
What were the results with just
"movablecore=300M@0x60000000,300M@0x320000000" on its own and
page_alloc.shuffle=1 on its own?
For shuffle on its own, my expectations are that the results will be
variable, sometimes good and sometimes bad, because it's at the mercy of
the randomisation. It might be slightly improved if the initial top-level
lists were optionally ordered "1, n, 2, n-1, 3, n-2" in __shuffle_zone, or if
shuffle_pick_tail was aware of the memory channels, but that would be a lot
more work to implement.
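
For what it's worth, the "1, n, 2, n-1, 3, n-2" ordering is just alternating
picks from the two ends of the range. A throwaway sketch of that ordering on
a plain array (not a patch; __shuffle_zone() would have to walk pageblocks
instead):

#include <stdio.h>

/* Emit in[] in the order 1, n, 2, n-1, 3, n-2, ... */
static void interleave_ends(int *out, const int *in, int n)
{
	int lo = 0, hi = n - 1, i = 0;

	while (lo <= hi) {
		out[i++] = in[lo++];
		if (lo <= hi)
			out[i++] = in[hi--];
	}
}

int main(void)
{
	int in[7] = { 1, 2, 3, 4, 5, 6, 7 }, out[7];

	interleave_ends(out, in, 7);
	for (int i = 0; i < 7; i++)
		printf("%d ", out[i]);	/* 1 7 2 6 3 5 4 */
	printf("\n");
	return 0;
}

If the two channels map to the low and high halves of the zone, consecutive
entries then come from alternating channels, which is the property being
described.
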