Re: [PATCH] mm: sort freelist by rank number

From: Cho KyongHo
Date: Tue Aug 04 2020 - 06:28:35 EST


On Tue, Aug 04, 2020 at 11:12:55AM +0200, Vlastimil Babka wrote:
> On 8/4/20 4:35 AM, Cho KyongHo wrote:
> > On Mon, Aug 03, 2020 at 05:45:55PM +0200, Vlastimil Babka wrote:
> >> On 8/3/20 9:57 AM, David Hildenbrand wrote:
> >> > On 03.08.20 08:10, pullip.cho@xxxxxxxxxxx wrote:
> >> >> From: Cho KyongHo <pullip.cho@xxxxxxxxxxx>
> >> >>
> >> >> LPDDR5 introduces rank switch delay. If three successive DRAM accesses
> >> >> happens and the first and the second ones access one rank and the last
> >> >> access happens on the other rank, the latency of the last access will
> >> >> be longer than the second one.
> >> >> To address this panelty, we can sort the freelist so that a specific
> >> >> rank is allocated prior to another rank. We expect the page allocator
> >> >> can allocate the pages from the same rank successively with this
> >> >> change. It will hopefully improves the proportion of the consecutive
> >> >> memory accesses to the same rank.
> >> >
> >> > This certainly needs performance numbers to justify ... and I am sorry,
> >> > "hopefully improves" is not a valid justification :)
> >> >
> >> > I can imagine that this works well initially, when there hasn't been a
> >> > lot of memory fragmentation going on. But quickly after your system is
> >> > under stress, I doubt this will be very useful. Proof me wrong. ;)
> >>
> >> Agreed. The implementation of __preferred_rank() seems to be very simple and
> >> optimistic.
> >
> > DRAM rank is selected by CS bits from DRAM controllers. In the most systems
> > CS bits are alloated to specific bit fields in BUS address. For example,
> > If CS bit is allocated to bit[16] in bus (physical) address in two rank
> > system, all 16KiB with bit[16] = 1 are in the rank 1 and the others are
> > in the rank 0.
> > This patch is not beneficial to other system than the mobile devices
> > with LPDDR5. That is why the default behavior of this patch is noop.
>
> Hmm, the patch requires at least pageblock_nr_pages, which is 2MB on x86 (dunno
> about ARM), so 16KiB would be way too small. What are the actual granularities then?

16KiB is just an example. pageblock_nr_pages is 4Mb on both ARM and
ARM64. __perferred_rank() works if rank granule >= 4MB.

> >> I think these systems could perhaps better behave as NUMA with (interleaved)
> >> nodes for each rank, then you immediately have all the mempolicies support etc
> >> to achieve what you need? Of course there's some cost as well, but not the costs
> >> of adding hacks to page allocator core?
> >
> > Thank you for the proposal. NUMA will be helpful to allocate pages from
> > a specific rank programmatically. I should consider NUMA if rank
> > affinity should be also required.
> > However, page allocation overhead by this policy (page migration and
> > reclamation ect.) will give the users worse responsiveness. The intend
> > of this patch is to reduce rank switch delay optimistically without
> > hurting page allocation speed.
>
> The problem is, without some control of page migration and reclaim, the simple
> preference approach will not work after some uptime, as David suggested. It will
> just mean that the preferred rank will be allocated first, then the
> non-preferred rank (Linux will fill all unused memory with page cache if
> possible), then reclaim will free memory from both ranks without any special
> care, and new allocations will thus come from both ranks.
>
In fact, I did't considered about NUMA in that way. I now need to check
NUMA if it gives us the same result with this patch.

Thank you again for your comments about NUMA :)
NUMA