Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
From: Huang, Ying
Date: Thu Jun 13 2024 - 04:43:28 EST
Chris Li <chrisl@xxxxxxxxxx> writes:
> On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Chris Li <chrisl@xxxxxxxxxx> writes:
>> >>
>> >
>> >> > On the page allocation side, we have hugetlbfs, which reserves
>> >> > some memory for high-order pages.
>> >> > We should have something similar that reserves some high-order swap
>> >> > entries so they don't get polluted by low-order ones.
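>> >> >
>> >> > As a rough sketch of the shape I mean (a userspace toy, not a
>> >> > patch; NR_RESERVED is a made-up tunable, analogous in spirit to
>> >> > hugetlbfs nr_hugepages): clusters in the reserved range are simply
>> >> > never handed to order-0 allocations.
>> >> >
>> >> > #include <stdio.h>
>> >> >
>> >> > #define NR_CLUSTERS 8
>> >> > #define NR_RESERVED 2 /* clusters set aside for high order */
>> >> >
>> >> > /* order served by each cluster, -1 = unused */
>> >> > static int cluster_order[NR_CLUSTERS] =
>> >> > 	{ -1, -1, -1, -1, -1, -1, -1, -1 };
>> >> >
>> >> > static int alloc_cluster(int order)
>> >> > {
>> >> > 	/* order 0 must skip the reserved range [0, NR_RESERVED) */
>> >> > 	int i = order ? 0 : NR_RESERVED;
>> >> >
>> >> > 	for (; i < NR_CLUSTERS; i++) {
>> >> > 		if (cluster_order[i] < 0) {
>> >> > 			cluster_order[i] = order;
>> >> > 			return i;
>> >> > 		}
>> >> > 	}
>> >> > 	return -1; /* caller falls back, as today */
>> >> > }
>> >> >
>> >> > int main(void)
>> >> > {
>> >> > 	int n0 = 0;
>> >> >
>> >> > 	/* order-0 pressure can exhaust only the unreserved clusters */
>> >> > 	while (alloc_cluster(0) >= 0)
>> >> > 		n0++;
>> >> > 	printf("order 0 took %d clusters\n", n0); /* 6 */
>> >> > 	printf("order 4 got cluster %d\n", alloc_cluster(4)); /* 0 */
>> >> > 	return 0;
>> >> > }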
>> >>
>> >> TBH, I don't like the idea of high-order swap entry reservation.
>> > May I know more about why you don't like the idea? I understand this
>> > can be controversial, because previously we treated THP as a
>> > best-effort approach: if for some reason we can't allocate a THP, we
>> > fall back to order 0.
>> >
>> > For discussion purposes, I want to break it down into smaller steps:
>> >
>> > First, can we agree that the following use case is reasonable?
>> > As Barry has shown, zsmalloc can compress blocks bigger than 4K with
>> > both a better compression ratio and a CPU performance gain.
>> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@xxxxxxxxx/
>> >
>> > So the goal is to give THP/mTHP a reasonable success rate under
>> > mixed-size swap allocation, even after low-order or high-order swap
>> > requests have overflowed the swap file size. The allocator should
>> > still be able to recover once some swap entries are freed.
>> >
>> > Please let me know if you think the above use case and goal are not
>> > reasonable for the kernel.
>>
>> I think that it's reasonable to improve the success rate of high-order
>
> Glad to hear that.
>
>> swap entry allocation. I just think that it's hard to use a
>> reservation-based method. For example, how much should be reserved?
>
> Understood, it is harder to use than a fully transparent method, but
> still better than no solution at all. The alternative right now is
> that we can't do it at all.
>
> Regarding how much we should reserve: similarly, how do you choose
> your swapfile size? If you choose N, why not N*120% or N*80%? That
> didn't stop us from having swapfiles, right?
>
>> Why would the system OOM when there's still swap space available?
>> And so forth.
>
> Keep in mind that the reservation is optional. If you prefer the old
> behavior, you don't have to use it. That shouldn't be a reason to stop
> others who want to use it. We don't have an alternative solution for
> long-running mixed-size allocation yet. If there is one, I'd like to
> hear it.
It's not enough to make it optional. When you run into an issue, you
need to debug it, and you may be debugging an issue on a system that
was configured by someone else.
>> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
>
> Me too. I prefer a transparent method over reservation if it can
> achieve the same goal. Do we have a fully transparent method specced
> out? How do we achieve full transparency while also avoiding the
> fragmentation caused by mixed-order allocation/free?
>
> Keep in mind that we are still in the early stages of mTHP swap
> development, so I can have the reservation patch ready relatively
> easily. If you later come up with a better transparent-method patch
> that achieves the same goal, we can use it instead.
Because we are still in the early stages, I think that we should try
to improve the transparent solution first. Personally, what I don't
want is for us to stop working on the transparent solution because we
already have the reservation solution.
>>
>> >> that's really important for you, I think that it's better to design
>> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> normal swap subsystem as much as possible.
>> >
>> > I brought up hugetlbfs just to make the point of using reservation,
>> > or isolation of the resource, to avoid the mixed-order fragmentation
>> > that exists in core mm.
>> > I am not suggesting copying the hugetlbfs implementation into the
>> > swap system. Unlike hugetlbfs, swap allocation is typically done by
>> > the kernel and is transparent to the application. I don't think
>> > separating this from the swap subsystem is a good way to go.
>> >
>> > This comes down to why you don't like the reservation. E.g., if we
>> > used two swapfiles, one of them allocated purely for high order,
>> > would that be better?
>>
>> Sorry, my words weren't accurate. Personally, I just think that it's
>> better to keep the reservation-related code from being too intrusive.
>
> Yes. I will try to make it not too intrusive.
>
>> And, before reservation, we need to consider something else first.
>> Is it generally good to swap in at the swap-out order? Should we
>
> When we have the reservation patch (or other means to sustain
> mixed-size swap allocation/free), we can test it out and get more data
> to reason about it.
> I consider the swap-in size policy an orthogonal issue.
No, I don't think so. If you swap out at a higher order but swap in at
a lower order, you fragment the swap clusters.
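
A toy userspace sketch of the effect (numbers made up; a real cluster
has 512 slots): an order-3 folio swaps out into one contiguous run,
gets swapped back in order-0 page by page while concurrent order-0
swap-outs reuse every other freed slot, and afterwards even an order-2
allocation fails in that cluster although 4 slots are free.

#include <stdio.h>

#define SLOTS 16

static char cluster[SLOTS]; /* 0 = free, 'H' = high order, 'L' = order 0 */

/* first-fit allocation of n contiguous slots, -1 if no run exists */
static int alloc_run(int n, char tag)
{
	for (int i = 0; i + n <= SLOTS; i++) {
		int j;

		for (j = 0; j < n; j++)
			if (cluster[i + j])
				break;
		if (j == n) {
			for (j = 0; j < n; j++)
				cluster[i + j] = tag;
			return i;
		}
	}
	return -1;
}

static void show(const char *when)
{
	printf("%-24s", when);
	for (int i = 0; i < SLOTS; i++)
		putchar(cluster[i] ? cluster[i] : '.');
	putchar('\n');
}

int main(void)
{
	/* an order-3 mTHP swaps out: 8 contiguous slots */
	int base = alloc_run(8, 'H');

	while (alloc_run(1, 'L') >= 0) /* order-0 swap-outs fill the rest */
		;
	show("cluster full:");

	/*
	 * Swap the mTHP back in one order-0 page at a time; every other
	 * freed slot is immediately reused by a concurrent order-0
	 * swap-out.
	 */
	for (int k = 0; k < 8; k++) {
		cluster[base + k] = 0;
		if (k % 2 == 0)
			cluster[base + k] = 'L';
	}
	show("after order-0 swap-in:");

	printf("order-2 swap-out now: %s\n",
	       alloc_run(4, 'H') < 0 ? "fails despite 4 free slots" : "ok");
	return 0;
}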
>> consider memory wastage too? One static policy doesn't fit all; we
>> may need either a dynamic policy or a configurable one.
>> In general, I think that we need to do this step by step.
>
> The core swap layer needs to be able to sustain mixed-size swap
> allocation/free in the long run. Without that, the swap-in size policy
> is meaningless.
>
> Yes, that is the step-by-step approach, with sustaining long-run
> mixed-size swap allocation as the first step.
>
>> >> >> > Do you see another way to protect the high-order clusters from
>> >> >> > being polluted by lower-order ones?
>> >> >>
>> >> >> If we use high-order page allocation as a reference, we would
>> >> >> eventually need something like compaction to guarantee high-order
>> >> >> allocation. But we are too far from that.
>> >> >
>> >> > We should consider reservation for high-order swap entry
>> >> > allocation, similar to what hugetlbfs does for memory.
>> >> > Swap compaction would be very complicated because it needs to scan
>> >> > the PTEs to migrate swap entries. It might be easier to support
>> >> > writing a folio out to compound, discontiguous swap entries. That
>> >> > is another way to address the fragmentation issue, though we are
>> >> > also too far from that right now.
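>> >> >
>> >> > To sketch the write-out side (a userspace toy, not the kernel I/O
>> >> > path, which would go through bios): each subpage is written at its
>> >> > own slot, so the allocator never needs a contiguous run.
>> >> >
>> >> > #include <fcntl.h>
>> >> > #include <stdio.h>
>> >> > #include <unistd.h>
>> >> >
>> >> > #define PAGE_SZ 4096
>> >> >
>> >> > int main(void)
>> >> > {
>> >> > 	char page[PAGE_SZ] = { 0 };
>> >> > 	long slots[4] = { 3, 9, 20, 21 }; /* discontiguous slots */
>> >> > 	int fd = open("swapfile.img", O_RDWR | O_CREAT, 0600);
>> >> >
>> >> > 	if (fd < 0)
>> >> > 		return 1;
>> >> > 	/* one I/O per subpage of a 4-page folio, at that slot's offset */
>> >> > 	for (int i = 0; i < 4; i++)
>> >> > 		if (pwrite(fd, page, PAGE_SZ, slots[i] * PAGE_SZ) != PAGE_SZ)
>> >> > 			perror("pwrite");
>> >> > 	close(fd);
>> >> > 	return 0;
>> >> > }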
>> >>
>> >> It's not easy to write out compound discontiguous swap entries
>> >> either. For example, how do we put such folios in the swap cache?
>> >
>> > I proposed the idea in the recent LSF/MM discussion; the last few
>> > slides cover discontiguous swap, including discontiguous entries in
>> > the swap cache.
>> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>> >
>> > Agreed, it is not an easy change. The swap cache would have to drop
>> > the assumption that all offsets are contiguous.
>> > For swap, we already have some in-memory data associated with each
>> > offset, so this might be an opportunity to combine the per-offset
>> > data structures for swap. Another alternative might be using the
>> > xarray without the multi-entry property, i.e., treating each offset
>> > as a single entry. I haven't dug deep into this direction yet.
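>> >
>> > A toy of that last idea (a plain array standing in for the xarray;
>> > not the real swap cache code): one cache entry per swap offset, all
>> > pointing at the same folio, with no contiguity assumed.
>> >
>> > #include <stdio.h>
>> >
>> > #define SWAP_SLOTS 32
>> >
>> > struct folio { int id; };
>> >
>> > /* stand-in for the swap cache: one pointer per swap offset */
>> > static struct folio *swap_cache[SWAP_SLOTS];
>> >
>> > static void add_to_cache(struct folio *f, const long *offs, int n)
>> > {
>> > 	for (int i = 0; i < n; i++)
>> > 		swap_cache[offs[i]] = f; /* one single entry per offset */
>> > }
>> >
>> > int main(void)
>> > {
>> > 	struct folio f = { .id = 1 };
>> > 	long offs[4] = { 3, 9, 20, 21 }; /* discontiguous slots */
>> >
>> > 	add_to_cache(&f, offs, 4);
>> > 	/* a lookup on any of the folio's offsets finds the same folio */
>> > 	for (int i = 0; i < 4; i++)
>> > 		printf("offset %ld -> folio %d\n", offs[i],
>> > 		       swap_cache[offs[i]]->id);
>> > 	return 0;
>> > }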
>>
>> Thanks! I will study your idea.
>>
>
> I am happy to discuss if you have any questions.
>
>> > We can have more discussion, maybe arrange an upstream alignment
>> > meeting if there is interest.
>>
>> Sure.
>
> Ideally, if we can resolve our differences over the mailing list,
> then we don't need to have a separate meeting :-)
>
--
Best Regards,
Huang, Ying