Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list

From: Huang, Ying
Date: Mon Sep 09 2024 - 03:22:58 EST


Chris Li <chrisl@xxxxxxxxxx> writes:

> On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > BTW, what is your take on my previous analysis that the current
>> > preference for writing new clusters can wear out the SSD faster?
>>
>> No. I don't agree with you on that. However, my knowledge of SSD
>> wear-out algorithms is quite limited.
>
> Hi Ying,
>
> Can you please clarify? You said you have limited knowledge of SSD
> wear internals. Does that mean you have low confidence in your
> verdict?

Yes.

> I would like to understand your reasoning for the disagreement,
> starting with which part of my analysis you disagree with.
>
> At the same time, we can consult someone who works in the SSD space
> and understand the SSD internal wearing better.

I think that is a good idea.

> I see this as a serious issue for using SSDs as swap in data center
> use cases. In your laptop use case, you are not running LLM training
> 24/7, right? So it still fits the usage model of an occasional user
> of the swap file, and it might not be as big a deal. In data center
> workloads, e.g. Google's, swap is written 24/7, and the amount of
> data swapped out is much higher than in typical laptop usage. There
> the SSD wear-out problem is much more severe, because the SSD is
> under constant write with much bigger swap usage.
>
> I am claiming that *some* SSDs would have a higher internal write
> amplification factor when doing random 4K writes all over the drive
> than when doing random 4K writes to a small area of the drive.
> I do believe that a swap-out policy option controlling whether old
> or new clusters are preferred would benefit the data center SSD swap
> use case.
> It comes down to:
> 1) SSDs are slow to erase, so most SSDs perform erases at a huge
> erase block size.
> 2) The SSD remaps logical block addresses to internal erase blocks.
> Most newly written data, regardless of its logical block address on
> the drive, is grouped together and written into an erase block.
> 3) When new data overwrites an old logical block address, the SSD
> firmware marks the overwritten data as obsolete. The discard command
> has a similar effect without introducing new data.
> 4) When the SSD runs out of new erase blocks, it needs to GC the old
> fragmented erase blocks and potentially rewrite old data to make
> room for new erase blocks. This is where the discard command can be
> beneficial: it tells the SSD firmware which parts of the old data
> the GC process can just ignore and skip rewriting.
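
Points 1) to 4) can be illustrated with a toy model. This is only a
sketch, assuming a page-mapped FTL with greedy garbage collection; the
geometry (64 blocks of 32 pages, 8 spare blocks) and the name
write_amplification() are made up for illustration and do not describe
any real firmware. It compares random page writes across the whole
logical space with random page writes confined to 1/16 of it.

```python
import random

def write_amplification(hot_fraction, blocks=64, pages=32, spare=8,
                        host_writes=20_000, seed=0):
    """Toy page-mapped FTL with greedy GC (hypothetical model).

    Returns flash page programs divided by host page writes, measured
    after the whole logical space has been written once sequentially.
    """
    rng = random.Random(seed)
    logical = (blocks - spare) * pages      # pages exposed to the host
    home = {}                               # logical page -> block with live copy
    live = [set() for _ in range(blocks)]   # live logical pages per block
    used = [0] * blocks                     # pages programmed since last erase
    free = list(range(1, blocks))           # erased blocks
    cur = 0                                 # block currently being filled
    programmed = 0                          # flash page programs so far

    def program(lpn):
        nonlocal cur, programmed
        if used[cur] == pages:              # current block full: open a new one
            cur = free.pop()
        if lpn in home:                     # old copy becomes obsolete (point 3)
            live[home[lpn]].discard(lpn)
        live[cur].add(lpn)                  # new data is grouped together (point 2)
        home[lpn] = cur
        used[cur] += 1
        programmed += 1

    def collect():
        # Greedy GC (point 4): pick the full block with the fewest live
        # pages, rewrite those pages elsewhere, then erase the block.
        full = [b for b in range(blocks) if b != cur and used[b] == pages]
        victim = min(full, key=lambda b: len(live[b]))
        for lpn in list(live[victim]):
            program(lpn)
        used[victim] = 0
        free.append(victim)

    for lpn in range(logical):              # fill the drive once
        program(lpn)

    programmed = 0                          # measure only the random phase
    hot = max(1, int(logical * hot_fraction))
    for _ in range(host_writes):
        program(rng.randrange(hot))
        while len(free) < 2:                # keep one block in reserve for GC
            collect()
    return programmed / host_writes

if __name__ == "__main__":
    print("whole drive:", write_amplification(1.0))
    print("1/16 of drive:", write_amplification(1 / 16))
```

In this model the confined pattern ends up with lower write
amplification, because the untouched blocks stay fully live and are
never picked as GC victims. Whether a given drive behaves this way
depends on its firmware.
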
>
> GC of obsolete logical blocks is a generally hard problem for SSDs.
>
> I am not claiming every SSD has this kind of behavior, but it is
> common enough to be worth providing an option.
>
>> > I think it might be useful to give users an option to write to
>> > the nonfull list first. The trade-off is that it is friendlier to
>> > SSD wear than preferring to write new blocks. If you keep doing
>> > the swap long enough, there will be no new free cluster anyway.
>>
>> It depends on workloads. Some workloads may demonstrate better spatial
>> locality.
>
> Yes, I agree that it may or may not happen depending on the
> workload. Randomly distributed swap entries are a common pattern we
> need to consider as well. The odds are against us. As in the quoted
> email where I did the calculation, the odds of getting a whole
> cluster free in the random model are very low, 4.4E-15, even if we
> are only using 1/16 of the swap entries in the swapfile.
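
A minimal sketch of that kind of estimate, assuming 512 swap entries
per cluster (an assumption, not stated above) and that each entry is
in use independently with probability 1/16:

```python
# Assumptions (hypothetical, for illustration only): 512 swap entries
# per cluster, each entry in use independently with probability 1/16.
ENTRIES_PER_CLUSTER = 512
USED_FRACTION = 1 / 16

def p_cluster_free(entries=ENTRIES_PER_CLUSTER, used=USED_FRACTION):
    """Probability that every entry in one cluster is free."""
    return (1 - used) ** entries

if __name__ == "__main__":
    print(f"{p_cluster_free():.2e}")
```

Under these assumptions the result is on the order of 1e-15. Real
workloads with spatial locality would not follow this independent
model.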

Do you have real workloads? For example, some trace?

--
Best Regards,
Huang, Ying