Re: erofs pointer corruption and kernel crash
From: Arseniy Krasnov
Date: Mon Apr 27 2026 - 10:52:11 EST
26.04.2026 14:42, Arseniy Krasnov wrote:
> 25.04.2026 18:29, Gao Xiang wrote:
>> Hi Arseniy,
>>
>> On 2026/4/13 15:20, Arseniy Krasnov wrote:
>>>
>>> 13.04.2026 10:08, Gao Xiang wrote:
>>>>
>>>> On 2026/4/11 23:10, Arseniy Krasnov wrote:
>>>>>
>>>>> 10.04.2026 18:41, Gao Xiang wrote:
>>>>>> Hi Arseniy,
>>>>>>
>>>>>> On 2026/4/10 21:27, Arseniy Krasnov wrote:
>>>>>>>
>>>>>>> 10.04.2026 15:20, Gao Xiang wrote:
>>>>>>>>
>>>>>>>> On 2026/4/10 19:37, Arseniy Krasnov wrote:
>>>>>>>>
>>>>>>>> (drop unrelated folks since they are all subscribed to the erofs mailing list)
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 10.04.2026 11:31, Gao Xiang wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 2026/4/10 16:13, Arseniy Krasnov wrote:
>>>> ...
>>>>
>>>>>>>>>> I need more information to find some clues.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So reproduced again with this debug patch which adds magic to 'struct z_erofs_pcluster' and prints 'struct folio'
>>>>>>>>> when pointer in 'private' is passed to 'erofs_onlinefolio_end()'. In short - 'private' points to 'struct z_erofs_pcluster'.
>>>>>>>> First, erofs-utils 1.8.10 doesn't support `-E48bit`:
>>>>>>>> only erofs-utils 1.9+ ship it as an experimental
>>>>>>>> feature, see Changelog; so I think you're using
>>>>>>>> modified erofs-utils 1.8.10:
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/ChangeLog
>>>>>>>>
>>>>>>>> ```
>>>>>>>> erofs-utils 1.9
>>>>>>>>
>>>>>>>> * This release includes the following updates:
>>>>>>>> - Add 48-bit layout support for larger filesystems (EXPERIMENTAL);
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Second, I'm pretty sure this issue is related to
>>>>>>>> experimental `-E48bit`, and that information is
>>>>>>>> not enough for me to find the root cause, so I
>>>>>>>> need to find a way to reproduce it myself. It may
>>>>>>>> take time; you could debug it yourself, but I don't
>>>>>>>> think it's an easy task if you aren't quite familiar
>>>>>>>> with the EROFS codebase.
>>>>>>>>
>>>>>>>> Anyway I really suggest if you need a rush solution
>>>>>>>> for production, don't use `-E48bit + zstd` like
>>>>>>>> this for now: try to use other options like
>>>>>>>> `-zzstd -C65536 -Efragments` instead since those
>>>>>>>> are common production choices.
>>>>>>> Ok, thanks for this advice! One more question: currently we use these options:
>>>>>>> "zstd,22 --max-extent-bytes 65536 -E48bit". Ok, we remove "zstd,22" and "E48bit",
>>>>>>> but what about "--max-extent-bytes 65536": is it considered a stable option?
>>>>>>> Or is it better to use your version: "-zzstd -C65536 -Efragments"?
>>>>>> I'm not sure how you found this
>>>>>> "zstd,22 --max-extent-bytes 65536 -E48bit" combination.
>>>>>>
>>>>>> My suggestion, based on production use, is that as long as
>>>>>> you don't use `-zzstd` + `-E48bit`, it should be fine.
>>>>>>
>>>>>> If you need smaller images, I suggest: `-zlzma,9 -C65536 -Efragments`
>>>>>> Or like Android, they all use `-zlz4hc`,
>>>>>> Or zstd, but don't add `-E48bit`.
>>>>>>
>>>>>> As for "--max-extent-bytes 65536", it can be dropped
>>>>>> since if `-E48bit` is not used, it only has negative
>>>>>> impacts.
>>>>>>
>>>>>> In short, `-E48bit` + `-zzstd` + `--max-extent-bytes`
>>>>>> enables new unaligned compression for zstd, but it's
>>>>>> a relatively new feature. I still need some time to
>>>>>> stabilize it, but my own time is limited and all things
>>>>>> are always prioritized.
>>>>> Ok, thanks for this advice!
>>>> FYI, I can reproduce this issue locally with `-E48bit`
>>>> on in 600s.
>>>>
>>>> I do think it's a `-E48bit` + zstd issue so
>>>> non-`-E48bit` won't be impacted and I will find time
>>>> to troubleshoot it this week.
>>> Yes, without '-E48bit' we also couldn't reproduce it over an entire weekend on several boards. No such panics.
>> Can you check if the following informal patch resolves
>> this issue? I've checked it locally:
>>
>> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
>> index 8a0b15511931..824ffe4b871c 100644
>> --- a/fs/erofs/zdata.c
>> +++ b/fs/erofs/zdata.c
>> @@ -1509,12 +1509,6 @@ static void z_erofs_fill_bio_vec(struct bio_vec *bvec,
>> DBG_BUGON(z_erofs_is_shortlived_page(bvec->bv_page));
>>
>> folio = page_folio(zbv.page);
>> - /* For preallocated managed folios, add them to page cache here */
>> - if (folio->private == Z_EROFS_PREALLOCATED_FOLIO) {
>> - tocache = true;
>> - goto out_tocache;
>> - }
>> -
>> mapping = READ_ONCE(folio->mapping);
>> /*
>> * File-backed folios for inplace I/Os are all locked steady,
>> @@ -1527,6 +1521,12 @@ static void z_erofs_fill_bio_vec(struct bio_vec *bvec,
>> return;
>> }
>>
>> + if (cmpxchg(&folio->private, Z_EROFS_PREALLOCATED_FOLIO, NULL) ==
>> + Z_EROFS_PREALLOCATED_FOLIO) {
>> + tocache = true;
>> + goto out_tocache;
>> + }
>> +
>> folio_lock(folio);
>> if (likely(folio->mapping == mc)) {
>> /*
>> @@ -1546,14 +1546,8 @@ static void z_erofs_fill_bio_vec(struct bio_vec *bvec,
>> }
>> return;
>> }
>> - /*
>> - * Already linked with another pcluster, which only appears in
>> - * crafted images by fuzzers for now. But handle this anyway.
>> - */
>> - tocache = false; /* use temporary short-lived pages */
>> } else {
>> - DBG_BUGON(1); /* referenced managed folios can't be truncated */
>> - tocache = true;
>> + DBG_BUGON(1); /* referenced managed folios can't be truncated */
>> }
>> folio_unlock(folio);
>> folio_put(folio);
>>
>>
>> I will send a formal patch with comments and a commit
>> message later.
>
> Hi, thanks! I'll test it!
Just tested this patch. Looks like the problem is fixed in my reproducer!
Thanks!
>
>
>> Thanks,
>> Gao Xiang
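
For readers following the thread: the informal patch above replaces a plain comparison of folio->private against Z_EROFS_PREALLOCATED_FOLIO with a cmpxchg() that clears the marker in the same step. The thread does not state the root cause, so the following is only a minimal userspace sketch of the general difference between "look at the marker" and "atomically claim the marker", using C11 atomics. All names here (fake_folio, FAKE_PREALLOCATED, peek_preallocated, claim_preallocated, demo) are hypothetical stand-ins and not kernel definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one field of struct folio that matters here. */
struct fake_folio {
	_Atomic(void *) private_data;
};

/* Hypothetical sentinel; any unique non-NULL address works for the sketch. */
static char prealloc_tag;
#define FAKE_PREALLOCATED ((void *)&prealloc_tag)

/* Plain check: every caller that looks before the marker is cleared sees true. */
static bool peek_preallocated(struct fake_folio *folio)
{
	return atomic_load(&folio->private_data) == FAKE_PREALLOCATED;
}

/*
 * Claim: compare-and-swap the marker to NULL. Only the caller whose swap
 * succeeds gets true; later callers find NULL and get false.
 */
static bool claim_preallocated(struct fake_folio *folio)
{
	void *expected = FAKE_PREALLOCATED;

	return atomic_compare_exchange_strong(&folio->private_data,
					      &expected, NULL);
}

/* Walks one marked folio through two plain checks and two claims. */
static void demo(void)
{
	struct fake_folio f;

	atomic_init(&f.private_data, FAKE_PREALLOCATED);

	assert(peek_preallocated(&f));   /* first look: marked */
	assert(peek_preallocated(&f));   /* second look: still marked */

	assert(claim_preallocated(&f));  /* first claim wins and clears it */
	assert(!claim_preallocated(&f)); /* second claim loses */
	assert(!peek_preallocated(&f));  /* marker is gone */
}
```

Calling demo() from a test program exercises both helpers. The point of the sketch is only that a compare-and-swap lets at most one path act on the marker, whereas a plain read can be observed as true by more than one path.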