RE: [PATCH] dma-buf/heaps: c9e8440eca61 staging: ion: Fix overflow and list bugs in system heap:

From: Jaewon Kim
Date: Tue Mar 28 2023 - 23:13:13 EST


>On Tue, Mar 28, 2023 at 5:58 AM Jaewon Kim <jaewon31.kim@xxxxxxxxxxx> wrote:
>>
>> Normal free:212600kB min:7664kB low:57100kB high:106536kB
>> reserved_highatomic:4096KB active_anon:276kB inactive_anon:180kB
>> active_file:1200kB inactive_file:0kB unevictable:2932kB
>> writepending:0kB present:4109312kB managed:3689488kB mlocked:2932kB
>> pagetables:13600kB bounce:0kB free_pcp:0kB local_pcp:0kB
>> free_cma:200844kB
>> Out of memory and no killable processes...
>> Kernel panic - not syncing: System is deadlocked on memory
>>
>> An OoM panic was reported. There were only native processes, which are
>> non-killable because they run with OOM_SCORE_ADJ_MIN.
>>
>> After looking into the dump, I've found the dma-buf system heap was
>> trying to allocate a huge size. It seems to be a signed negative value.
>>
>> dma_heap_ioctl_allocate(inline)
>> | heap_allocation = 0xFFFFFFC02247BD38 -> (
>> | len = 0xFFFFFFFFE7225100,
>>
>> Actually the old ion system heap had a policy that did not allow such a
>> huge size, added by commit c9e8440eca61 ("staging: ion: Fix overflow and
>> list bugs in system heap"). We need this change again. A single
>> allocation should not be bigger than half of all memory.
>>
>> Signed-off-by: Jaewon Kim <jaewon31.kim@xxxxxxxxxxx>
>> ---
>> drivers/dma-buf/heaps/system_heap.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>> index e8bd10e60998..4c1ef2ecfb0f 100644
>> --- a/drivers/dma-buf/heaps/system_heap.c
>> +++ b/drivers/dma-buf/heaps/system_heap.c
>> @@ -351,6 +351,9 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>> struct page *page, *tmp_page;
>> int i, ret = -ENOMEM;
>>
>> + if (len / PAGE_SIZE > totalram_pages() / 2)
>> + return ERR_PTR(-ENOMEM);
>> +
>
>Instead of a policy like that, would __GFP_RETRY_MAYFAIL on the system
>heap's LOW_ORDER_GFP flags also avoid the panic, and eventually fail
>the allocation request?

Hello T.J.

Thank you for your opinion.
Adding __GFP_RETRY_MAYFAIL to LOW_ORDER_GFP seems to work.

page allocation failure: order:0, mode:0x144dc2(GFP_HIGHUSER|__GFP_RETRY_MAYFAIL|__GFP_COMP|__GFP_ZERO)
Node 0 active_anon:120kB inactive_anon:43012kB active_file:36kB inactive_file:788kB

I tested it, and the allocation failed once the file cache was nearly exhausted,
without an OoM panic, as we expected. The phone froze for a few seconds, though.

We can avoid the OoM panic with either the totalram_pages() / 2 check or __GFP_RETRY_MAYFAIL.

But I think we still need the totalram_pages() / 2 check so that users don't have to suffer
the freeze. Without the check, some critical processes or users' recent apps may also get killed.
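
For illustration, here is a minimal user-space sketch of how the proposed check
treats the len value from the dump. The 4096-byte page size, the helper names,
and the page count used in the note below (922372 pages, derived from the
managed:3689488kB line above) are assumptions made only for this example; in the
kernel the check compares against totalram_pages() directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch only: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper mirroring the proposed kernel check:
 *     if (len / PAGE_SIZE > totalram_pages() / 2)
 *             return ERR_PTR(-ENOMEM);
 * total_pages stands in for totalram_pages().
 */
static bool sketch_len_too_big(uint64_t len, uint64_t total_pages)
{
	return len / SKETCH_PAGE_SIZE > total_pages / 2;
}

/* True when an unsigned length would be negative if read as signed 64-bit. */
static bool sketch_is_negative_as_signed(uint64_t len)
{
	return (int64_t)len < 0;
}
```

With len = 0xFFFFFFFFE7225100 and 922372 total pages, sketch_len_too_big()
returns true, so the request would be rejected up front instead of driving the
system into reclaim. A request of exactly half of memory still passes, since the
comparison is strictly greater-than.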

Regarding __GFP_RETRY_MAYFAIL, I think it will help us avoid the OoM panic. But I'm worried
about low-memory devices, which still need OoM kills to free memory, as in camera scenarios.

So what do you think?

Thank you
Jaewon Kim