Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

From: Waiman Long
Date: Tue Oct 22 2019 - 14:00:49 EST


On 10/22/19 12:57 PM, Michal Hocko wrote:
> [Cc Mel]
>
> On Tue 22-10-19 12:21:56, Waiman Long wrote:
>> The pagetypeinfo_showfree_print() function prints out the number of
>> free blocks for each of the page orders and migrate types. The current
>> code just iterates the each of the free lists to get counts. There are
>> bug reports about hard lockup panics when reading the /proc/pagetyeinfo
>> file just because it look too long to iterate all the free lists within
>> a zone while holing the zone lock with irq disabled.
>>
>> Given the fact that /proc/pagetypeinfo is readable by all, the possiblity
>> of crashing a system by the simple act of reading /proc/pagetypeinfo
>> by any user is a security problem that needs to be addressed.
> Should we make the file 0400? It is a useful thing when debugging but
> not something regular users would really need for life.
>
I am not against doing that, but it may break existing applications that
somehow need to read pagetypeinfo. That is why I didn't try to advocate
about changing protection.


>> There is a free_area structure associated with each page order. There
>> is also a nr_free count within the free_area for all the different
>> migration types combined. Tracking the number of free list entries
>> for each migration type will probably add some overhead to the fast
>> paths like moving pages from one migration type to another which may
>> not be desirable.
> Have you tried to measure that overhead?

I haven't tried to measure the performance impact yet. I did thought
about tracking nr_free for each of the migration types within a
free_area. That will require auditing the code to make sure that all the
intra-free_area migrations are properly accounted for. I can work on it
if people prefer this alternative.


>
>> we can actually skip iterating the list of one of the migration types
>> and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
>> is usually the largest one on large memory systems, this is the one
>> to be skipped. Since the printing order is migration-type => order, we
>> will have to store the counts in an internal 2D array before printing
>> them out.
>>
>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>> zone lock for too long blocking out other zone lock waiters from being
>> run. This can be problematic for systems with large amount of memory.
>> So a check is added to temporarily release the lock and reschedule if
>> more than 64k of list entries have been iterated for each order. With
>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>> entries before releasing the lock.
> But you are still iterating through the whole free_list at once so if it
> gets really large then this is still possible. I think it would be
> preferable to use per migratetype nr_free if it doesn't cause any
> regressions.
>
Yes, it is still theoretically possible. I will take a further look at
having per-migrate type nr_free. BTW, there is one more place where the
free lists are being iterated with zone lock held - mark_free_pages().

Cheers,
Longman