Re: [PATCH v1] kernel/trace:check the val against the available mem

From: Michal Hocko
Date: Tue Apr 03 2018 - 12:11:28 EST


On Tue 03-04-18 10:17:53, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 15:56:07 +0200
> Michal Hocko <mhocko@xxxxxxxxxx> wrote:
[...]
> > I simply do not see the difference between the two. Both have the same
> > deadly effect in the end. The direct OOM has an arguable advantage that
> > the effect is immediate rather than subtle with potential performance
> > side effects until the machine OOMs after crawling for quite some time.
>
> The difference is whether the allocation succeeds or not. If it doesn't
> succeed, we free all memory that we tried to allocate. If it succeeds
> and causes issues, then yes, that's the admin's fault.

What I am trying to say is that this is so extremely timing and
workload sensitive that you can hardly get a stable behavior. Whether
the failure happens becomes a matter of pure luck.

> I'm worried about
> someone accidentally putting in too big a number, either an admin by
> mistake, or some stupid script that just assumes the current machine
> has terabytes of memory.

I would argue that stupid scripts have no business using root-only
interfaces which can allocate a lot of memory and cause OOMs.

> I'm under the assumption that if I request an allocation of 32 pages
> with RETRY_MAYFAIL, and there are 2 pages available but not 32, then
> while my allocation is reclaiming memory, another task can come in,
> ask for a single page, and still succeed. That is why I would use
> RETRY_MAYFAIL with higher page orders: so that the allocation doesn't
> take all the memory in the system if it fails. Is this assumption
> incorrect?

Yes. There is no guarantee that the allocation will get back the memory
it reclaimed during direct reclaim. Pages are simply freed back into the
pool and it is a matter of timing who gets them.

> The problem with the current approach of allocating 1 page at a time
> with RETRY_MAYFAIL is that it will succeed in grabbing whatever pages
> are available until there are none left, and if some unlucky task asks
> for memory during that window, its allocation is guaranteed to fail,
> triggering an OOM.
>
> I was thinking of doing something like:
>
>         large_pages = nr_pages / 32;
>         if (large_pages) {
>                 pages = alloc_pages_node(cpu_to_node(cpu),
>                                 GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
>                 if (pages)
>                         /* break up pages */
>                 else
>                         /* try to allocate with NORETRY */
>         }

You can do so, of course. In fact it would have some advantages over
single pages because it would fragment the memory less, but it is not a
reliable prevention against OOM killing and complete memory depletion
if you allow arbitrary trace buffer sizes.
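
Just to make that concrete, here is a rough sketch of how such a
batched allocation with a single page fallback could look. The names
rb_alloc_pages() and rb_add_page() are made up for the example;
rb_add_page() merely stands in for whatever per-page bookkeeping the
ring buffer does:

/*
 * Sketch only: allocate in order-5 (32 page) chunks with
 * __GFP_RETRY_MAYFAIL and fall back to single pages allocated with
 * __GFP_NORETRY, so a failing resize backs off early instead of
 * draining the last free pages.
 */
static int rb_alloc_pages(int cpu, long nr_pages)
{
        while (nr_pages > 0) {
                unsigned int order = nr_pages >= 32 ? 5 : 0;
                struct page *page;
                int i;

                page = alloc_pages_node(cpu_to_node(cpu),
                                        GFP_KERNEL | __GFP_RETRY_MAYFAIL,
                                        order);
                if (!page && order) {
                        /* fall back to single pages, giving up early */
                        order = 0;
                        page = alloc_pages_node(cpu_to_node(cpu),
                                                GFP_KERNEL | __GFP_NORETRY,
                                                order);
                }
                if (!page)
                        return -ENOMEM;

                /* break the higher order block into 0-order pages */
                split_page(page, order);
                for (i = 0; i < (1 << order); i++)
                        rb_add_page(cpu, page + i);

                nr_pages -= 1 << order;
        }
        return 0;
}

The __GFP_NORETRY fallback at least makes the tail of a too large
request bail out early rather than push reclaim all the way to the OOM
killer.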

> Now it will allocate memory in 32 page chunks using reclaim. If it
> fails to allocate them, it would not have taken up any smaller chunks
> that were available, leaving them for other users. It would then go
> back to single pages, allocating with RETRY. Or I could just say screw
> it, and make the allocation of the ring buffer always be 32 page chunks
> (or at least make it user defined).

Yes, a fallback is questionable. Whether to make the batch size
configurable is a matter of how many internal details you want to
expose to userspace.
--
Michal Hocko
SUSE Labs