Re: [GIT PULL] ftrace: Check if pages were allocated before calling free_pages()

From: Steven Rostedt
Date: Wed Mar 31 2021 - 14:52:29 EST


On Wed, 31 Mar 2021 10:45:01 -0700
Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Wed, Mar 31, 2021 at 6:27 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > order = get_count_order(pg->size / ENTRIES_PER_PAGE);
> > - free_pages((unsigned long)pg->records, order);
> > + if (order >= 0)
> > + free_pages((unsigned long)pg->records, order);
>
> Honestly, looking at that code, every single use of
> "get_count_order()" seems really really confusing.
>
> And really confused.
>
> The comments are garbage, the code is odd, it's just all very very strange.
>
> See here in ftrace_allocate_records():
>
> pages = DIV_ROUND_UP(count, ENTRIES_PER_PAGE);
> order = get_count_order(pages);
>
> /*
> * We want to fill as much as possible. No more than a page
> * may be empty.
> */
> if (!is_power_of_2(pages))
> order--;
>
> can you actually explain what the logic is here?
>
> The 'get_count_order()' function will return the smallest order that
> the value fits in. But then if the value wasn't a power of two, you
> subtract 1. If I understand the code correctly, that's just the same
> as "what is the highest bit set", isn't it?
>
> So afaik, what you *actually* want is just
>
> pages = DIV_ROUND_UP(count, ENTRIES_PER_PAGE);
> order = fls(pages)-1;
>
> isn't it?

I'll have to look at this a bit more, but this could be a new improvement
of the logic. The last one was:

b40c6eabfcd40 ("ftrace: Simplify the calculation of page number for ftrace_page->records")

This happens once at boot up (and once for each module loaded) so I never
thought about making it efficient.

>
> Did I misunderstand what the code wants to do? The "No more than a
> page may be empty" seems wrong - you really mean "no empty pages".

Yeah, that could be worded better.

>
> I dunno. Maybe I'm wrong, but that code is really confusing (which is
> why I may be wrong). It doesn't seem to make any sense at all to use
> "get_count_order()" and then modify the end result.
>

The basic idea is that we are allocating a group of pages for 10s
of thousands of records (one for every function being traced). The way the
records are stored is by "page group" (pg). It's a link list of pages
filled with records. The records are sorted by instruction pointer of the
functions they represent. There are times we need to quickly search for a
record and this is done by iterating the link list of page groups, and then
doing a binary search of the records in the pages that are there.

The more records you store in a single group, the faster this algorithm is
(more binary searches, less iterating a link list). But since these records
are permanently in the memory, we also do not want to allocate more than
needed to store the records. Now because pages are only allocated in powers
of two (1, 2, 4, 8, 16, ...) If it takes 5 pages to store all the records,
we don't want to allocate 8 pages for it. Instead we make a link list where
the first entry has 4 pages the second has 1.

In the dmesg on boot up, you'll see something like this:

ftrace: allocating 42979 entries in 168 pages
ftrace: allocated 168 pages with 3 groups


That shows that the link list is size of 3, and it used 168 pages to
store the 42,979 records.

What the logic above does is to figure out how many pages are needed to
store all the records. If the value is not a power of two (say 5 pages),
where the order would give use 8 pages. We subtract one from the order (to
give us 4 pages), and store what we can there. The next time around we only
allocate 1 page to store the rest of the records.

Your fls() trick might work too (have to gawk at it more). And I should fix
the comments. But any work on that would be for the next merge window, and
doesn't affect that this patch fixes a possible issue.

-- Steve