Re: [PATCH] x86/mm: Do not split_large_page() for set_kernel_text_rw()

From: Song Liu
Date: Mon Aug 26 2019 - 00:41:13 EST

Cc: Steven Rostedt and Suresh Siddha

Hi Peter,

> On Aug 23, 2019, at 2:36 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
>> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
>> split_large_page() for all kernel text pages. This means a single kprobe
>> will put all kernel text in 4k pages:
>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>> 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
>> root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
>> root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>> 0xffffffff81000000-0xffffffff82400000 20M ro x pte
>> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
>> in static_protections().
>> Two helper functions set_text_rw() and set_text_ro() are added to flip
>> _PAGE_RW bit for kernel text.
>> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
> ARGH; so this is because ftrace flips the whole kernel range to RW and
> back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
> violation.

Thanks for your comments. Yes, it is related to ftrace, as we have
CONFIG_KPROBES_ON_FTRACE. However, after digging around, I am not sure
what the expected behavior is.

The kernel text region has two mappings. For x86_64 with four-level
page tables, they are:

1. kernel identity mapping, from 0xffff888000100000;
2. kernel text mapping, from 0xffffffff81000000.
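
To make the aliasing concrete, here is a minimal user-space sketch of how
one kernel text address relates to its alias in the identity mapping. The
two base constants are the x86_64 four-level values (no KASLR); the helper
names are hypothetical, and phys_base is a parameter because the load
offset of the running kernel is an assumption, not something taken from
the report above.

```c
#include <assert.h>
#include <stdint.h>

/* x86_64, four-level paging, KASLR disabled. */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* __START_KERNEL_map */
#define DIRECT_MAP_BASE  0xffff888000000000ULL /* PAGE_OFFSET */

/*
 * Physical address behind a kernel text virtual address.
 * phys_base is the kernel's load offset (assumption: supplied by caller).
 */
static uint64_t text_virt_to_phys(uint64_t vaddr, uint64_t phys_base)
{
	assert(vaddr >= START_KERNEL_MAP);
	return vaddr - START_KERNEL_MAP + phys_base;
}

/* Address of the same physical page in the identity (direct) mapping. */
static uint64_t phys_to_identity_virt(uint64_t paddr)
{
	return paddr + DIRECT_MAP_BASE;
}
```

Both results refer to the same physical page, so a permission change
applied through one mapping does not by itself change the other.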

Per comments in arch/x86/mm/init_64.c:set_kernel_text_rw():

/*
 * Make the kernel identity mapping for text RW. Kernel text
 * mapping will always be RO. Refer to the comment in
 * static_protections() in pageattr.c
 */
set_memory_rw(start, (end - start) >> PAGE_SHIFT);

kprobe (with CONFIG_KPROBES_ON_FTRACE) should work on kernel identity
mapping.

However, my experiment shows that kprobe actually operates on the
kernel text mapping (0xffffffff81000000-). It is the same w/ and w/o
CONFIG_KPROBES_ON_FTRACE. Therefore, I am not sure whether the comment
is outdated (it is 10 years old), or kprobe is doing something wrong.

Here is more information about the issue we are looking at.

We found that with a 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/
CONFIG_KPROBES_ON_FTRACE), a single kprobe will split _all_ PMDs in
the kernel text mapping into pte-mapped pages. This increases iTLB
miss rate from about 300 per million instructions to about 700 per
million instructions (for the application I test with).
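
For a rough sense of scale, the sketch below counts how many mappings are
needed to cover the 20M range from the page_tables dump in the patch
description, first with 2M PMD entries and then with 4k PTEs. It uses the
same (end - start) >> shift arithmetic as the set_memory_rw() call.
mappings_needed() is a hypothetical helper, and this is only a count of
page-table entries, not a model of the iTLB itself.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4k base pages */
#define PMD_SHIFT  21 /* 2M PMD-mapped pages */

/* Number of mappings of size (1 << shift) needed to cover [start, end). */
static uint64_t mappings_needed(uint64_t start, uint64_t end, unsigned int shift)
{
	assert(end >= start);
	return (end - start) >> shift;
}
```

For 0xffffffff81000000-0xffffffff82400000 this gives 10 entries at
PMD_SHIFT and 5120 entries at PAGE_SHIFT, so the same text needs far more
translations to stay resident after the split.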

Per bisect, we found this behavior happens after commit 585948f4f695
("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I
proposed this PATCH to fix/work around this issue. However, per
Peter's comment and my study of the code, this doesn't seem to be the
real problem, or at least not the only one here.

I also tested that the PMD split issue doesn't happen w/o

In summary, I have the following questions:

1. Which mapping should kprobe work on? Kernel identity mapping or
kernel text mapping?
2. FTRACE causes split of PMD mapped kernel text. How should we fix