Re: [PATCH] x86/mm: Do not split_large_page() for set_kernel_text_rw()
From: Song Liu
Date: Mon Aug 26 2019 - 00:41:13 EST
Cc: Steven Rostedt and Suresh Siddha
Hi Peter,
> On Aug 23, 2019, at 2:36 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
>> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
>> split_large_page() for all kernel text pages. This means a single kprobe
>> will put all kernel text in 4k pages:
>>
>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>> 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
>>
>> root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
>> root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
>>
>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>> 0xffffffff81000000-0xffffffff82400000 20M ro x pte
>>
>> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
>> in static_protections().
>>
>> Two helper functions set_text_rw() and set_text_ro() are added to flip
>> _PAGE_RW bit for kernel text.
>>
>> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
>
> ARGH; so this is because ftrace flips the whole kernel range to RW and
> back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
> violation.
Thanks for your comments. Yes, it is related to ftrace, as we have
CONFIG_KPROBES_ON_FTRACE enabled. However, after digging around, I am
not sure what the expected behavior is.
The kernel text region has two mappings. For x86_64 with four-level
page tables, they are:
1. kernel identity mapping, from 0xffff888000100000;
2. kernel text mapping, from 0xffffffff81000000.
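To make the relationship between the two mappings concrete, here is a
rough user-space sketch (not kernel code) of how an address in the kernel
text mapping aliases into the identity mapping. It assumes the default
x86_64 four-level layout with no KASLR and phys_base == 0; the macro and
function names are made up for illustration, and the exact alias depends
on where the kernel is actually loaded.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults: x86_64, 4-level paging, no KASLR, phys_base == 0. */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* base of kernel text mapping */
#define DIRECT_MAP_BASE  0xffff888000000000ULL /* base of identity mapping */

/* Physical address backing an address in the kernel text mapping. */
static uint64_t text_to_phys(uint64_t vaddr)
{
	assert(vaddr >= START_KERNEL_MAP);
	return vaddr - START_KERNEL_MAP;
}

/* Alias of the same physical page in the identity mapping. */
static uint64_t text_to_identity(uint64_t vaddr)
{
	return DIRECT_MAP_BASE + text_to_phys(vaddr);
}
```

Under these assumptions, text_to_identity(0xffffffff81000000) gives
0xffff888001000000, i.e. both virtual addresses reach the same physical
page through different page table entries with independent permissions.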
Per the comment in arch/x86/mm/init_64.c:set_kernel_text_rw():
	/*
	 * Make the kernel identity mapping for text RW. Kernel text
	 * mapping will always be RO. Refer to the comment in
	 * static_protections() in pageattr.c
	 */
	set_memory_rw(start, (end - start) >> PAGE_SHIFT);
So kprobe (with CONFIG_KPROBES_ON_FTRACE) should operate on the kernel
identity mapping.
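For reference, the second argument to set_memory_rw() is a count of 4k
pages. A minimal user-space sketch of that arithmetic follows; the helper
name is hypothetical, and using the 20M range from the page_tables dump
as start/end is only a stand-in, since the real bounds come from kernel
symbols.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4k base pages on x86_64 */

/* Number of 4k pages in [start, end), as passed to set_memory_rw(). */
static uint64_t range_to_numpages(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return (end - start) >> PAGE_SHIFT;
}
```

For 0xffffffff81000000-0xffffffff82400000 this yields 5120 pages, so one
call asks cpa to change attributes on every 4k page of that range.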
However, my experiment shows that kprobe actually operates on the
kernel text mapping (0xffffffff81000000-). It is the same with and
without CONFIG_KPROBES_ON_FTRACE. Therefore, I am not sure whether the
comment is outdated (it is 10 years old), or kprobe is doing something
wrong.
Here is more information about the issue we are looking at.
We found that with the 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/
CONFIG_KPROBES_ON_FTRACE), a single kprobe will split _all_ PMDs in
the kernel text mapping into pte-mapped pages. This increases the iTLB
miss rate from about 300 per million instructions to about 700 per
million instructions (for the application I tested with).
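To put rough numbers on this, here is a small user-space sketch of the
arithmetic. It assumes the 20M text range from the page_tables dump and
the standard x86_64 page sizes (2M PMD, 4k PTE); the helper names are
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SHIFT_2M 21 /* PMD-mapped large page */
#define SHIFT_4K 12 /* pte-mapped page */

/* Translations needed to cover `bytes` of text at a given page size. */
static uint64_t entries_needed(uint64_t bytes, unsigned int shift)
{
	return bytes >> shift;
}

/* Extra iTLB misses per million instructions after the split. */
static unsigned int extra_misses(unsigned int before, unsigned int after)
{
	assert(after >= before);
	return after - before;
}
```

With 20M of text, that is 10 translations at 2M versus 5120 at 4k, and
the measured rates above work out to about 400 extra misses per million
instructions, roughly 2.3x the original rate.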
Per bisect, this behavior starts with commit 585948f4f695
("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I
proposed this PATCH to fix/work around the issue. However, per
Peter's comment and my study of the code, this doesn't seem to be the
real problem, or at least not the only one.
I also confirmed that the PMD split doesn't happen w/o
CONFIG_KPROBES_ON_FTRACE.
In summary, I have the following questions:
1. Which mapping should kprobe work on? Kernel identity mapping or
kernel text mapping?
2. ftrace causes PMD-mapped kernel text to be split into 4k pages. How
should we fix this?
Thanks,
Song