Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault

From: Nadav Amit
Date: Fri Feb 22 2019 - 18:22:19 EST


> On Feb 22, 2019, at 3:02 PM, Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@xxxxxxxxxx> wrote:
>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>
>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@xxxxxxxxxx> wrote:
>>>>
>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>>>>
>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>>>> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>>>>
>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>>>> on a kernel access to user space.
>>>>>>>>>>
>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>>>>
>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>>>> it will break the expectations.
>>>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>>>> in the loop?
>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>>>> and shows up in their perf report.
>>>>>>>>>
>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>>>>
>>>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>>>> I don't think it's a good idea.
>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>>>>
>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you donât separately know whether itâs a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd. I donât know whether Iâll win this fight, but the uapi will probably have to change for at least s390x.
>>>>>>>
>>>>>>> What to do about existing scripts is a different question.
>>>>>>
>>>>>> This lack of logical separation between user and kernel addresses
>>>>>> might interact interestingly with the text_poke series, specifically
>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd03df2db76624da8eb2008d69919e41a%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864733821233906&amp;sdata=ky5iTrsCceoPwVW5N9FB4sDspwGEQ8MTlRE4b1Bqn54%3D&amp;reserved=0)
>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd03df2db76624da8eb2008d69919e41a%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864733821233906&amp;sdata=EJs8doLrfFfMTKiVHmWjmpnpWolmuuZ5pxOmEMcI0ew%3D&amp;reserved=0),
>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>>>> task that has switched to the patching mm, could they use
>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>>>> overwrite kernel text that is mapped writable in the patching mm?
>>>>>
>>>>> Yes, this is a good point. I guess text_poke() should be defined with
>>>>> â__kprobesâ and open-code memcpy.
>>>>>
>>>>> Does it sound reasonable?
>>>>
>>>> Doesn't __text_poke() as implemented in the proposed patch use a
>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>>>> pte_clear() (which can be a call into a separate function on paravirt
>>>> kernels)?
>>>
>>> I will move the pte_clear() to be done after the poking mm was unloaded.
>>> Give me a few minutes to send a sketch of what I think should be done.
>>
>> Err.. You are right, I donât see an easy way of preventing a kprobe from
>> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
>>
>> The reasonable solution seems to me as taking all the relevant pieces of
>> code (and data) that might be used during text-poking and encapsulating them, so they
>> will be set in a memory area which cannot be kprobe'd. This can also be
>> useful to write-protect data structures of code that calls text_poke(),
>> e.g., static-keys. It can also protect data on that stack that is used
>> during text_poke() from being overwritten from another core.
>>
>> This solution is somewhat similar to Igor Stoppaâs idea of using âenclavesâ
>> when doing write-rarely operations.
>>
>> Right now, I think that text_poke() will keep being susceptible to such
>> an attack, unless you have a better suggestion.
>
> A relatively simple approach might be to teach BPF not to run kprobe
> programs and such in contexts where current->mm isn't the active mm?
> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> perf_callchain_user() also already uses that. Except that a lot of
> this code is x86-specificâ

I keep having in mind how to reduce the TCB that is used while text_poke()
is running, but for the specific issue here, I think your approach would
be fine, and trace_call_bpf() can be modified to do what you ask for.

But, I am not sure that relying on current->mm gets us any more security,
relatively to having another unrelated explicit kprobe-disable indication,
which is cleaner from design point-of-view.

I can see how we get âsome more securityâ if our decision whether kprobes
should be enabled was purely based on some hardware register (e.g., CR3) and
we could unequivocally realize whether kprobes eBPF should be on/off without
memory accesses (e.g., PCID bit set). Yet, I am not sure it worth it.

What do you say?