Re: [RFC 0/3] Put vdso in ramfs-like filesystem (vdsofs)

From: Dmitry Safonov
Date: Fri Aug 26 2016 - 10:45:09 EST


2016-08-26 17:42 GMT+03:00 Dmitry Safonov <0x7f454c46@xxxxxxxxx>:
> 2016-08-26 17:32 GMT+03:00 Andy Lutomirski <luto@xxxxxxxxxxxxxx>:
>> On Fri, Aug 26, 2016 at 4:16 AM, Dmitry Safonov <0x7f454c46@xxxxxxxxx> wrote:
>>> 2016-08-26 2:00 GMT+03:00 H. Peter Anvin <hpa@xxxxxxxxx>:
>>>> On August 25, 2016 3:53:43 PM PDT, Dmitry Safonov <0x7f454c46@xxxxxxxxx> wrote:
>>>>>2016-08-25 23:49 GMT+03:00 H. Peter Anvin <hpa@xxxxxxxxx>:
>>>>>> On August 25, 2016 8:21:07 AM PDT, Dmitry Safonov
>>>>><dsafonov@xxxxxxxxxxxxx> wrote:
>>>>>>>This patches set is cleanly RFC and is not supposed to be applied.
>>>>>>>Also for RFC time it builds only on x86_64.
>>>>>>>
>>>>>>>So, in a mail thread Oleg told that it would be worth to introduce
>>>>>>>vm_file
>>>>>>>for vdso mappings as currently uprobes can not be placed on vDSO VMAs
>>>>>>>[1].
>>>>>>>In this patches set I introduce in-kernel filesystem for vdso files.
>>>>>>>After patches vDSO VMA now has inode and is just a private file
>>>>>>>mapping:
>>>>>>>7ffcc4b2b000-7ffcc4b2d000 r--p 00000000 00:00 0
>>>>>>> [vvar]
>>>>>>>7ffcc4b2d000-7ffcc4b2f000 r-xp 00000000 00:09 18
>>>>>>> [vdso]
>>>>>>>
>>>>>>>Then I introduce interface in uprobe_events to insert uprobes in
>>>>>vdso.
>>>>>>>FWIW:
>>>>>>> [~]# cd kernel/linux
>>>>>>> [linux]# readelf --syms arch/x86/entry/vdso/vdso64.so
>>>>>>>Symbol table '.dynsym' contains 11 entries:
>>>>>>> Num: Value Size Type Bind Vis Ndx Name
>>>>>>> 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
>>>>>>> 1: 0000000000000470 0 SECTION LOCAL DEFAULT 8
>>>>>>>2: 00000000000008d0 885 FUNC WEAK DEFAULT 12
>>>>>>>clock_gettime@@LINUX_2.6
>>>>>>>3: 0000000000000c50 472 FUNC GLOBAL DEFAULT 12
>>>>>>>__vdso_gettimeofday@@LINUX_2.6
>>>>>>>4: 0000000000000c50 472 FUNC WEAK DEFAULT 12
>>>>>>>gettimeofday@@LINUX_2.6
>>>>>>>5: 0000000000000e30 21 FUNC GLOBAL DEFAULT 12
>>>>>>>__vdso_time@@LINUX_2.6
>>>>>>> 6: 0000000000000e30 21 FUNC WEAK DEFAULT 12
>>>>>time@@LINUX_2.6
>>>>>>>7: 00000000000008d0 885 FUNC GLOBAL DEFAULT 12
>>>>>>>__vdso_clock_gettime@@LINUX_2.6
>>>>>>> 8: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6
>>>>>>>9: 0000000000000e50 41 FUNC GLOBAL DEFAULT 12
>>>>>>>__vdso_getcpu@@LINUX_2.6
>>>>>>>10: 0000000000000e50 41 FUNC WEAK DEFAULT 12
>>>>>>>getcpu@@LINUX_2.6
>>>>>>> [~]# cd /sys/kernel/debug/tracing/
>>>>>>> [tracing]# echo 'p:clock_gettime :vdso:/64:0x8d0' > uprobe_events
>>>>>>> [tracing]# echo 'p:gettimeofday :vdso:/64:0xc50' >> uprobe_events
>>>>>>> [tracing]# echo 'p:time :vdso:/64:0xe30' >> uprobe_events
>>>>>>> [tracing]# echo 1 > events/uprobes/enable
>>>>>>> [tracing]# su test # it has UID=1001
>>>>>>> [tracing]$ date
>>>>>>> Thu Aug 25 17:19:29 MSK 2016
>>>>>>> [tracing]$ exit
>>>>>>> [tracing]# cat trace
>>>>>>> # tracer: nop
>>>>>>> #
>>>>>>> # entries-in-buffer/entries-written: 175/175 #P:4
>>>>>>> #
>>>>>>> # _-----=> irqs-off
>>>>>>> # / _----=> need-resched
>>>>>>> # | / _---=> hardirq/softirq
>>>>>>> # || / _--=> preempt-depth
>>>>>>> # ||| / delay
>>>>>>> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
>>>>>>> # | | | |||| | |
>>>>>>> bash-11560 [001] d... 316.470236: time:
>>>>>(0x7ffcacebae30)
>>>>>>> bash-11560 [001] d... 316.471436: gettimeofday:
>>>>>(0x7ffcacebac50)
>>>>>>> bash-11560 [001] d... 316.477550: time:
>>>>>(0x7ffcacebae30)
>>>>>>> bash-11560 [001] d... 316.477655: time:
>>>>>(0x7ffcacebae30)
>>>>>>> mktemp-11568 [001] d... 316.479589: gettimeofday:
>>>>>(0x7ffc603f0c50)
>>>>>>> date-11571 [001] d... 316.481890: clock_gettime:
>>>>>(0x7ffec9db58d0)
>>>>>>>[...]
>>>>>>>
>>>>>>>If this approach will be decided as fine, I will prepare a better
>>>>>>>version,
>>>>>>>fixing the following things:
>>>>>>>o put vdsofs in generic fs/* dir
>>>>>>>o support other archs and vdso blobs
>>>>>>>o remove BUG_ON()'s and UID==1001 check
>>>>>>>o remove extern's and use headers only
>>>>>>>o refactor code in create_trace_uprobe()
>>>>>>>o add some state to (struct trace_uprobe), so i.e., `cat
>>>>>uprobe_events`
>>>>>>>will
>>>>>>> print those uprobes as vdso-based
>>>>>>>o document this interface in Documentation/trace/uprobetracer.txt
>>>>>>>o prepare nice patches set?
>>>>>>>
>>>>>>>So, opinions? Is it worth to add something like this?
>>>>>>>
>>>>>>>[1]: https://lkml.org/lkml/2016/7/12/346
>>>>>>>
>>>>>>>Dmitry Safonov (3):
>>>>>>> x86/vdso: create vdso file, use it for mapping
>>>>>>> uprobe: drop isdigit() check in create_trace_uprobe
>>>>>>> uprobe: add vdso support
>>>>>>>
>>>>>>>Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
>>>>>>>Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
>>>>>>>Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
>>>>>>>Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>>>>>>>Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>>>>>>Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>>>>>>Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
>>>>>>>Cc: x86@xxxxxxxxxx
>>>>>>>Cc: Dmitry Safonov <0x7f454c46@xxxxxxxxx>
>>>>>>>
>>>>>>>arch/x86/entry/vdso/vma.c | 148
>>>>>>>++++++++++++++++++++++++++++++++++++++++++--
>>>>>>> kernel/trace/trace_uprobe.c | 50 +++++++++++----
>>>>>>> 2 files changed, 180 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> I think there is a lot to be said for this idea. However, a private
>>>>>mapping is definitely wrong for the vvar data; for the vdso code it
>>>>>could be considered either way I suppose.
>>>>>
>>>>>Thanks on your reply.
>>>>>As you could see, I preserved pure mapping of pfn for vvar:
>>>>>7ffcc4b2b000-7ffcc4b2d000 r--p 00000000 00:00 0
>>>>> [vvar]
>>>>>7ffcc4b2d000-7ffcc4b2f000 r-xp 00000000 00:09 18
>>>>> [vdso]
>>>>>(no inode number).
>>>>>I also think it would be useless to do the same to vvar as it
>>>>>has just data and there is no point in probing it.
>>>>
>>>> Well, it would things like mremap() just work and so on. Let's get rid of special cases if we are.
>>>
>>> Well, for RFC it wouldn't move context.vdso pointer on mremap(),
>>> but as RFC is for x86_64 only, it will work on it.
>>> Anyway, I don't think it would be hard to fix and make mremap() work on
>>> other archs on post-RFC.
>>>
>>> The only corner-case I see for now is that /proc/self/map_files/<vdso_range>
>>> will point to [vdso] which is broken link. But one could read this file
>>> and dump/read vdso blob.
>>> So, in the other words: if some program assumes that /proc/self/map_files/*
>>> should always point to correct file, it may be confused. Not sure, maybe
>>> it would be confused by orphane-file mappings, so having dangling link
>>> there is just fine.
>>
>> I don't see anything a priori wrong with having map_files point
>> somewhere, but it could be worth special casing it for special
>> mappings to preserve existing behavior (no file at all).
>
> Yep, that could be easily done, will do.
> Anyway, just curious - what may it break?
>
> Thanks on the reply, Andy. Does the patches set look sane for you?

I mean, the idea, not current big TODO list and nitpicks :)

--
Dmitry