Re: [RFC 00/20] ns: Introduce Time Namespace
From: Andrey Vagin
Date: Mon Sep 24 2018 - 16:52:01 EST
On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> Dmitry Safonov <dima@xxxxxxxxxx> writes:
>
> > Discussions around time virtualization have been going on for a long time.
> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> > Since then, the topic has come up on and off in various discussions.
> >
> > There are two main use cases for time namespaces:
> > 1. change date and time inside a container;
> > 2. adjust clocks for a container restored from a checkpoint.
> >
> > "It seems like this might be one of the last major obstacles keeping
> > migration from being used in production systems, given that not all
> > containers and connections can be migrated as long as a time dependency
> > is capable of messing it up." (by github.com/dav-ell)
> >
> > The kernel provides access to several clocks: CLOCK_REALTIME,
> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
> > their start points are not defined and are different for each running
> > system. When a container is migrated from one node to another, all
> > clocks have to be restored into consistent states; in other words, they
> > have to continue running from the same points where they have been
> > dumped.
> >
> > The main idea behind this patch set is adding per-namespace offsets for
> > system clocks. When a process in a non-root time namespace requests
> > time of a clock, a namespace offset is added to the current value of
> > this clock on a host and the sum is returned.
> >
> > All offsets are placed on a separate page, which allows us to map it as
> > part of vvar into user processes and use offsets from vdso calls.
> >
> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> > clocks.
> >
> > Questions to discuss:
> >
> > * Clone flags exhaustion. Currently there is only one unused clone flag
> > bit left, and it may be worth using it to extend the arguments of the clone
> > system call.
> >
> > * Realtime clock implementation details:
> > Is having a simple offset enough?
> > What to do when date and time is changed on the host?
> > Is there a need to adjust vfs modification and creation times?
> > Implementation for adjtime() syscall.
>
> Overall I support this effort. In my quick skim this code looked good.
Hi Eric,
Thank you for the feedback.
>
> My feeling is that we need to be able to support running ntpd and
> support one namespace doing Google's smoothing of leap seconds while
> another namespace takes the leap second.
>
> What I was imagining when I was last thinking about this was one
> instance of struct timekeeper aka tk_core per time namespace. That
> structure already keeps offsets for all of the various clocks from
> the kernel internal time sources. What would be needed would be to
> pass in an appropriate time namespace pointer.
>
> I could be completely wrong as I have not taken the time to completely
> trace through the code. Have you looked at pushing the time namespace
> down as far as tk_core?
>
> What I think would be the big advantage (besides ntp working) is that
> the bulk of the code could be reused, which would allow testing the kernel's
> time code by setting up a new time namespace. So a person in production
> could set up a time namespace with the time set ahead a little bit and
> be able to verify that the kernel handles the upcoming leap second
> properly.
>
It is an interesting idea, but I have a few questions:
1. Does it mean that timekeeping_update() will be called for each
namespace? This function is called periodically; it updates the times in the
timekeeper structure, updates vsyscall_gtod_data, etc. What will the
overhead of this be?
2. What will we do with vdso? It looks like we will have to have a
separate vsyscall_gtod_data for each ns and update each of them
separately.
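
For comparison, here is a rough userspace sketch of what the offset approach
from the cover letter boils down to: read the host clock once, add the
namespace offset, return the sum. The names (struct ns_offsets, ts_add(),
ns_clock_gettime()) are made up for this illustration and are not the ones
used in the patches; treat it as a sketch under those assumptions, not as
the kernel code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical per-namespace offsets; the layout is illustrative only. */
struct ns_offsets {
	struct timespec monotonic;
	struct timespec boottime;
};

/*
 * Add an offset to a host clock value. Assumes host.tv_nsec is in
 * [0, 1s) and |off.tv_nsec| < 1s, so one carry or borrow is enough.
 */
static struct timespec ts_add(struct timespec host, struct timespec off)
{
	struct timespec r;

	r.tv_sec = host.tv_sec + off.tv_sec;
	r.tv_nsec = host.tv_nsec + off.tv_nsec;
	if (r.tv_nsec >= NSEC_PER_SEC) {
		r.tv_nsec -= NSEC_PER_SEC;
		r.tv_sec++;
	} else if (r.tv_nsec < 0) {
		r.tv_nsec += NSEC_PER_SEC;
		r.tv_sec--;
	}
	assert(r.tv_nsec >= 0 && r.tv_nsec < NSEC_PER_SEC);
	return r;
}

/* Read a host clock and return it as a task in the namespace would see it. */
static int ns_clock_gettime(const struct ns_offsets *ns, clockid_t id,
			    struct timespec *out)
{
	struct timespec host;

	if (clock_gettime(id, &host))
		return -1;

	switch (id) {
	case CLOCK_MONOTONIC:
		*out = ts_add(host, ns->monotonic);
		break;
#ifdef CLOCK_BOOTTIME
	case CLOCK_BOOTTIME:
		*out = ts_add(host, ns->boottime);
		break;
#endif
	default:
		/* No offset for other clocks in this sketch. */
		*out = host;
		break;
	}
	return 0;
}
```

In this sketch the per-namespace cost is one addition per clock read, and
nothing has to be updated periodically per namespace. That is the trade-off
I am asking about in the two questions above.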
>
>
> I don't know about the vfs. I think the danger is being able to write
> dates in the future or in the past. It appears that utimes(2) and
> utimensat(2) already allow this except for status change. So it is
> possible we simply don't care. I seem to remember that what nfs does
> is take the time stamp from the host writing to the file.
>
> I think the guide for filesystem timestamps should be to first ensure
> we don't introduce security issues, and then do what distributed
> filesystems do when dealing with hosts with different clocks.
>
> Given those two guidelines above I don't think there is a need to
> change timestamps the way the user namespace changes uid when displayed.
>
>
>
> As for the hardware like the real time clock we definitely should not
> let a root in a time namespace change it. We might even be able to get
> away with leaving the real time clock out of the time namespace. If not
> we need to be very careful how the real time clock is abstracted. I
> would start by leaving the real time clock hardware out of the time
> namespace and see if there is any part of userspace that cares.
>
> Eric
>
> > Cc: Dmitry Safonov <0x7f454c46@xxxxxxxxx>
> > Cc: Adrian Reber <adrian@xxxxxxxx>
> > Cc: Andrei Vagin <avagin@xxxxxxxxxx>
> > Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> > Cc: Christian Brauner <christian.brauner@xxxxxxxxxx>
> > Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
> > Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> > Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Jeff Dike <jdike@xxxxxxxxxxx>
> > Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> > Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
> > Cc: Shuah Khan <shuah@xxxxxxxxxx>
> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > Cc: containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> > Cc: criu@xxxxxxxxxx
> > Cc: linux-api@xxxxxxxxxxxxxxx
> > Cc: x86@xxxxxxxxxx
> >
> > Andrei Vagin (12):
> > ns: Introduce Time Namespace
> > timens: Add timens_offsets
> > timens: Introduce CLOCK_MONOTONIC offsets
> > timens: Introduce CLOCK_BOOTTIME offset
> > timerfd/timens: Take into account ns clock offsets
> > kernel: Take into account timens clock offsets in clock_nanosleep
> > x86/vdso/timens: Add offsets page in vvar
> > x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
> > posix-timers/timens: Take into account clock offsets
> > selftest/timens: Add test for timerfd
> > selftest/timens: Add test for clock_nanosleep
> > timens/selftest: Add timer offsets test
> >
> > Dmitry Safonov (8):
> > timens: Shift /proc/uptime
> > x86/vdso: Restrict splitting vvar vma
> > x86/vdso: Purge timens page on setns()/unshare()/clone()
> > x86/vdso: Look for vvar vma to purge timens page
> > timens: Add align for timens_offsets
> > timens: Optimize zero-offsets
> > selftest: Add Time Namespace test for supported clocks
> > timens/selftest: Add procfs selftest
> >
> > arch/Kconfig | 5 +
> > arch/x86/Kconfig | 1 +
> > arch/x86/entry/vdso/vclock_gettime.c | 52 +++++
> > arch/x86/entry/vdso/vdso-layout.lds.S | 9 +-
> > arch/x86/entry/vdso/vdso2c.c | 3 +
> > arch/x86/entry/vdso/vma.c | 67 +++++++
> > arch/x86/include/asm/vdso.h | 2 +
> > fs/proc/namespaces.c | 3 +
> > fs/proc/uptime.c | 3 +
> > fs/timerfd.c | 16 +-
> > include/linux/nsproxy.h | 1 +
> > include/linux/proc_ns.h | 1 +
> > include/linux/time_namespace.h | 72 +++++++
> > include/linux/timens_offsets.h | 25 +++
> > include/linux/user_namespace.h | 1 +
> > include/uapi/linux/sched.h | 1 +
> > init/Kconfig | 8 +
> > kernel/Makefile | 1 +
> > kernel/fork.c | 3 +-
> > kernel/nsproxy.c | 19 +-
> > kernel/time/hrtimer.c | 8 +
> > kernel/time/posix-timers.c | 89 ++++++++-
> > kernel/time/posix-timers.h | 2 +
> > kernel/time_namespace.c | 230 +++++++++++++++++++++++
> > tools/testing/selftests/timens/.gitignore | 5 +
> > tools/testing/selftests/timens/Makefile | 6 +
> > tools/testing/selftests/timens/clock_nanosleep.c | 98 ++++++++++
> > tools/testing/selftests/timens/config | 1 +
> > tools/testing/selftests/timens/log.h | 21 +++
> > tools/testing/selftests/timens/procfs.c | 145 ++++++++++++++
> > tools/testing/selftests/timens/timens.c | 196 +++++++++++++++++++
> > tools/testing/selftests/timens/timer.c | 95 ++++++++++
> > tools/testing/selftests/timens/timerfd.c | 96 ++++++++++
> > 33 files changed, 1272 insertions(+), 13 deletions(-)
> > create mode 100644 include/linux/time_namespace.h
> > create mode 100644 include/linux/timens_offsets.h
> > create mode 100644 kernel/time_namespace.c
> > create mode 100644 tools/testing/selftests/timens/.gitignore
> > create mode 100644 tools/testing/selftests/timens/Makefile
> > create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
> > create mode 100644 tools/testing/selftests/timens/config
> > create mode 100644 tools/testing/selftests/timens/log.h
> > create mode 100644 tools/testing/selftests/timens/procfs.c
> > create mode 100644 tools/testing/selftests/timens/timens.c
> > create mode 100644 tools/testing/selftests/timens/timer.c
> > create mode 100644 tools/testing/selftests/timens/timerfd.c