Re: [PATCH 3/5] userns: Don't read extents twice in m_start

From: Christian Brauner
Date: Wed Nov 01 2017 - 12:31:56 EST


On Wed, Nov 01, 2017 at 02:05:39PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 01, 2017 at 06:08:35AM -0500, Eric W. Biederman wrote:
> > Nikolay Borisov <nborisov@xxxxxxxx> writes:
> >
> > > On 1.11.2017 01:48, Eric W. Biederman wrote:
> > >>
> > >> This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
> > >> the map is being written does not do strange things.
> > >>
> > >> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> > >> ---
> > >> kernel/user_namespace.c | 6 ++++--
> > >> 1 file changed, 4 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> > >> index 563a2981d7c7..4f7e357ac1e2 100644
> > >> --- a/kernel/user_namespace.c
> > >> +++ b/kernel/user_namespace.c
> > >> @@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
> > >> struct uid_gid_map *map)
> > >> {
> > >> loff_t pos = *ppos;
> > >> + unsigned extents = map->nr_extents;
> > >> + smp_rmb();
> > >
> > > Barriers need to be paired to work correctly as well as have explicit
> > > comments describing the pairing as per kernel coding style. Checkpatch
> > > will actually produce warning for that particular memory barrier.
> >
> > So please look at the code and read the comment.
>
> What comment, there isn't any, which is what he's complaining about.
>
> > The fact the barrier was not in m_start earlier is strictly speaking a
> > bug.
>
> Sure; doesn't excuse you for not writing sensible comments to go with
> it.
>
> > In practice, except for a very narrow window when this data is changing
> > the one time it can, this code does not matter at all.
> >
> > As for checkpatch I have sympathy for it, checkpatch has a hard job,
> > but I won't listen to checkpatch when it is wrong.
>
> No, undocumented barriers are a royal pain. Memory barriers should come
> with a comment that describes the desired ordering and points to the
> pairing barrier(s).
>
> Also, you probably want READ_ONCE() here and WRITE_ONCE() in
> map_write(), the compiler is free to do unordered byte loads/stores
> without it.
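
To illustrate the READ_ONCE()/WRITE_ONCE() point for anyone following along:
a plain access such as "extents = map->nr_extents" is an ordinary C access,
so the compiler is in principle free to reload it, split it into smaller
accesses, or move it around. A minimal sketch of the marked accesses
(illustration only, not part of the patch below):

	/* reader side, lockless */
	unsigned int extents = READ_ONCE(map->nr_extents);

	/* writer side, in map_write() under userns_state_mutex */
	WRITE_ONCE(map->nr_extents, new_map.nr_extents);

The smp_load_acquire()/smp_store_release() variant suggested just below gives
the same single-marked-access guarantee and adds the ordering on top.
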
>
> And finally, did you want to use smp_store_release() and
> smp_load_acquire() instead?
>
> Something like so perhaps?

Given the discussion this looks fine to me, unless Eric spots some issues I'm
too blind to see. So, for whatever it is worth, this patch would get an ack
from me.
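
To make the pairing easy to see in one place, here is the ordering from the
diff below, condensed (illustration only; map_store_extents() and
map_load_extents() are the helpers the patch introduces, the surrounding
bodies are stripped down):

	/* writer: map_write(), serialized by userns_state_mutex */
	memcpy(map->extent, new_map.extent,
	       new_map.nr_extents * sizeof(new_map.extent[0]));
	/*
	 * Publish the count only after the extents themselves are visible.
	 * The smp_store_release() inside map_store_extents() pairs with the
	 * smp_load_acquire() inside map_load_extents().
	 */
	map_store_extents(map, new_map.nr_extents);

	/* reader: map_id_*() and m_start(), lockless */
	unsigned int idx, extents = map_load_extents(map);

	for (idx = 0; idx < extents; idx++) {
		/* every extent below the acquired count is initialized */
		u32 first = map->extent[idx].first;
		u32 last = first + map->extent[idx].count - 1;
	}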

>
> ---
> kernel/user_namespace.c | 73 ++++++++++++++++++++++++++++++-------------------
> 1 file changed, 45 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index c490f1e4313b..f758911cabd5 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -25,8 +25,47 @@
> #include <linux/fs_struct.h>
>
> static struct kmem_cache *user_ns_cachep __read_mostly;
> +
> +/*
> + * The userns_state_mutex serializes all writes to any given map.
> + *
> + * Any map is only ever written once.
> + *
> + * An id map fits within 1 cache line on most architectures.
> + */
> static DEFINE_MUTEX(userns_state_mutex);
>
> +/*
> + *
> + * There is a one time data dependency between reading the count of the extents
> + * and the values of the extents. The desired behavior is to see the values of
> + * the extents that were written before the count of the extents.
> + *
> + * To achieve this, smp_store_release() is used to guarantee the write order
> + * and smp_load_acquire() guarantees that weakly ordered architectures don't
> + * return stale data.
> + */
> +static inline void map_store_extents(struct uid_gid_map *map, unsigned int extents)
> +{
> + /*
> + * Ensure the map->extent[] stores happen-before we grow map->nr_extents
> + * to cover it. Matches the load_acquire in map_load_extents().
> + */
> + smp_store_release(&map->nr_extents, extents);
> +}
> +
> +static inline unsigned int map_load_extents(struct uid_gid_map *map)
> +{
> + /*
> + * Ensure the map->nr_extents load happens-before we try and access
> + * map->extent[], such that we guarantee the data is in fact there.
> + *
> + * Matches the store-release in map_store_extents().
> + */
> + return smp_load_acquire(&map->nr_extents);
> +}
> +
> +
> static bool new_idmap_permitted(const struct file *file,
> struct user_namespace *ns, int cap_setid,
> struct uid_gid_map *map);
> @@ -206,8 +245,7 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
> id2 = id + count - 1;
>
> /* Find the matching extent */
> - extents = map->nr_extents;
> - smp_rmb();
> + extents = map_load_extents(map);
> for (idx = 0; idx < extents; idx++) {
> first = map->extent[idx].first;
> last = first + map->extent[idx].count - 1;
> @@ -230,8 +268,7 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
> u32 first, last;
>
> /* Find the matching extent */
> - extents = map->nr_extents;
> - smp_rmb();
> + extents = map_load_extents(map);
> for (idx = 0; idx < extents; idx++) {
> first = map->extent[idx].first;
> last = first + map->extent[idx].count - 1;
> @@ -253,8 +290,7 @@ static u32 map_id_up(struct uid_gid_map *map, u32 id)
> u32 first, last;
>
> /* Find the matching extent */
> - extents = map->nr_extents;
> - smp_rmb();
> + extents = map_load_extents(map);
> for (idx = 0; idx < extents; idx++) {
> first = map->extent[idx].lower_first;
> last = first + map->extent[idx].count - 1;
> @@ -543,7 +579,7 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
> struct uid_gid_extent *extent = NULL;
> loff_t pos = *ppos;
>
> - if (pos < map->nr_extents)
> + if (pos < map_load_extents(map))
> extent = &map->extent[pos];
>
> return extent;
> @@ -652,25 +688,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
> char *kbuf = NULL, *pos, *next_line;
> ssize_t ret = -EINVAL;
>
> - /*
> - * The userns_state_mutex serializes all writes to any given map.
> - *
> - * Any map is only ever written once.
> - *
> - * An id map fits within 1 cache line on most architectures.
> - *
> - * On read nothing needs to be done unless you are on an
> - * architecture with a crazy cache coherency model like alpha.
> - *
> - * There is a one time data dependency between reading the
> - * count of the extents and the values of the extents. The
> - * desired behavior is to see the values of the extents that
> - * were written before the count of the extents.
> - *
> - * To achieve this smp_wmb() is used on guarantee the write
> - * order and smp_rmb() is guaranteed that we don't have crazy
> - * architectures returning stale data.
> - */
> mutex_lock(&userns_state_mutex);
>
> ret = -EPERM;
> @@ -790,8 +807,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
> /* Install the map */
> memcpy(map->extent, new_map.extent,
> new_map.nr_extents*sizeof(new_map.extent[0]));
> - smp_wmb();
> - map->nr_extents = new_map.nr_extents;
> +
> + map_store_extents(map, new_map.nr_extents);
>
> *ppos = count;
> ret = count;