Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command
From: Joel Fernandes
Date: Tue Jul 31 2018 - 17:56:56 EST
On Mon, Jul 30, 2018 at 09:03:18PM -0700, Y Song wrote:
> On Mon, Jul 30, 2018 at 7:06 PM, Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > On Mon, Jul 30, 2018 at 07:01:22PM -0700, Joel Fernandes wrote:
> >> On Sun, Jul 29, 2018 at 06:51:18PM +0300, Alexei Starovoitov wrote:
> >> > On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <dancol@xxxxxxxxxx> wrote:
> >> > > BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
> >> > > map made by a BPF program that is running at the time the
> >> > > BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
> >> > > to provide a means for userspace to replace a BPF map with another,
> >> > > newer version, then ensure that no component is still using the "old"
> >> > > map before manipulating the "old" map in some way.
> >> > >
> >> > > Signed-off-by: Daniel Colascione <dancol@xxxxxxxxxx>
> >> > > ---
> >> > > include/uapi/linux/bpf.h | 9 +++++++++
> >> > > kernel/bpf/syscall.c | 13 +++++++++++++
> >> > > 2 files changed, 22 insertions(+)
> >> > >
> >> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> > > index b7db3261c62d..5b27e9117d3e 100644
> >> > > --- a/include/uapi/linux/bpf.h
> >> > > +++ b/include/uapi/linux/bpf.h
> >> > > @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
> >> > > __u8 data[0]; /* Arbitrary size */
> >> > > };
> >> > >
> >> > > +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
> >> > > + * BPF map made by a BPF program that is running at the time the
> >> > > + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
> >> >
> >> > that doesn't sound right to me.
> >> > such command won't wait for the release of the references.
> >> > in case of map-in-map the program does not hold
> >> > the references to inner map (only to outer map).
> >>
> >> I didn't follow this completely.
> >>
> >> The userspace program is using the inner map per your description of the
> >> algorithm for using map-in-map to solve the race conditions that this patch
> >> is trying to address:
> >>
> >> If you don't mind, I copy-pasted it below from your netdev post:
> >>
> >> if you use map-in-map you don't need extra boolean map.
> >> 0. bpf prog can do
> >> inner_map = lookup(map_in_map, key=0);
> >> lookup(inner_map, your_real_key);
> >> 1. user space writes into map_in_map[0] <- FD of new map
> >> 2. some cpus are using old inner map and some a new
> >> 3. user space does sys_membarrier(CMD_GLOBAL) which will do synchronize_sched()
> >> which in CONFIG_PREEMPT_NONE=y servers is the same as synchronize_rcu()
> >> which will guarantee that progs finished.
> >> 4. scan old inner map
> >>
> >> In step 2, as you mentioned there are CPUs using different inner maps. So
> >> could you clarify how the synchronize_rcu mechanism will even work if you're
> >> now saying "program does not hold references to the inner maps"?
>
> The program only held references to the outer maps, and the outer map
> held references to the inner maps. The user space program can add/remove
> the inner map for a particular outer map while the prog <-> outer-map
> relationship is not changed.
My definition of "reference" in this context is protection by rcu_read_lock.
So I was concerned the above map-in-map access isn't protected as such when
Alexei said "program doesn't have reference on inner map" in the above steps.
Maybe I misunderstood what "reference" means here.
To make the map-in-map approach work for Chenbo/Lorenzo's use case, both the
access to the outer map at key=0 and the access to the inner map have to be
protected by rcu_read_lock so that the membarrier call will work.
So basically step 0 in the steps above should be rcu_read_lock protected to
satisfy Chenbo/Lorenzo's use case.
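
To illustrate the requirement, here is a minimal single-threaded user-space
toy model, not kernel code. Every name in it (model_read_lock, prog_lookup,
swap_inner, model_grace_period_elapsed) is hypothetical, and the reader
counter only stands in for rcu_read_lock; it is not how RCU is implemented.
The point is just that both lookups of step 0 sit inside one read-side
section, so waiting for readers after the swap of step 1 covers the inner
map access too.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: all types and helpers below are hypothetical. */
#define INNER_SLOTS 4

struct inner_map { int values[INNER_SLOTS]; };
struct outer_map { struct inner_map *slot0; };   /* models map_in_map[0] */

/* Stand-in for rcu_read_lock() nesting; real RCU does not work this way. */
static int readers_in_section;

static void model_read_lock(void)
{
    readers_in_section++;
}

static void model_read_unlock(void)
{
    assert(readers_in_section > 0);
    readers_in_section--;
}

/* Step 0: outer lookup and inner lookup share one read-side section. */
static int prog_lookup(struct outer_map *outer, size_t key)
{
    int value;

    model_read_lock();
    struct inner_map *inner = outer->slot0;      /* lookup(map_in_map, 0) */
    value = inner->values[key % INNER_SLOTS];    /* lookup(inner_map, key) */
    model_read_unlock();
    return value;
}

/* Step 1: user space installs the new inner map and gets the old one back. */
static struct inner_map *swap_inner(struct outer_map *outer,
                                    struct inner_map *new_map)
{
    struct inner_map *old = outer->slot0;

    outer->slot0 = new_map;
    return old;
}

/* Step 3 stand-in: returns 1 once no reader is inside a read-side section. */
static int model_grace_period_elapsed(void)
{
    return readers_in_section == 0;
}
```

In this model the old inner map returned by swap_inner is only safe to scan
(step 4) once model_grace_period_elapsed() returns 1. If the inner lookup
were moved outside the model_read_lock/model_read_unlock pair, that check
would say nothing about readers still using the old inner map.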
I know today the entire program is run with preemption disabled (unless
something changed) so this shouldn't be a problem, but in the future if the
verifier is doing similar things at a finer-grained level, then the above has
to be taken into consideration.
Does that make sense or am I missing something?
thanks,
- Joel