Re: [PATCH 01/16] rcu/tree: Add a work to allocate pages from regular context
From: Uladzislau Rezki
Date: Wed Nov 04 2020 - 13:38:55 EST
> > > > * This is a per-CPU structure. The reason that it is not included in
> > > > @@ -3100,6 +3103,11 @@ struct kfree_rcu_cpu {
> > > > bool monitor_todo;
> > > > bool initialized;
> > > > int count;
> > > > +
> > > > + struct work_struct page_cache_work;
> > > > + atomic_t work_in_progress;
> > >
> > > Does it need to be atomic? run_page_cache_work() is only called under a lock.
> > > You can use xchg() there. And when you do the atomic_set, you can use
> > > WRITE_ONCE as it is a data-race.
> > >
> > We can use xchg() together with the *_ONCE() macros. Could you please clarify
> > what your concern about using atomic_t is? Both xchg() and atomic_xchg()
> > guarantee atomicity. Same as WRITE_ONCE() or atomic_set().
>
> Right, whether there's lock or not does not matter as xchg() is also
> atomic-swap.
>
> atomic_t is a more complex type though, I would directly use int since
> atomic_t is not needed here and there's no lost-update issue here. It could
> be a matter of style as well.
>
> BTW I did think atomic_xchg() adds additional memory barriers
> but I could not find that to be the case in the implementation. Is that not
> the case? Docs say "atomic_xchg must provide explicit memory barriers around
> the operation.".
>
On most systems atomic_xchg() is the same as xchg() and atomic_set()
is the same as WRITE_ONCE(). But there are exceptions, for example "parisc":
*** arch/parisc/include/asm/atomic.h:
<snip>
...
#define _atomic_spin_lock_irqsave(l,f) do { \
	arch_spinlock_t *s = ATOMIC_HASH(l); \
	local_irq_save(f); \
	arch_spin_lock(s); \
} while(0)
...
static __inline__ void atomic_set(atomic_t *v, int i)
{
	unsigned long flags;
	_atomic_spin_lock_irqsave(v, flags);
	v->counter = i;
	_atomic_spin_unlock_irqrestore(v, flags);
}
<snip>
I will switch to xchg() and WRITE_ONCE() because of such arch-specific
implementations.
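
A rough user-space sketch of the plain-int pattern, for illustration only.
The macro definitions are stand-ins built on GCC/Clang builtins (not the
kernel's arch-specific ones), and struct krc_sketch, nr_queued and
demo_nr_queued() are hypothetical names that are not part of the patch:

```c
#include <assert.h>

/*
 * User-space stand-ins for the kernel macros. Assumption: GCC/Clang
 * builtins; the real kernel definitions are architecture specific.
 */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define xchg(ptr, val)     __atomic_exchange_n((ptr), (val), __ATOMIC_SEQ_CST)

/* Hypothetical, cut-down stand-in for struct kfree_rcu_cpu. */
struct krc_sketch {
	int work_in_progress;	/* plain int instead of atomic_t */
	int nr_queued;		/* how many times the work got "queued" */
};

/* Queue the refill work only if it is not already pending. */
static void run_page_cache_work(struct krc_sketch *krcp)
{
	if (!xchg(&krcp->work_in_progress, 1))
		krcp->nr_queued++;	/* stands in for queueing page_cache_work */
}

/* Worker body: the cache refill is elided; then allow re-queueing. */
static void fill_page_cache_func(struct krc_sketch *krcp)
{
	WRITE_ONCE(krcp->work_in_progress, 0);
}

/* Two back-to-back requests queue once; after the worker ran, once more. */
static int demo_nr_queued(void)
{
	struct krc_sketch krc = { 0, 0 };

	run_page_cache_work(&krc);
	run_page_cache_work(&krc);
	assert(krc.nr_queued == 1);

	fill_page_cache_func(&krc);
	run_page_cache_work(&krc);
	return krc.nr_queued;
}
```

The xchg() returns the previous value, so only the caller that flips the
flag from 0 to 1 queues the work; the worker clears it with a plain marked
store once it is done.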
> > > > @@ -4449,24 +4482,14 @@ static void __init kfree_rcu_batch_init(void)
> > > >
> > > > for_each_possible_cpu(cpu) {
> > > > struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> > > > - struct kvfree_rcu_bulk_data *bnode;
> > > >
> > > > for (i = 0; i < KFREE_N_BATCHES; i++) {
> > > > INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
> > > > krcp->krw_arr[i].krcp = krcp;
> > > > }
> > > >
> > > > - for (i = 0; i < rcu_min_cached_objs; i++) {
> > > > - bnode = (struct kvfree_rcu_bulk_data *)
> > > > - __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > > -
> > > > - if (bnode)
> > > > - put_cached_bnode(krcp, bnode);
> > > > - else
> > > > - pr_err("Failed to preallocate for %d CPU!\n", cpu);
> > > > - }
> > > > -
> > > > INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> > > > + INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
> > > > krcp->initialized = true;
> > >
> > > During initialization, is it not better to still pre-allocate? That way you
> > > don't have to wait to get into a situation where you need to initially
> > > allocate.
> > >
> > Since we have a worker that does it when the cache is empty, there is no
> > strong need to do it during the initialization phase. If we can reduce
> > the amount of code, it is always good :)
>
> I am all for not having more code than needed. But you would hit
> synchronize_rcu() slow path immediately on first headless kfree_rcu() right?
> That seems like a step back from the current code :)
>
As for the slow path and hitting synchronize_rcu() immediately: yes, the
slow-path hit "counter" will be increased by 1, so the difference between
the two variants will be N versus N + 1 hits. I do not consider N + 1 a big
difference or a real impact on performance.
Should we guarantee that a first user does not hit a fallback path that
invokes synchronize_rcu()? If not, I would rather remove the redundant code.
Any thoughts here?
Thanks!
--
Vlad Rezki