[PATCH v2] vfs: add simple direct-mapped dcache lookup front-end

From: George Spelvin
Date: Sun Jun 12 2016 - 23:20:40 EST


This is an old patch by Linus; he asked me to fix the race conditions
in it. Posted for comment on the RCU abuse (search for "Evil RCU
Hack") and on performance in general.

[Linus speaking, Thu May 31, 2012]

I've pushed __d_lookup_rcu() just about as far as I could, and it still
had some problems.

The problems were mainly due to:

- the complexity of the slow-case handling causes register spills,
- the hash chain lookup loop causes not only register pressure, but
  also the extra magic "mask off lock bit from the hash chain head
  pointer" etc. logic, and
- the hash list needs to be dynamically sized (we want *big* caches, but
  you can't use the same size for big and small machines), which causes
  the initial hash lookup itself to be more complex.

This looks like a viable solution to all three problems, and it is
actually surprisingly simple: make a trivial fixed-size direct-mapped L1
dentry cache. No chains, no locking, no nothing.

This gives measurable improvement on my microbenchmark, and gets good
hit-rates on both kernel compiles and even on something like "updatedb",
which I'd have expected to be one of the worst possible cases.
Apparently updatedb still ends up looking up the same files (/etc/fstab
etc) a lot. So those good hit-rates seem to often be due to really
stupid programming, but hey, I think we all agree that "stupid
programming" is likely the common case that we generally do need to also
optimize for ;)

For my kernel compile benchmark ("make -j" on a fully built tree), the
profile shows (this is a kernel-only profile, so user-space overhead
is removed):

8.19% [k] link_path_walk
7.74% [k] __d_lookup_rcu
5.66% [k] selinux_inode_permission
3.73% [k] do_lookup
2.86% [k] path_lookupat
2.72% [k] avc_has_perm_noaudit
2.71% [k] inode_has_perm.isra.49.constprop.71
2.68% [k] avc_lookup
2.51% [k] generic_permission
...
0.78% [k] __d_lookup_rcu_slow
...

where "__d_lookup_rcu_slow()" is the exact same old __d_lookup_rcu(), so
it's not really "slow", but it's quite noticeably slower than the new
streamlined __d_lookup_rcu(). And as you can tell, that means that we
must have a 90%+ hitrate in the new L1 dcache lookup, since we only see
10% as much time in the slow routine as in the L1 front-end.

[George Spelvin speaking]

I fixed the race conditions in Linus's code, added Kconfig support,
and added a number of explanatory comments.
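
For readers who want the shape of Linus's fast path without wading
through the diff, it boils down to something like the following.
This is only a stand-alone user-space sketch with made-up names
(struct entry, l1_lookup(), l1_insert()); the real code is in the
patch below.

#include <stdio.h>
#include <string.h>

#define L1_BITS 13			/* mirrors the Kconfig default */
#define L1_SIZE (1u << L1_BITS)

struct entry {
        unsigned int hash;
        const char *name;
};

/* One slot per index, no chains; a collision simply overwrites. */
static struct entry *l1_cache[L1_SIZE];

static struct entry *l1_lookup(unsigned int hash, const char *name)
{
        struct entry *e = l1_cache[hash & (L1_SIZE - 1)];

        /* One probe, no chain walk; a miss falls back to the slow path. */
        if (e && e->hash == hash && strcmp(e->name, name) == 0)
                return e;
        return NULL;
}

static void l1_insert(struct entry *e)
{
        l1_cache[e->hash & (L1_SIZE - 1)] = e;
}

int main(void)
{
        struct entry fstab = { .hash = 0x1234, .name = "fstab" };

        l1_insert(&fstab);
        printf("hit: %s\n", l1_lookup(0x1234, "fstab") ? "yes" : "no");
        return 0;
}

The real patch stores dentry pointers instead, fills empty slots with
a sentinel dentry rather than NULL, and revalidates the result against
d_seq, but the fast path is still one load plus a handful of compares.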

I have two concerns about the performance of this code:

1) Since this patch was first written, RCU dcache lookup has gone
completely lockless and has become even faster.
2) By adding a shared data structure that is written randomly by multiple
CPUs, this patch undoes a lot of that optimization. Even if it's a
win on a single-socket system, it may be a net loss on a larger one.

Thanks to Fengguang Wu's 0day build bot for catching a race that
triggered a WARN_ON in v1 of this patch. dentry_delay_free() is now
much more careful about what it overwrites.

Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: Nick Piggin <npiggin@xxxxxxxxx>
Cc: Miklos Szeredi <mszeredi@xxxxxxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: George Spelvin <linux@xxxxxxxxxxxxxxxxxxx>
---
The 0day build bot caught a race condition I added: dentry_delay_free
was too eager to set the callback pointer, tripping the
WARN_ON(!hlist_unhashed) in dentry_free(). Updated to be more
conservative, with analysis.

That actually reduced the overhead slightly. v1 had to have
dentry_free() adjust d_name.name in unusual cases so that the
delayed callback could tell which final callback to invoke.

v2 doesn't need that; it inspects the function pointer instead.
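
Abstracted away from the dcache, the v2 trick looks roughly like the
user-space sketch below. Only dentry_delay_free() in the patch is the
real thing; cb_head, free_now() and friends are invented for the
illustration, and the kernel version uses READ_ONCE()/WRITE_ONCE()
and knows about two possible callbacks (__d_free and
__d_free_external).

#include <stdio.h>

struct cb_head {
        void (*func)(struct cb_head *);
};

static void free_now(struct cb_head *head)
{
        (void)head;
        printf("freed at the end of this grace period\n");
}

static void free_later(struct cb_head *head)
{
        (void)head;
        printf("freed one grace period later\n");
}

/*
 * If the head already holds a known callback, it has been queued for
 * freeing: retarget it at the delaying variant.  Any other value means
 * the field is still being used for something else (d_u.d_alias.pprev
 * in the dcache case), so leave it alone.
 */
static void delay_free(struct cb_head *head)
{
        if (head->func == free_now)
                head->func = free_later;
}

int main(void)
{
        struct cb_head head = { .func = free_now };

        delay_free(&head);
        head.func(&head);       /* prints the "one grace period later" line */
        return 0;
}

The function-pointer comparison is what replaces v1's d_name.name
tagging.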

fs/Kconfig | 28 ++++++++
fs/dcache.c | 226 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 250 insertions(+), 4 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index b8fcb416..f0c8ed44 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -8,6 +8,34 @@ menu "File systems"
config DCACHE_WORD_ACCESS
bool

+config L1_DCACHE_BITS
+ int "Dcache level-1 cache size (bits)"
+ range 0 20
+ default 0 if !EXPERT
+ default 0 if NUMA
+ default 10 if BASE_SMALL
+ default 13
+ help
+ The Linux kernel maintains a large cache of "dentries"
+ (directory entries) for the performance-critical task of
+ converting file names to inodes. This option enables a smaller
+ direct-mapped "level-1 cache" in front of the main dcache.
+
+ (This software "dcache" is quite different from the CPU's data
+ cache, or "D-cache". Sorry for the confusingly similar names.)
+
+ This option specifies the size of this cache, as a power of 2.
+ For example, 13 means 2^13 = 8192 entries in the L1 dcache.
+ Specify 0 to turn off the L1 dcache entirely.
+
+ The cost of enabling this is one pointer per entry, plus a
+ small amount of code.
+
+ This is an experimental feature intended to speed up
+ single-socket machines. On larger systems, the extra updates
+ generated by the L1 dcache probably cause too much cache-line
+ bouncing to be worth it.
+
if BLOCK

source "fs/ext2/Kconfig"
diff --git a/fs/dcache.c b/fs/dcache.c
index 35c989f2..b0451659 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -278,6 +278,64 @@ static inline int dname_external(const struct dentry *dentry)
return dentry->d_name.name != dentry->d_iname;
}

+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * Reschedule RCU reclamation for the *next* grace period.
+ * That's a longer wait than actually necessary, but this is so rare
+ * that optimizing it is not important.
+ */
+static void __d_free_delayed(struct rcu_head *head)
+{
+ call_rcu(head, __d_free);
+}
+
+static void __d_free_external_delayed(struct rcu_head *head)
+{
+ call_rcu(head, __d_free_external);
+}
+
+/*
+ * This is an Evil RCU Hack. A rare race condition leaks an RCU-protected
+ * pointer to a freed dentry to another CPU. An even rarer race results
+ * in the first CPU exiting an RCU read locked region and triggering
+ * the end of the grace period while the second CPU is still using the
+ * pointer. So if we detect the first race happening, we call this to
+ * defer the RCU reclamation for a second grace period.
+ *
+ * The evil part: we can't remove it from the singly-linked RCU callback
+ * list, but we can safely change the callback function out from under
+ * the RCU code. It's safe because RCU provides memory barriers between
+ * this CPU going idle and the CPU calling the callbacks, i.e. before
+ * that CPU reads the callback function pointers.
+ *
+ * The other evil part: We have to be sure that it's being used as an
+ * RCU callback pointer. It might still be d_u.d_alias.pprev.
+ * To solve this, read the pointer and see if it points to a function.
+ *
+ * This is safe because our caller and __call_rcu both have memory
+ * barriers which pair as follows:
+ *
+ *   d_add_to_l1():                    __call_rcu():
+ *   dcache_l1[n] = &invalid_dentry    head->func = func
+ *   smp_mb()                          smp_mb();
+ *   dentry_delay_free()               Add head to RCU_NEXT_TAIL
+ *
+ * The barriers pair in both directions. If the head has been queued
+ * inside RCU, then we are guaranteed to see the func set. If we don't
+ * see func set, then L1 invalidation has been seen by RCU readers and
+ * there's no need to delay the reclamation.
+ */
+static void dentry_delay_free(struct rcu_head *head)
+{
+ void (*func)(struct rcu_head *) = READ_ONCE(head->func);
+
+ if (func == __d_free)
+ WRITE_ONCE(head->func, __d_free_delayed);
+ else if (func == __d_free_external)
+ WRITE_ONCE(head->func, __d_free_external_delayed);
+}
+#endif
+
static inline void __d_set_inode_and_type(struct dentry *dentry,
struct inode *inode,
unsigned type_flags)
@@ -459,6 +517,120 @@ static void dentry_lru_add(struct dentry *dentry)
d_lru_add(dentry);
}

+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * In front of the main dcache, we have an inclusive "L1 dcache".
+ * This is a simple direct-mapped hash table.
+ *
+ * The L1 dcache is *written* to locklessly, which makes for extremely
+ * delicate synchronization. The main point is that write/write races
+ * don't matter. As long as the reader gets *some* valid dentry pointer,
+ * it doesn't much matter which.
+ *
+ * The nasty cases occur when we add an entry to the L1 cache as it's
+ * being deallocated. There's no way to atomically check for validity
+ * and store, so we have to notice the problem after storing the entry and
+ * undo the damage. RCU helps a lot here, but doesn't solve everything.
+ */
+
+/*
+ * This has a NULL parent and zero length, and will thus
+ * never match anything. But it means that the dcache_l1
+ * array never contains NULL, so you don't need to check.
+ */
+static struct dentry invalid_dentry;
+
+#define L1_DCACHE_SIZE (1u << CONFIG_L1_DCACHE_BITS)
+
+static struct dentry *dcache_l1[L1_DCACHE_SIZE] = {
+ [0 ... L1_DCACHE_SIZE-1] = &invalid_dentry
+};
+
+static unsigned int dentry_l1_index(unsigned int hash, const struct dentry *dir)
+{
+ hash += (unsigned long) dir / L1_CACHE_BYTES;
+ hash = hash + (hash >> CONFIG_L1_DCACHE_BITS);
+ return hash & (L1_DCACHE_SIZE-1);
+}
+
+static struct dentry *d_lookup_l1(unsigned int hash, const struct dentry *dir)
+{
+ unsigned int n = dentry_l1_index(hash, dir);
+
+ return READ_ONCE(dcache_l1[n]);
+}
+
+static void d_remove_from_l1(const struct dentry *dentry)
+{
+ unsigned int n = dentry_l1_index(dentry->d_name.hash, dentry->d_parent);
+
+ WRITE_ONCE(dcache_l1[n], &invalid_dentry);
+}
+
+/*
+ * Add a dentry to L1, and take it right back out again if it was changed
+ * in the meantime. There are two races we have to worry about:
+ * 1) A race with __d_drop (but not __dentry_kill):
+ * An invalid entry just causes an L1 miss, but we have to be sure that
+ * *someone* removes the dentry from L1. __d_drop removes it *after*
+ * bumping the seqno, so if we don't see the seqno jump, it's safe to
+ * leave it in L1.
+ * 2) A race with __dentry_kill. This is a *very* narrow race, and probably
+ * impossible to hit without preemption, but it exists.
+ *
+ * Suppose that just as we (on CPU #1) entered this function, CPU #2
+ * performed __dentry_kill. I.e. after we checked the dentry for
+ * validity but before the actual store. We noticed the problem after
+ * the store and are about to pull the entry out, but the following
+ * events happen in the following order:
+ * a) CPU #2 schedules reclamation with call_rcu(__d_free).
+ * b) CPU #3 enters an RCU read-side critical section and fetches
+ * the pointer from the L1 array.
+ * c) Then CPU #1 invalidates the L1 cache entry.
+ * d) Then CPU #1 goes RCU-idle (e.g. returns to user space).
+ * e) This ends an RCU grace period and triggers reclamation.
+ * f) Then CPU #3 tries to read from the reclaimed dentry. Oops!
+ * The problem is that RCU depends on knowing that, after call_rcu(),
+ * only CPUs already in read-side critical sections can possibly
+ * access the object, so CPU #3 can't see it. By publishing the
+ * pointer in dcache_l1[], we broke that guarantee.
+ *
+ * The solution is to delay reclamation a little bit more.
+ */
+static bool d_add_to_l1(struct dentry *dentry, unsigned int hash,
+ const struct dentry *parent, unsigned int seq)
+{
+ unsigned int n = dentry_l1_index(hash, parent);
+ bool b;
+
+ WRITE_ONCE(dcache_l1[n], dentry);
+ smp_mb(); /* Pairs with dentry_rcuwalk_invalidate in __d_drop */
+ b = __read_seqcount_retry(&dentry->d_seq, seq);
+ if (unlikely(b)) {
+ WRITE_ONCE(dcache_l1[n], &invalid_dentry);
+ smp_mb(); /* Pairs with lock/unlock(dentry->d_lock) */
+ if (dentry->d_flags & DCACHE_DENTRY_KILLED)
+ dentry_delay_free(&dentry->d_u.d_rcu);
+ }
+ return !b;
+}
+#else
+static inline void d_remove_from_l1(const struct dentry *dentry)
+{
+ (void)dentry;
+}
+
+static inline bool d_add_to_l1(struct dentry *dentry, unsigned int hash,
+ const struct dentry *parent, unsigned int seq)
+{
+ (void)dentry;
+ (void)hash;
+ (void)parent;
+ (void)seq;
+ return true;
+}
+#endif
+
/**
* d_drop - drop a dentry
* @dentry: dentry to drop
@@ -493,6 +665,7 @@ void __d_drop(struct dentry *dentry)
dentry->d_hash.pprev = NULL;
hlist_bl_unlock(b);
dentry_rcuwalk_invalidate(dentry);
+ d_remove_from_l1(dentry);
}
}
EXPORT_SYMBOL(__d_drop);
@@ -2058,7 +2231,7 @@ static noinline enum slow_d_compare slow_dentry_cmp(
}

/**
- * __d_lookup_rcu - search for a dentry (racy, store-free)
+ * __d_lookup_rcu_slow - search for a dentry (racy, store-free)
* @parent: parent dentry
* @name: qstr of name we wish to find
* @seqp: returns d_seq value at the point where the dentry was found
@@ -2086,7 +2259,7 @@ static noinline enum slow_d_compare slow_dentry_cmp(
* NOTE! The caller *has* to check the resulting dentry against the sequence
* number we've returned before using any of the resulting dentry state!
*/
-struct dentry *__d_lookup_rcu(const struct dentry *parent,
+static noinline struct dentry *__d_lookup_rcu_slow(const struct dentry *parent,
const struct qstr *name,
unsigned *seqp)
{
@@ -2157,12 +2330,57 @@ seqretry:
if (dentry->d_name.hash_len != hashlen)
continue;
*seqp = seq;
- if (!dentry_cmp(dentry, str, hashlen_len(hashlen)))
- return dentry;
+ if (dentry_cmp(dentry, str, hashlen_len(hashlen)))
+ continue;
+ if (!d_add_to_l1(dentry, hashlen_hash(hashlen), parent, seq))
+ goto seqretry;
+ return dentry;
}
return NULL;
}

+#if CONFIG_L1_DCACHE_BITS > 0
+/*
+ * Fast non-chained L1 hash lookup.
+ *
+ * This queries a fixed-size non-chained "level 1" hash table
+ * before falling back to the main dcache.
+ *
+ * The lookup is simplified by not putting anything in the hash
+ * table that would complicate the fast path. In particular:
+ * - We don't need to worry about NULL pointers because the cache
+ * is filled with dummy sentinel dentries that match nothing.
+ * - We don't need to worry about DCACHE_OP_COMPARE, because such
+ * dentries are never added to the L1 cache.
+ */
+struct dentry *__d_lookup_rcu(const struct dentry *parent,
+ const struct qstr *name,
+ unsigned int *seqp)
+{
+ u64 hashlen = name->hash_len;
+ struct dentry *dentry = d_lookup_l1(hashlen_hash(hashlen), parent);
+ unsigned int seq = raw_seqcount_begin(&dentry->d_seq);
+
+ do {
+ if (unlikely(dentry->d_parent != parent))
+ break;
+ if (unlikely(dentry->d_name.hash_len != hashlen))
+ break;
+ *seqp = seq;
+ if (unlikely(dentry_cmp(dentry, name->name, hashlen_len(hashlen))))
+ break;
+ if (unlikely(d_unhashed(dentry)))
+ break;
+ return dentry;
+ } while (0);
+ return __d_lookup_rcu_slow(parent, name, seqp);
+}
+#else
+__alias(__d_lookup_rcu_slow) struct dentry *__d_lookup_rcu(
+ const struct dentry *parent,
+ const struct qstr *name, unsigned int *seqp);
+#endif
+
/**
* d_lookup - search for a dentry
* @parent: parent dentry
--
2.8.1