Re: [PATCH v8 6/7] apei/ghes: Use unrcu_pointer for cmpxchg

From: Ard Biesheuvel
Date: Fri Oct 14 2022 - 10:32:00 EST


On Fri, 14 Oct 2022 at 14:00, Justin He <Justin.He@xxxxxxx> wrote:
>
> Hi Ard
>
> > -----Original Message-----
> > From: Ard Biesheuvel <ardb@xxxxxxxxxx>
> > Sent: Thursday, October 13, 2022 11:41 PM
> > To: Borislav Petkov <bp@xxxxxxxxx>
> > Cc: Justin He <Justin.He@xxxxxxx>; Len Brown <lenb@xxxxxxxxxx>; James
> > Morse <James.Morse@xxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>; Mauro
> > Carvalho Chehab <mchehab@xxxxxxxxxx>; Robert Richter <rric@xxxxxxxxxx>;
> > Robert Moore <robert.moore@xxxxxxxxx>; Qiuxu Zhuo
> > <qiuxu.zhuo@xxxxxxxxx>; Yazen Ghannam <yazen.ghannam@xxxxxxx>; Jan
> > Luebbe <jlu@xxxxxxxxxxxxxx>; Khuong Dinh
> > <khuong@xxxxxxxxxxxxxxxxxxxxxx>; Kani Toshi <toshi.kani@xxxxxxx>;
> > linux-acpi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > linux-edac@xxxxxxxxxxxxxxx; devel@xxxxxxxxxx; Rafael J . Wysocki
> > <rafael@xxxxxxxxxx>; Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>; Jarkko
> > Sakkinen <jarkko@xxxxxxxxxx>; linux-efi@xxxxxxxxxxxxxxx; nd <nd@xxxxxxx>;
> > kernel test robot <lkp@xxxxxxxxx>
> > Subject: Re: [PATCH v8 6/7] apei/ghes: Use unrcu_pointer for cmpxchg
> >
> > On Thu, 13 Oct 2022 at 15:37, Borislav Petkov <bp@xxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 12, 2022 at 12:04:57PM +0000, Justin He wrote:
> > > > I have a concern about what if cmpxchg failed? Do we have to still
> > > > guarantee the ordering since cmpxchg will not imply a smp_mb if it
> > > > failed.
> > >
> > > Of course it will imply that. At least on x86 it does. smp_wmb() is a
> > > compiler barrier there and cmpxchg() already has that barrier
> > > semantics by clobbering "memory". I'm pretty sure you should have the
> > > same thing on ARM.
> > >
> >
> > No it definitely does not imply that. A memory clobber is a codegen construct,
> > and the hardware could still complete the writes in a way that could result in
> > another observer seeing a mix of old and new values that is inconsistent with
> > the ordering of the stores as issued by the compiler.
> >
> > > The comment above the barrier says, "new_cache must be put into array after its contents are written".
> > >
> > > Are we writing anything into the cache if cmpxchg fails?
> > >
> >
> > The cache fields get updated but the pointer to the struct is never shared
> > globally if the cmpxchg() fails so not having the barrier on failure should not be
> > an issue here.
> >
> > > > Besides, I didn't find the paired smp_mb or smp_rmb for this smp_wmb.
> > >
> > > Why would there be pairs? I don't understand that statement here.
> > >
> >
> > Typically, the other observer pairs the write barrier with a read barrier.
> >
> > In this case, the other observer appears to be ghes_estatus_cached(), and the
> > reads of the cache struct fields must be ordered after the read of the cache
> > struct's address. However, there is an implicit ordering there through an address
> > dependency (you cannot dereference a struct without knowing its address) so
> > the ordering is implied (and
> > rcu_dereference() has a READ_ONCE() inside so we are guaranteed to always
> > dereference the same struct, even if the array slot gets updated concurrently.)
> >
> > If you want to get rid of the barrier, you could drop it and change the cmpxchg()
> > to cmpxchg_release().
> >
> > Justin: so why are the RCU_INITIALIZER()s needed here?
>
> In this patch I add "__rcu" to the definition of ghes_estatus_caches. Hence, without
> this RCU_INITIALIZER cast, Sparse would still report the following warnings on x86:
> drivers/acpi/apei/ghes.c:843:27: sparse: warning: incorrect type in initializer (different address spaces)
> drivers/acpi/apei/ghes.c:843:27: sparse: expected struct ghes_estatus_cache [noderef] __rcu *__old
> drivers/acpi/apei/ghes.c:843:27: sparse: got struct ghes_estatus_cache *[assigned] slot_cache
> drivers/acpi/apei/ghes.c:843:27: sparse: warning: incorrect type in initializer (different address spaces)
> drivers/acpi/apei/ghes.c:843:27: sparse: expected struct ghes_estatus_cache [noderef] __rcu *__new
> drivers/acpi/apei/ghes.c:843:27: sparse: got struct ghes_estatus_cache *[assigned] new_cache
>
> On Arm, IMO the cmpxchg macro doesn't care about it; that is, sparse will not report warnings with or
> without the RCU_INITIALIZER cast.
>
> I tend to keep this cast; what do you think?
>

OK, fair enough, I had only tested the arm64 build myself.

But scattering unrcu_pointer() and RCU_INITIALIZER() around just to
shut up sparse is a bit sloppy, imho. The way this code passes
pointers around is what makes them necessary, unfortunately, so it
would be nice if we could clean that up by getting rid of the
slot_cache variable.
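
To spell out what sparse is checking: on x86 the cmpxchg() macro
assigns the old and new values to temporaries typed after *ptr, so
once the array is annotated __rcu, plain pointers passed in trigger
the address-space warnings quoted above. As a rough sketch, not the
literal hunk from the v8 patch:

	/* warns: plain pointers end up in __rcu-typed temporaries */
	cmpxchg(ghes_estatus_caches + slot, slot_cache, new_cache);

	/* clean: RCU_INITIALIZER() adds the __rcu qualifier for the checker */
	cmpxchg(ghes_estatus_caches + slot,
		RCU_INITIALIZER(slot_cache), RCU_INITIALIZER(new_cache));

The casts only exist to keep the annotations consistent; they have no
effect on code generation.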

And now that I have spent some time looking at this code, I wonder
what the point of the cmpxchg() is in the first place.
ghes_estatus_cache_add() selects a slot, and either succeeds in
replacing its contents with a pointer to a new cached item, or it just
gives up and frees the new item again, without attempting to select
another slot even if one might be available.

Since we only insert new items, the race can only cause a failure if
the selected slot was updated with another new item concurrently,
which means that it is arbitrary which of those two items gets
dropped. This means we don't need the cmpxchg() and the special case,
and we can just drop the existing item unconditionally. Note that this
does not result in the loss of error events; it simply means we might
cause a false cache miss and report the same event one additional time
in quick succession, even though the cache should have prevented that.
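
To make the pairing with the reader side explicit, here is a
condensed, self-contained sketch of the pattern the diff below relies
on (made-up names, not the actual ghes code): the writer publishes
with release semantics, and the reader orders its field accesses
through the address dependency that rcu_dereference() provides.

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct item {
	int payload;
	struct rcu_head rcu;
};

static struct item __rcu *slot;

/*
 * Writer: fill in the item, then publish it. The release ordering on
 * the xchg ensures that a reader which observes the new pointer also
 * observes ->payload.
 */
static void publish(struct item *new, int payload)
{
	struct item __rcu *old;

	new->payload = payload;
	old = xchg_release(&slot, RCU_INITIALIZER(new));
	if (old)
		kfree_rcu(unrcu_pointer(old), rcu);
}

/*
 * Reader: rcu_dereference() provides the address dependency, so the
 * load of ->payload is ordered after the load of the pointer and no
 * explicit barrier is needed here.
 */
static int read_payload(void)
{
	struct item *it;
	int val = -1;

	rcu_read_lock();
	it = rcu_dereference(slot);
	if (it)
		val = it->payload;
	rcu_read_unlock();
	return val;
}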


------------------------8<------------------------------
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 80ad530583c9..03acdfa35dab 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -138,7 +138,7 @@ struct ghes_vendor_record_entry {
static struct gen_pool *ghes_estatus_pool;
static unsigned long ghes_estatus_pool_size_request;

-static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
+static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;

static int ghes_panic_timeout __read_mostly = 30;
@@ -773,31 +773,26 @@ static struct ghes_estatus_cache *ghes_estatus_cache_alloc(
return cache;
}

-static void ghes_estatus_cache_free(struct ghes_estatus_cache *cache)
+static void ghes_estatus_cache_rcu_free(struct rcu_head *head)
{
+ struct ghes_estatus_cache *cache;
u32 len;

+ cache = container_of(head, struct ghes_estatus_cache, rcu);
len = cper_estatus_len(GHES_ESTATUS_FROM_CACHE(cache));
len = GHES_ESTATUS_CACHE_LEN(len);
gen_pool_free(ghes_estatus_pool, (unsigned long)cache, len);
atomic_dec(&ghes_estatus_cache_alloced);
}

-static void ghes_estatus_cache_rcu_free(struct rcu_head *head)
-{
- struct ghes_estatus_cache *cache;
-
- cache = container_of(head, struct ghes_estatus_cache, rcu);
- ghes_estatus_cache_free(cache);
-}
-
static void ghes_estatus_cache_add(
struct acpi_hest_generic *generic,
struct acpi_hest_generic_status *estatus)
{
int i, slot = -1, count;
unsigned long long now, duration, period, max_period = 0;
- struct ghes_estatus_cache *cache, *slot_cache = NULL, *new_cache;
+ struct ghes_estatus_cache *cache, *new_cache;
+ struct ghes_estatus_cache __rcu *victim;

new_cache = ghes_estatus_cache_alloc(generic, estatus);
if (new_cache == NULL)
@@ -808,13 +803,11 @@ static void ghes_estatus_cache_add(
cache = rcu_dereference(ghes_estatus_caches[i]);
if (cache == NULL) {
slot = i;
- slot_cache = NULL;
break;
}
duration = now - cache->time_in;
if (duration >= GHES_ESTATUS_IN_CACHE_MAX_NSEC) {
slot = i;
- slot_cache = cache;
break;
}
count = atomic_read(&cache->count);
@@ -823,17 +816,28 @@ static void ghes_estatus_cache_add(
if (period > max_period) {
max_period = period;
slot = i;
- slot_cache = cache;
}
}
- /* new_cache must be put into array after its contents are written */
- smp_wmb();
- if (slot != -1 && cmpxchg(ghes_estatus_caches + slot,
- slot_cache, new_cache) == slot_cache) {
- if (slot_cache)
- call_rcu(&slot_cache->rcu, ghes_estatus_cache_rcu_free);
- } else
- ghes_estatus_cache_free(new_cache);
+ if (slot != -1) {
+ /*
+ * Use release semantics to ensure that ghes_estatus_cached()
+ * running on another CPU will see the updated cache fields if
+ * it can see the new value of the pointer.
+ */
+ victim = xchg_release(ghes_estatus_caches + slot,
+ RCU_INITIALIZER(new_cache));
+
+ /*
+ * At this point, victim may point to a cached item different
+ * from the one based on which we selected the slot. Instead of
+ * going to the loop again to pick another slot, let's just
+ * drop the other item anyway: this may cause a false cache
+ * miss later on, but that won't cause any problems.
+ */
+ if (victim)
+ call_rcu(&unrcu_pointer(victim)->rcu,
+ ghes_estatus_cache_rcu_free);
+ }
rcu_read_unlock();
}