On Mon, 11 Dec 2023 22:51:04 -0500[...]
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
For this first issue, here is the race:
rb_time_cmpxchg()
[...]
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
return false;
if (!rb_time_read_cmpxchg(&t->top, top, top2))
return false;
<interrupted before updating bottom>
__rb_time_read()
[...]
do {
c = local_read(&t->cnt);
top = local_read(&t->top);
bottom = local_read(&t->bottom);
msb = local_read(&t->msb);
} while (c != local_read(&t->cnt));
*cnt = rb_time_cnt(top);
/* If top and msb counts don't match, this interrupted a write */
if (*cnt != rb_time_cnt(msb))
return false;
^ this check fails to catch that "bottom" is still not updated.
So the old "bottom" value is returned, which is wrong.
Ah, OK that makes more sense. Yeah, if I had the three words from the
beginning, I would have tested to make sure they all match and not just the
two :-p
As this would fix a commit that tried to fix this before!
f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
FYI, that would be the "Fixes" for this patch.
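To make the "all three must match" idea concrete, here is a minimal userspace sketch. It assumes the 2-bit counter sits in bits 30-31 of each word, as in the comment being patched. The sketch_* names are made up for illustration and are not the kernel helpers:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_TIME_SHIFT 30	/* assumed: counter sits above 30 value bits */

/* Extract the 2-bit update counter from one field. */
static inline unsigned int sketch_cnt(unsigned long word)
{
	return (word >> SKETCH_TIME_SHIFT) & 3;
}

/* Trust a read only when msb, top and bottom all carry the same counter. */
static inline bool sketch_fields_consistent(unsigned long msb,
					    unsigned long top,
					    unsigned long bottom)
{
	unsigned int cnt = sketch_cnt(top);

	return cnt == sketch_cnt(msb) && cnt == sketch_cnt(bottom);
}

/* The race above: msb and top already at count 2, bottom still at count 1. */
static inline void sketch_demo(void)
{
	assert(!sketch_fields_consistent(2UL << 30, (2UL << 30) | 5,
					 (1UL << 30) | 7));
}
```

Comparing only top against msb would accept the state used in sketch_demo(), which is exactly the stale "bottom" case.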
- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter
and produces corrupted time stamps. This is _not_ fixed by this patch.
Except that it's not 4 bits that is compared, but 32 bits.
struct rb_time_struct {
local_t cnt;
local_t top;
local_t bottom;
local_t msb;
};
The full local_t (32 bits) is used for synchronization. But the other
elements do get extra bits and there still might be some issues, but not as
severe as you stated here.
Let's bring up the race scenario I spotted:
rb_time_cmpxchg()
[...]
/* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
return false;
if (val != expect)
return false;
<interrupted by 4x rb_time_set() or rb_time_cmpxchg()>
<iret>
cnt = local_read(&t->cnt);
if ((cnt & 3) != cnt2)
return false;
^ here (cnt & 3) == cnt2, but @val contains outdated data. This
means the piecewise rb_time_read_cmpxchg() that follow will
derive expected values from the outdated @val.
Ah. Of course this would be fixed if we did the local_read(&t->cnt)
*before* everything else.
cnt2 = cnt + 1;
rb_time_split(val, &top, &bottom, &msb);
top = rb_time_val_cnt(top, cnt);
bottom = rb_time_val_cnt(bottom, cnt);
^ top, bottom, and msb contain outdated data, which do not
match cnt due to 2-bit overflow.
rb_time_split(set, &top2, &bottom2, &msb2);
top2 = rb_time_val_cnt(top2, cnt2);
bottom2 = rb_time_val_cnt(bottom2, cnt2);
if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
return false;
^ This @cnt cmpxchg succeeds because the re-read cnt
is used as the expected value.
Sure. And I believe you did find another bug. If we read the cnt first,
before reading val, then it would not be outdated.
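A small userspace sketch of why the ordering matters, assuming only what is described above: the per-field tag is (cnt & 3), while t->cnt itself is a full word. The sketch_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Low two bits of the update counter, as stored in each field. */
static inline unsigned long sketch_cnt2(unsigned long cnt)
{
	return cnt & 3;
}

/* Late check: only the 2-bit tag is compared against what the read saw. */
static inline bool sketch_late_check_passes(unsigned long cnt_now,
					    unsigned long cnt2_seen)
{
	return sketch_cnt2(cnt_now) == cnt2_seen;
}

/* Early sample: the full counter taken before the read is compared later. */
static inline bool sketch_full_cnt_unchanged(unsigned long cnt_before,
					     unsigned long cnt_now)
{
	return cnt_before == cnt_now;
}

/* Four interrupting updates: the 2-bit tag wraps, the full counter does not. */
static inline void sketch_demo(void)
{
	unsigned long before = 5, after = before + 4;

	assert(sketch_late_check_passes(after, sketch_cnt2(before)));
	assert(!sketch_full_cnt_unchanged(before, after));
}
```

With the counter sampled first, the later cmpxchg on t->cnt uses the old full value as its expected value, so it would only be fooled by a wrap of the whole word.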
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
return false;
if (!rb_time_read_cmpxchg(&t->top, top, top2))
return false;
if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
return false;
^ these cmpxchg have just used the outdated @val as expected
values, even though the content of the rb_time was modified
by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This
means those cmpxchg can fail not only due to being interrupted
by another write or cmpxchg, but also simply due to expected
value mismatch in any of the fields, which will then cause
Yes, it is expected that this will fail for being interrupted any time during
this operation. So it can only fail for being interrupted. How else would
the value be mismatched if this function had not been interrupted?
following __rb_time_read() to fail until a rb_time_set() is done.
How so? If this had failed, it's because it was interrupted by something
that did the write. The point here is to not modify the value if any of
these failed. If any of the cmpxchg() failed, it means whatever interrupted
it did a rb_time_set(), and that means the value will be valid if a
__rb_time_read() was done on it again.
It doesn't need a rb_time_set() in this context to make it valid again.
That's because an interrupting context had already done that.
return true;
So this overflow scenario on top of cmpxchg does not cause corrupted
time stamps, but does cause subsequent __rb_time_read() and rb_time_cmpxchg()
to fail until an eventual rb_time_set().
I still don't see that.
Although, I should also change this to be:
struct rb_time_struct {
local_t cnt;
local_t msb;
local_t top;
local_t bottom;
};
To match the order of bits as mentioned above.
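For reference, the bit order can be sketched in userspace like this. It assumes the layout from the comment in the patch (msb 4 bits, top 30, bottom 30, with a 2-bit counter above each 30-bit value). The sketch_* helpers are stand-ins modeled on rb_time_split() and rb_time_val_cnt(); the real helpers may differ in detail:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SHIFT     30
#define SKETCH_VAL_MASK  ((1UL << SKETCH_SHIFT) - 1)
#define SKETCH_MSB_SHIFT 60

/* Split a 64-bit time stamp into msb (4 bits), top (30) and bottom (30). */
static inline void sketch_split(uint64_t val, unsigned long *top,
				unsigned long *bottom, unsigned long *msb)
{
	*top = (unsigned long)((val >> SKETCH_SHIFT) & SKETCH_VAL_MASK);
	*bottom = (unsigned long)(val & SKETCH_VAL_MASK);
	*msb = (unsigned long)(val >> SKETCH_MSB_SHIFT);
}

/* Tag a field with the low two bits of the update counter. */
static inline unsigned long sketch_val_cnt(unsigned long val, unsigned long cnt)
{
	return (val & SKETCH_VAL_MASK) | ((cnt & 3) << SKETCH_SHIFT);
}

/* Reassemble the 64-bit value, dropping the counter bits. */
static inline uint64_t sketch_join(unsigned long top, unsigned long bottom,
				   unsigned long msb)
{
	uint64_t val = (uint64_t)(msb & 0xf) << SKETCH_MSB_SHIFT;

	val |= (uint64_t)(top & SKETCH_VAL_MASK) << SKETCH_SHIFT;
	val |= bottom & SKETCH_VAL_MASK;
	return val;
}

/* Round trip: splitting, tagging and joining returns the original value. */
static inline void sketch_demo(void)
{
	unsigned long top, bottom, msb;

	sketch_split(0xA123456789ABCDEFULL, &top, &bottom, &msb);
	assert(msb == 0xA);
	assert(sketch_join(sketch_val_cnt(top, 7), sketch_val_cnt(bottom, 7),
			   sketch_val_cnt(msb, 7)) == 0xA123456789ABCDEFULL);
}
```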
static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
unsigned long cnt, top, bottom, msb;
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
/* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
## So the value has to succeed to continue. This is why I don't think order
## matters between them.
return false;
if (val != expect)
## Must also be what was expected
return false;
cnt = local_read(&t->cnt);
## We read the full 32 bits here.
if ((cnt & 3) != cnt2)
## This is mostly a paranoid check. For this to fail, the interrupting
## context had to write a full timestamp that this context expected,
## otherwise the (val != expect) would be true.
As I state in my scenario above, the interrupting context can happen
after the (val != expect) check.
Which I agree should be fixed. That is, we need to have:
static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
unsigned long cnt, top, bottom, msb;
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
+ /* Interrupting writes should make this function fail */
+ cnt = local_read(&t->cnt);
+
/* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
return false;
if (val != expect)
return false;
- cnt = local_read(&t->cnt);
if ((cnt & 3) != cnt2)
return false;
[..]
return false;
cnt2 = cnt + 1;
## We take the 32 bit number and add 1 to it
rb_time_split(val, &top, &bottom, &msb);
top = rb_time_val_cnt(top, cnt);
bottom = rb_time_val_cnt(bottom, cnt);
rb_time_split(set, &top2, &bottom2, &msb2);
top2 = rb_time_val_cnt(top2, cnt2);
bottom2 = rb_time_val_cnt(bottom2, cnt2);
## Now the above takes the value to what was expected and sprinkles the cnt
## on it as "salt"
if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
return false;
## if something came in here, we fail immediately with no corruption. This
## cmpxchg() is not affected by 4 writes
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
return false;
## if we fail here, it means that something came in and wrote all values
## making everything correct again.
if (!rb_time_read_cmpxchg(&t->top, top, top2))
return false;
if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
return false;
## The same is true for all the above.
Not if the interrupting context happens right after the (val != expect) check,
as stated in my scenario.
And is fixed with what I mentioned.
return true;
}
The point is that a cmpxchg() should not corrupt a
write that was done by an interrupting context. The logic can fail if the
cmpxchg wants to update one of the fields to a new number, but the
interrupting write kept it the same 4 times. That is, it did not update the
number.
I'm failing to see how letting a cmpxchg succeed in a case where a store
just happened to write all of its expected values would be a bug ?
Because it could be:
top = 0x1
bottom = 0xffff0000
And the interrupt caused that to be:
top = 0x2
bottom = 0x00000000
But the lower context wanted it to be:
top = 0x1
bottom = 0xffffff00
We don't want the end result to be:
top = 0x1
bottom = 0x00000000
Because of a false positive "match". (Note, the above isn't a good example,
but I'm too tired to think of one that will actually cause the problem. But
I think you can get the gist of it).
And if the nested writes happen between the cmpxchg to top and bottom, and
the cmpxchg to bottom happens to expect exactly the content of the write, then
it would increment the 2-bit cnt of bottom to a value which won't match
top/msb, which would cause following reads to fail.
Yes, if there's a false match (a match that should not have happened), then
yes, it will corrupt the counter and make reads fail. But currently it's
near impossible to get that false match. But I think we should make it
totally impossible to do so.
Yes, the nesting approach might work better than a 2-bit counter for tracking
interruption of reads/cmpxchg by stores/cmpxchg.
Although, it may need at least the LSB of the count too, and we make it
three bits, where the LSB is the LSB of the count and bits 1 and 2 are the
context level. That's because we still need to have the interrupting
context know that the words are in the process of being updated. All it
needs is a toggle, because that bit will go from 0 to 1 in any given
context.
That way, if an irq interrupts a soft irq, it may see:
msb: 0 1 0
top: 0 1 0
bottom: 0 1 1
And know that it interrupted it between top and bottom.
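A rough userspace sketch of that tag, just to pin down the idea. Nothing like this exists in the code yet. It assumes the three bits above are written most significant first (so "0 1 0" is context level 1 with the toggle clear), and the sketch_* names are invented:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 3-bit tag: bit 0 = toggle (LSB of count), bits 1-2 = context. */
static inline unsigned int sketch_tag(unsigned int ctx_level, unsigned int count)
{
	return ((ctx_level & 3) << 1) | (count & 1);
}

/* Context level that last wrote the field. */
static inline unsigned int sketch_tag_ctx(unsigned int tag)
{
	return (tag >> 1) & 3;
}

/* An update is in flight when the three fields do not carry the same tag. */
static inline bool sketch_update_in_progress(unsigned int msb_tag,
					     unsigned int top_tag,
					     unsigned int bottom_tag)
{
	return msb_tag != top_tag || top_tag != bottom_tag;
}

/* The example above: msb and top at "0 1 0", bottom still at "0 1 1". */
static inline void sketch_demo(void)
{
	unsigned int done = sketch_tag(1, 0), stale = sketch_tag(1, 1);

	assert(sketch_update_in_progress(done, done, stale));
	assert(sketch_tag_ctx(stale) == 1);
}
```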
- After a cmpxchg fails between updates to top and msb, a write is
needed before read and cmpxchg can succeed again. I am not entirely
sure the rest of the ring buffer handles this correctly.
Note, a cmpxchg() can only fail if it was interrupted by a higher context.
The higher context would be doing a write for the cmpxchg() to fail. If a
cmpxchg() fails, it means that a higher context has already modified it and
in fact, if a cmpxchg() fails, a read should be guaranteed to succeed if
done after the failure, because the higher context already did the write.
Not in the 2-bit overflow scenario I detailed above.
I still see moving the read of cnt to the beginning as fixing that.
*
- * - Reads may fail if it interrupted a modification of the time stamp.
+ * - Read may fail if it interrupted a modification of the time stamp.
* It will succeed if it did not interrupt another write even if
* the read itself is interrupted by a write.
+ * A read will fail if it follows a cmpxchg which failed between
+ * updates to its top and msb bits, until a write is performed.
+ * (note: this limitation may be unexpected in parts of the
+ * ring buffer algorithm)
* It returns whether it was successful or not.
*
- * - Writes always succeed and will overwrite other writes and writes
+ * - Write always succeeds and will overwrite other writes and writes
Hmm, not sure I agree with the above. It should be plural, as in "All
writes".
Then we should pick either "writes/reads" and "they", or "A write/A read"
and "it", but not a mix.
Where do you see it mixed?
* Other than that, it acts like a normal cmpxchg.
*
- * The 60 bit time stamp is broken up by 30 bits in a top and bottom half
- * (bottom being the least significant 30 bits of the 60 bit time stamp).
+ * The 64-bit time stamp is broken up, from most to least significant,
+ * in: msb, top and bottom fields, of respectively 4, 30, and 30 bits.
*
- * The two most significant bits of each half holds a 2 bit counter (0-3).
+ * The two most significant bits of each field hold a 2-bit counter (0-3).
* Each update will increment this counter by one.
- * When reading the top and bottom, if the two counter bits match then the
- * top and bottom together make a valid 60 bit number.
+ * When reading the top, bottom, and msb fields, if the two counter bits
+ * match, then the combined values make a valid 64-bit number.
+ *
+ * Counter limits. The following situations can generate overflows that
+ * produce corrupted time stamps:
+ *
+ * - A read or a write interrupted by 2^32 writes or cmpxchg.
+ *
+ * - A cmpxchg interrupted by 4 writes or cmpxchg.
+ * (note: this is not sufficient and should be fixed)
Remember, it's not just 4 writes that cause it to fail, but also those 4
writes must have the same value, as the cmpxchg() doesn't just look at the
2 bits, it looks at the rest of the value too.
It would not require all 4 of the writes to store the same value, just the
last one.
Although I detailed an overflow scenario that causes reads to fail after a
partially successful cmpxchg, I'm currently failing to understand how the 4
writes would cause a read to observe an actual corrupted value.
Reads detect whether they happened within a write. So there is no "4 writes" when doing
a read. The read cares about what it interrupted, not what interrupted it.
The order of the cmpxchg that your patch fixed does affect this, because it
missed the "bottom" update.
Having a single toggle and the context level should be sufficient. The two
context bits tell you which context updated the timestamp, which is useful
for knowing it got interrupted, and the toggle bit lets an interrupting
context know the timestamp is being updated.