Re: [PATCH v3 6/6] RISC-V: Do not use cpumask data structure for hartid bitmap

From: Geert Uytterhoeven
Date: Fri Jan 28 2022 - 03:39:58 EST

Next message: Joe Perches: "Re: [PATCH] scsi: megaraid: cleanup formatting of megaraid"
Previous message: Tianjia Zhang: "[PATCH] crypto: tcrypt - remove all multibuffer ahash tests"
In reply to: Andreas Schwab: "Re: [PATCH v3 6/6] RISC-V: Do not use cpumask data structure for hartid bitmap"
Next in thread: Geert Uytterhoeven: "Re: [PATCH v3 6/6] RISC-V: Do not use cpumask data structure for hartid bitmap"

Hi Atish,

On Fri, Jan 28, 2022 at 1:13 AM Atish Patra <atishp@xxxxxxxxxxxxxx> wrote:
> On Thu, Jan 27, 2022 at 12:48 AM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
>> On Thu, Jan 27, 2022 at 2:02 AM Atish Patra <atishp@xxxxxxxxxxxxxx> wrote:
>> > On Wed, Jan 26, 2022 at 1:10 AM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
>> > > On Wed, Jan 26, 2022 at 9:28 AM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
>> > > > On Wed, Jan 26, 2022 at 3:21 AM Atish Patra <atishp@xxxxxxxxxxxxxx> wrote:
>> > > > > On Tue, Jan 25, 2022 at 2:26 PM Jessica Clarke <jrtc27@xxxxxxxxxx> wrote:
>> > > > > > On 20 Jan 2022, at 09:09, Atish Patra <atishp@xxxxxxxxxxxx> wrote:
>> > > > > > > Currently, SBI APIs accept a hartmask that is generated from struct
>> > > > > > > cpumask. The cpumask data structure can hold up to NR_CPUS values. Thus, it
>> > > > > > > is not the correct data structure for hartids, as a hartid can be higher
>> > > > > > > than NR_CPUS on platforms with sparse or discontiguous hartids.
>> > > > > > >
>> > > > > > > Remove all association between hartid mask and struct cpumask.
>> > > > > > >
>> > > > > > > Reviewed-by: Anup Patel <anup@xxxxxxxxxxxxxx> (For Linux RISC-V changes)
>> > > > > > > Acked-by: Anup Patel <anup@xxxxxxxxxxxxxx> (For KVM RISC-V changes)
>> > > > > > > Signed-off-by: Atish Patra <atishp@xxxxxxxxxxxx>
>> > > >
>> > > > > I am yet to reproduce it on my end.
>> > > > > @Geert Uytterhoeven: can you please try the below diff on your end.
>> > > >
>> > > > Unfortunately it doesn't fix the issue for me.
>> > > >
>> > > > /me debugging...
>> > >
>> > > Found it: after this commit, the SBI_EXT_RFENCE_REMOTE_FENCE_I and
>> > > SBI_EXT_RFENCE_REMOTE_SFENCE_VMA ecalls are now called with
>> > > hmask = 0x8000000000000001 and hbase = 1 instead of hmask = 3 and
>> > > hbase = 0.
>> > >
>> > > cpuid 1 maps to hartid 0
>> > > cpuid 0 maps to hartid 1
>> > >
>> > > __sbi_rfence_v02:364: cpuid 1 hartid 0
>> > > __sbi_rfence_v02:377: hartid 0 hbase 1
>> > > hmask |= 1UL << (hartid - hbase);
>> > >
>> > > oops
>> > >
>> > > __sbi_rfence_v02_call:303: SBI_EXT_RFENCE_REMOTE_FENCE_I hmask
>> > > 8000000000000001 hbase 1
>> > >
>> >
>> > Ahh yes. hmask will be incorrect if the boot cpu (cpu 0) has a higher
>> > hartid and it is trying to do a remote tlb flush/IPI
>> > to a lower hartid. We should generate the hartid array before the loop.
>> >
>> > Can you try this diff? It seems to work for me across multiple boot
>> > cycles on the unleashed.
>> >
>> > You can find the patch here as well
>> > https://github.com/atishp04/linux/commits/v5.17-rc1

>> > @@ -345,13 +368,21 @@ static int __sbi_rfence_v02(int fid, const
>> > struct cpumask *cpu_mask,
>> > unsigned long arg4, unsigned long arg5)
>> > {
>> > unsigned long hartid, cpuid, hmask = 0, hbase = 0;
>> > - int result;
>> > + int result, index = 0, max_index = 0;
>> > + unsigned long hartid_arr[NR_CPUS] = {0};
>>
>> That's up to 256 bytes on the stack. And more if the maximum
>> number of cores is increased.
>>
>
> Yeah. We can switch to dynamic allocation using kmalloc based on
> the number of bits set in the cpumask.

Even more overhead...

>> > - if (!cpu_mask)
>> > + if (!cpu_mask || cpumask_empty(cpu_mask))
>> > cpu_mask = cpu_online_mask;
>> >
>> > for_each_cpu(cpuid, cpu_mask) {
>> > hartid = cpuid_to_hartid_map(cpuid);
>> > + hartid_arr[index] = hartid;
>> > + index++;
>> > + }
>> > + max_index = index;
>> > + sort(hartid_arr, max_index, sizeof(unsigned long), cmp_ulong, NULL);
>> > + for (index = 0; index < max_index; index++) {
>> > + hartid = hartid_arr[index];
>>
>> That looks expensive to me.
>>
>> What about shifting hmask and adjusting hbase if a hartid is
>> lower than the current hbase?
>
> That will probably work for current systems, but it will fail when we have hartid > 64,
> as the logic below assumes that the hartids are in order. We can have a situation
> where two consecutive cpuids map to hartids that require two invocations of the SBI call
> because the number of harts exceeds BITS_PER_LONG.

If the number of harts exceeds BITS_PER_LONG, you always need multiple
calls, right?

I think the below should work:

--- a/arch/riscv/kernel/sbi.c
+++ b/arch/riscv/kernel/sbi.c
@@ -249,7 +249,7 @@ static void __sbi_set_timer_v02(uint64_t stime_value)

 static int __sbi_send_ipi_v02(const struct cpumask *cpu_mask)
 {
-	unsigned long hartid, cpuid, hmask = 0, hbase = 0;
+	unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
 	struct sbiret ret = {0};
 	int result;

@@ -258,16 +258,27 @@ static int __sbi_send_ipi_v02(const struct cpumask *cpu_mask)

 	for_each_cpu(cpuid, cpu_mask) {
 		hartid = cpuid_to_hartid_map(cpuid);
-		if (hmask &&
-		    (hartid < hbase || hartid >= hbase + BITS_PER_LONG)) {
-			ret = sbi_ecall(SBI_EXT_IPI, SBI_EXT_IPI_SEND_IPI,
-					hmask, hbase, 0, 0, 0, 0);
-			if (ret.error)
-				goto ecall_failed;
-			hmask = 0;
+		if (hmask) {
+			if (hartid + BITS_PER_LONG <= htop ||
+			    hartid >= hbase + BITS_PER_LONG) {
+				ret = sbi_ecall(SBI_EXT_IPI,
+						SBI_EXT_IPI_SEND_IPI, hmask,
+						hbase, 0, 0, 0, 0);
+				if (ret.error)
+					goto ecall_failed;
+				hmask = 0;
+			} else if (hartid < hbase) {
+				/* shift the mask to fit lower hartid */
+				hmask <<= hbase - hartid;
+				hbase = hartid;
+			}
 		}
-		if (!hmask)
+		if (!hmask) {
 			hbase = hartid & -BITS_PER_LONG;
+			htop = hartid;
+		} else if (hartid > htop) {
+			htop = hartid;
+		}
 		hmask |= 1UL << (hartid - hbase);
 	}

@@ -344,7 +355,7 @@ static int __sbi_rfence_v02(int fid, const struct cpumask *cpu_mask,
 			    unsigned long start, unsigned long size,
 			    unsigned long arg4, unsigned long arg5)
 {
-	unsigned long hartid, cpuid, hmask = 0, hbase = 0;
+	unsigned long hartid, cpuid, hmask = 0, hbase = 0, htop = 0;
 	int result;

 	if (!cpu_mask || cpumask_empty(cpu_mask))
@@ -352,16 +363,26 @@ static int __sbi_rfence_v02(int fid, const struct cpumask *cpu_mask,

 	for_each_cpu(cpuid, cpu_mask) {
 		hartid = cpuid_to_hartid_map(cpuid);
-		if (hmask &&
-		    (hartid < hbase || hartid >= hbase + BITS_PER_LONG)) {
-			result = __sbi_rfence_v02_call(fid, hmask, hbase,
-						       start, size, arg4, arg5);
-			if (result)
-				return result;
-			hmask = 0;
+		if (hmask) {
+			if (hartid + BITS_PER_LONG <= htop ||
+			    hartid >= hbase + BITS_PER_LONG) {
+				result = __sbi_rfence_v02_call(fid, hmask,
+						hbase, start, size, arg4, arg5);
+				if (result)
+					return result;
+				hmask = 0;
+			} else if (hartid < hbase) {
+				/* shift the mask to fit lower hartid */
+				hmask <<= hbase - hartid;
+				hbase = hartid;
+			}
+		}
+		if (!hmask) {
+			hbase = hartid;
+			htop = hartid;
+		} else if (hartid > htop) {
+			htop = hartid;
 		}
-		if (!hmask)
-			hbase = hartid & -BITS_PER_LONG;
 		hmask |= 1UL << (hartid - hbase);
 	}

Another, simpler solution would be to just round hbase down to a
multiple of BITS_PER_LONG (32 or 64):

--- a/arch/riscv/kernel/sbi.c
+++ b/arch/riscv/kernel/sbi.c
@@ -258,16 +258,16 @@ static int __sbi_send_ipi_v02(const struct cpumask *cpu_mask)

 	for_each_cpu(cpuid, cpu_mask) {
 		hartid = cpuid_to_hartid_map(cpuid);
-		if (hmask && ((hbase + BITS_PER_LONG) <= hartid)) {
+		if (hmask &&
+		    (hartid < hbase || hartid >= hbase + BITS_PER_LONG)) {
 			ret = sbi_ecall(SBI_EXT_IPI, SBI_EXT_IPI_SEND_IPI,
 					hmask, hbase, 0, 0, 0, 0);
 			if (ret.error)
 				goto ecall_failed;
 			hmask = 0;
-			hbase = 0;
 		}
 		if (!hmask)
-			hbase = hartid;
+			hbase = hartid & -BITS_PER_LONG;
 		hmask |= 1UL << (hartid - hbase);
 	}

@@ -352,16 +352,16 @@ static int __sbi_rfence_v02(int fid, const struct cpumask *cpu_mask,

 	for_each_cpu(cpuid, cpu_mask) {
 		hartid = cpuid_to_hartid_map(cpuid);
-		if (hmask && ((hbase + BITS_PER_LONG) <= hartid)) {
+		if (hmask &&
+		    (hartid < hbase || hartid >= hbase + BITS_PER_LONG)) {
 			result = __sbi_rfence_v02_call(fid, hmask, hbase,
 						       start, size, arg4, arg5);
 			if (result)
 				return result;
 			hmask = 0;
-			hbase = 0;
 		}
 		if (!hmask)
-			hbase = hartid;
+			hbase = hartid & -BITS_PER_LONG;
 		hmask |= 1UL << (hartid - hbase);
 	}

But that means multiple SBI calls if you have e.g. hartids 1-64, as
hartid 64 falls outside the window that starts at hbase 0.
The shifted mask solution doesn't suffer from that.
Neither solution sorts the CPUs, so both are suboptimal in case of
hartid numberings like 0, 64, 1, 65, ...

What do you think?
Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
