RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

From: Song Bao Hua (Barry Song)
Date: Thu Apr 29 2021 - 18:32:40 EST




> -----Original Message-----
> From: Dave Hansen [mailto:dave.hansen@xxxxxxxxx]
> Sent: Friday, April 30, 2021 9:39 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; tiantao (H)
> <tiantao6@xxxxxxxxxxxxx>; corbet@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx
> Cc: linux-doc@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rafael J.
> Wysocki <rafael@xxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Valentin
> Schneider <valentin.schneider@xxxxxxx>; Dave Hansen
> <dave.hansen@xxxxxxxxxxxxxxx>; Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
>
> On 4/29/21 2:08 PM, Song Bao Hua (Barry Song) wrote:
> >> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
> >> something?
> > This kind of cpu list ABIs have been there for many years but have
> > never been documented well.
> >
> > We have two ABIs:
> > xxx_cpus - in format like 3333333333
> > xxx_cpus_list - in format like 0,3,5,7,9,11,13....
> >
> > xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't
> > include any more useful information than xxx_cpus.
> >
> > xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS
> > in kconfig nowadays.
> >
> > if people all agree the trimmed list is a break of ABI, I think we may
> > totally remove this list. For these days, this list probably has never
> > overflowed but literally this could happen.
> >
> > thoughts?
>
> From what Greg said, it sounds like removing the BUILD_BUG_ON(), making
> it a binary sysfs file, and making it support arbitrary lengths is the
> way to go.

I am actually more concerned on xxx_cpus_list than xxx_cpus.

xxx_cpus has never overflowed. Though there is a BUILD_BUG_ON(), but the
maximum NR_CPUS is 8096, it is still far from overflow.
8096 /32 * 9 = 2277
as 2277 < 4096

While NR_CPUS gets to ~14500, for a 4KB page, xxx_cpus will overflow.
But I don't know when hardware will reach there. If we reach there,
the existing code to describe topology and schedule tasks might also
need rework.

On the other hand, lscpu, hwloc, numactl etc are using the existing
hex bitmap ABI:

$ strace lscpu 2>&1 | grep topology
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/thread_siblings", F_OK) = 0
openat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/thread_siblings",
O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/core_siblings",
O_RDONLY|O_CLOEXEC) = 3
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/book_siblings", F_OK) = -1
ENOENT (No such file or directory)
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu1/topology/thread_siblings", F_OK) = 0
openat(AT_FDCWD,
...

$ strace numactl --hardware 2>&1 | grep cpu
openat(AT_FDCWD, "/sys/devices/system/cpu",
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3

If we move to binary, it means we have to change those applications.

For this moment, I'd argue we keep cpu bitmap ABI as is and defer this
issue to when we really get so many cores.

Thanks
Barry