RE: [PATCH v7 2/4] topology: use bin_attribute to break the size limitation of cpumap ABI

From: Song Bao Hua (Barry Song)
Date: Fri Jul 16 2021 - 20:16:57 EST




> -----Original Message-----
> From: Yury Norov [mailto:yury.norov@xxxxxxxxx]
> Sent: Saturday, July 17, 2021 8:04 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: gregkh@xxxxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> andriy.shevchenko@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> dave.hansen@xxxxxxxxx; linux@xxxxxxxxxxxxxxxxxx; rafael@xxxxxxxxxx;
> rdunlap@xxxxxxxxxxxxx; agordeev@xxxxxxxxxxxxx; sbrivio@xxxxxxxxxx;
> jianpeng.ma@xxxxxxxxx; valentin.schneider@xxxxxxx; peterz@xxxxxxxxxxxxx;
> bristot@xxxxxxxxxx; guodong.xu@xxxxxxxxxx; tangchengchang
> <tangchengchang@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>;
> yangyicong <yangyicong@xxxxxxxxxx>; tim.c.chen@xxxxxxxxxxxxxxx; Linuxarm
> <linuxarm@xxxxxxxxxx>; tiantao (H) <tiantao6@xxxxxxxxxxxxx>
> Subject: Re: [PATCH v7 2/4] topology: use bin_attribute to break the size
> limitation of cpumap ABI
>
> On Fri, Jul 16, 2021 at 08:49:58AM +0000, Song Bao Hua (Barry Song) wrote:
> > Hi Yury,
> > Not sure if I have totally got your idea. But please see if the below
> > is closer to what you prefer.
> >
> > I haven't really tested it. But this approach can somehow solve the
> > problem you mentioned(malloc/free and printing is done 1000times for
> > a 1MB buffer which is read 1K each time).
> >
> > Bitmap provides some API to alloc and return print_buf:
> >
> > +ssize_t bitmap_get_print_buf(bool list, char **buf, const unsigned long
> *maskp,
> > + int nmaskbits)
> > +{
> > + const char *fmt = list ? "%*pbl\n" : "%*pb\n";
> > + ssize_t size;
> > +
> > + size = snprintf(NULL, 0, fmt, nmaskbits, maskp);
> > + *buf = kmalloc_track_caller(size + 1, GFP_KERNEL);
> > + scnprintf(*buf, size + 1, fmt, nmaskbits, maskp);
> > +
> > + return size + 1;
> > +}
> > +
> > +static inline ssize_t
> > +cpumap_get_print_buf(bool list, char **buf, const struct cpumask *mask)
> > +{
> > + return bitmap_get_print_buf(list, buf, cpumask_bits(mask),
> > + nr_cpu_ids);
> > +}
> > +
> > +struct bitmap_print_buf
> > +{
> > + char *buf;
> > + ssize_t size;
> > +};
> > +
> >
> > In bin_attribute, move to get and save the buffer while sysfs entry is
> > read at the first time and free it when file arrives EOF:
> >
> > #define define_id_show_func(name) \
> > static ssize_t name##_show(struct device *dev, \
> > struct device_attribute *attr, char *buf) \
> > @@ -27,9 +53,27 @@ static ssize_t name##_read(struct file *file, struct kobject
> *kobj, \
> > loff_t off, size_t count) \
> > { \
> > struct device *dev = kobj_to_dev(kobj); \
> > - \
> > - return cpumap_print_to_buf(false, buf, topology_##mask(dev->id),
> \
> > - off, count); \
> > + struct bitmap_print_buf *bmb = dev_get_drvdata(dev); \
> > + if (!bmb) { \
> > + bmb = devm_kzalloc(dev, sizeof(*bmb), GFP_KERNEL); \
> > + if (!bmb) \
> > + return -ENOMEM; \
> > + dev_set_drvdata(dev, bmb); \
> > + } \
> > + /* for the 1st time, getting the printed buffer */ \
> > + if (!bmb->buf) \
> > + bmb->size = cpumap_get_print_buf(false, &bmb->buf, \
> > + topology_##mask(dev->id)); \
> > + /* when we arrive EOF, free the printed buffer */ \
> > + if (off >= bmb->size) { \
> > + kfree(bmb->buf); bmb->buf = NULL;
> \
> > + return 0; \
> > + } \
> > + /* while a large printed buffer is read many times, we reuse \
> > + * the buffer we get at the 1st time \
> > + */ \
> > + strncpy(buf, bmb->buf + off, count); \
> > + return min(count, bmb->size - off); \
> > } \
> > \
> > This means a huge change in drivers though I am not sure Greg is
> > a fan of this approach. Anyway, "1000 times" is not a real case.
> > Typically we will arrive EOF after one time.
>
> Not a real case in your driver doesn't mean not a real case at all.
> You're adding your code to the very basic core layer, and so you'd
> consider all possible scenarios, not only your particular one.
>

Generally yes. And generally I agree with your point. That point is
exactly what I have always adhered to in software design. But my point
I have been arguing is that the new APIs are for sysfs ABI only. So it
is not particular one, it is common.

In my understanding, only ABI things to users will need to print bitmap
to mask or list. Rarely a kernel module will need it. Kernel modules
would be running real and/or/andnot bit operations on binary bitmap
directly.

Anyway, I'm glad to take a step in the way you prefer.

> On the other hand, if you add your function(s) at drivers/base/node.c
> and explain that O(nbits**2) 'is not a real case' in _this_ driver -
> I think it will be acceptable. Maybe this is your choice...
>
> > Thanks
> > Barry
>
> > Not sure if I have totally got your idea. But please see if the below
> > is closer to what you prefer.
> >
> > I haven't really tested it. But this approach can somehow solve the
> > problem you mentioned(malloc/free and printing is done 1000times for
> > a 1MB buffer which is read 1K each time).
> >
> > Bitmap provides some API to alloc and return print_buf:
>
> I'm not too familar to the topology things, and in fact never exposed
> anything thru the sysfs.
>
> From general knowledge, it's better to allocate memory for the string
> on file creation to avoid situation when kernel has no memory to allocate
> it when user tries to read.
>
> So from my perspective, the most straightforward solution would be:
>
> register(cpumask)
> {
> size_t max_size = cpumap_max_string_size(list, cpumask)
> void *buf = kmalloc(max_size, ...);
>
> sysfs_create_file(..., buf)
> }
>
> unregister()
> {
> kfree(buf);
> }
>
> show()
> {
> snprintf(buf, max_size, "*pbl", cpumask);
> }
>

Generally good idea. However, for sysfs ABI entries, it might not be
that true.

A sysfs entry might never be read for its whole life. As I explained
before, a sysfs entry - especially for list, is randomly "cat" by users.
Many of them won't be read forever. And after they are read once, they
will probably never be read again. The operations to read ABI could be
random and rare. Performance wouldn't be a concern.

To avoid holding the memory which might never be used, it is better to
allocate and free the memory during runtime. I mean to allocate in show()
and free in show(), aka, to do it on demand.

For example, for a server with 256CPU and each cpu has dozens of sysfs ABI
entries, only a few of sysfs list entries might be randomly "cat" by users.
Holding 256*entries memory doesn't look good.

> This would require to add bitmap_max_string_size(list, bitmap, nbits),
> but it's O(1), and I think, others will find it helpful.

What about getting size and memory at the same time?

ssize_t bitmap_get_print_buf(bool list, char **buf, const unsigned long
*maskp, int nmaskbits)

ssize_t cpumap_get_print_buf(bool list, char **buf, const struct cpumask *mask);

This API returns the size of printed buffer, and it also gets the
printed result saved in *buf. Then drivers don't need to do three
steps:

1. get cpumap buffer size which is your cpumap_max_string_size()
2. allocate memory for buffer according to size got in step 1
3. print bitmap(cpumap) to buffer by "pbl"

It will only need to call bitmap_get_print_buf() and all three
things are done inside bitmap_get_print_buf().

How to use the size and memory allocated in cpumap_get_print_buf
will be totally up to users.

The other benefit for this is that if we get string size during initialization,
and then we print in show() entries, the size got at the beginning might be not
enough as system topology might have changed. Sysfs ABI reflects the status of
system at this moment.

>
> Again, I'm not professional with sysfs, and fully admit that it might
> be wrong.

Never mind.

Thanks
Barry