Re: [PATCH 4/5] Add a sysconf syscall

From: Ingo Molnar
Date: Tue May 17 2011 - 07:25:42 EST



Andi,

AFAICS you have not replied to my detailed analysis below yet (you only replied
to a sub-reply, not addressing most of the points i raised) - should i expect a
reply in the days to come, or will you follow your past pattern of
unpredictably ignoring review concerns?

Firstly i think the obvious canonical fix is to expose the number of CPUs in a
separate file in /proc and not in /proc/stat which is expensive to construct
and parse. Also arguably sleepycat should not need to read the number of CPUs
all the time - why does it do that?

Secondly, even assuming that fast access to all these system globals is
desirable i question the choice to include rlimits and nsems_max: they are not
really system globals (so why are they in sysconf() to begin with?) and they
can already be queried cheaply using getrlimits()/semctl().

These are the values that make the sysconf() information per task and
complicate the data structure.

If we look at all non-rlimit values:

_SC_CLK_TCK: return HZ;
_SC_UIO_MAXIOV: return UIO_MAXIOV;
_SC_PAGESIZE: return PAGE_SIZE;
_SC_SYMLOOP_MAX: return SYMLOOP_MAX;
_SC_PHYS_PAGES: return totalram_pages;
_SC_AVPHYS_PAGES: return nr_free_pages();
_SC_NPROCESSORS_CONF: return num_possible_cpus();
_SC_NPROCESSORS_ONLN: return num_online_cpus();
_SC_NGROUPS_MAX: return NGROUPS_MAX;

These are all global, read-mostly values.

Were these rlimits not included we could do a sane global vdso page with the
NPROCESSORS number exposed (which is what the sleepycat example you cited
really needs it appears) and expose it to glibc using a new auxiliary entry (or
via a vsyscall that returns the pointer).

It would become very simple and no COW trickery is needed in that case either.

See my arguments in more detail below.

Thanks,

Ingo

* Ingo Molnar <mingo@xxxxxxx> wrote:

> * Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
>
> > > What glibc does (opening /proc/stat) is rather stupid, but i think your
> > > syscall
> >
> > I don't think it has any other choice today. [...]
>
> Sure it has a choice: most of the sysconf values are global so creating a
> permanent mmap()-able data in /tmp or elsewhere and mapping it unless it's
> inaccessible isnt that particularly hard to cache most of the system-wide
> constants, now is it?
>
> The CPU count could be updated from CPU hotplug events.
>
> rlimits can be queried using getrlimit().
>
> If at that point glibc said to us: "hey, lets work in making it even faster"
> then there would be no reason for us to criticise glibc. Right now gibc does
> not even *try* to be smart about it.
>
> > [...] So if anything is "stupid" it is the kernel for not providing efficient
> > interfaces for this.
> >
> > > Note that these are mostly constant or semi-constant values that are
> > > updated very rarely:
> >
> > That's not true. Most of them are dynamic. Take a look at the patch.
> > Also several of those have changed recently.
>
> As i said they are mostly constant or semi-constant values that are updated
> very rarely.
>
> If you think that i am wrong then do me the basic courtesy of mentioning the
> examples that you think disprove my claim, instead of broadly pointing me to
> your patch ...
>
> > > If glibc is stupid and reads /proc/stat to receive something it could cache
> > > or mmap() itself then hey, did you consider fixing glibc or creating a sane
> > > libc?
> >
> > Caching doesn't help when you have a workload that exec()s a lot. Also some
> > of these values can change at runtime.
>
> Here you modify your claim. Now it's not 'dynamic' anymore but 'can change'?
>
> Which one is it now, "mostly constant or semi-constant values that are updated
> very rarely" as i claim, "dynamic" as you first claimed or "can change" as you
> claim here (which is also pretty ambiguous)?
>
> > > If we *really* want to offer kernel help for these values even then your
> > > solution is still wrong: then the proper solution would be to define a
> > > standard *data* structure and map it as a vsyscall *data* page -
> > > essentially a kernel-guaranteed data mmap(), with no extra syscall needed!
> >
> > That's quite complicted because several of those are dynamically computed
> > based on other values. Sometimes they are also not tied to the mm_struct --
> > like the vsyscall is. For example some of the rlimits are per task, not VM.
> > Basically your proposal doesn't interact well with clone().
>
> Threads with different rlimits but shared VM are extreme rarities.
>
> Could we please concentrate on the common case? A '-1' in the data page can let
> the code fall back to some slow path.
>
> Also note that rlimit itself already has an interface to query them:
> getrlimit(). So if you do not want the complexity of caching rlimits in the
> data page you do not have to start with that complexity.
>
> [ But it can be done: modifying the rlimit (which predominantly only happens in
> the login process) is rare and happens in a parent task. ]
>
> > Even if we ignored that semantic problem it would need another writable page
> > per task because the values cannot be shared.
>
> Sure most of the values can be shared.
>
> Most of them are exactly one of a low number of variants for all tasks in the
> system, for typical Linux bootups. I suspect if the data page was COW-ed but
> inherited across exec() it would behave exactly the way it should be: inherited
> by all tasks but privatized if a task modifies it for itself and all children.
>
> Also, the first step could be simplified by not exposing rlimits - as rlimits
> are already exposed via other channels.
>
> > Also I never liked the idea of having more writable pages per task, [...]
>
> If you limit it to your own faulty implementation then i'm not surprised that
> you do not like it.
>
> > [...] It increases the memory footprint of a single process more. Given a 4K
> > page is not a lot, but lots of 4K pages add up. Some workloads like to have
> > lots of small processes and I think that's a valuable use case Linux should
> > stay lean and efficient at.
> >
> > [OK in theory one could do COW for the page and share it but that would get
> > really complicated]
>
> Why would it get complicated? It's not writable to user-space, that's all that
> is special about it.
>
> > I also don't think it's THAT performance critical to justify the vsyscall.
>
> You apparently did not understand the gist of my point: it's the CONCEPT of
> adding a crappy catch-all sysconf() interface that sucks IMHO. It's simply bad
> taste.
>
> If you want to expose data then expose the data intelligently, not some poor
> system call interface that is also slower.
>
> > The simple syscall is already orders of magnitude faster than /proc, and
> > seems to solve the performance problems we've seen completely.
>
> A handful of horse manure is less stinky than a big pile of it, still i wouldnt
> want to eat either.
>
> > It's also simple and straight forward and simple to userstand and maintain. I
> > doubt any of that would apply to a vsyscall solution.
>
> Note: i did not suggest a vsyscall, but a vsyscall *data area*. There's a big
> difference between the two!
>
> It could be passed down to user-space using a new auxiliary entry (see
> fs/binfmt_elf.c), as it's really part of a task's environment conceptually.
>
> > I don't think the additional effort for a vsyscall would be worth it at this
> > point, unless there's some concrete example that would justify it. Even then
> > it wouldn't work for some of the values.
> >
> > Also a vsyscall doesn't help on non x86 anyways.
>
> There's nothing x86 about aux entries.
>
> > As for efficiency: I thought about doing a batched interface where
> > the user could pass in an array of values to fill in. But at least for the
> > workloads I looked at the application usually calls sysconf() where
> > the array size would be always 1. And that's the standard interface.
> > This might be still worth doing, but I would like to see a concrete
> > use case first.
> >
> > > That could have other uses as well in the future.
> >
> > Hmm for what?
>
> *Iff* the concensus is that we are fine with a new page per task/thread then we
> could use it for all sorts of nifty things like the current CPU id for example.
>
> > Note we already have a fast mechanism to pass some thing to glibc in the aux
> > vector.
>
> So when you got so far in your reply why did you not delete your above (false)
> outrage about the vsyscall, which i never suggested and which you thus forced
> me to reply to?
>
> > > That way it would much faster than your current code btw.
> > >
> > > So unless there are some compelling arguments in favor of sys_sysconf()
> > > that i missed, i have to NAK this:
> >
> > Well see above for lots of reasons you missed. They are understandable
> > mistakes for someone who first looks at the problem though. I'll attempt to
> > improve the documentation next time.
>
> I don't think your condescending tone towards me is warranted or fair, i
> offered a fair technical criticism of your patch series. Why are you
> attacking me like this?
>
> Thanks,
>
> Ingo

--
Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/