Re: [RFC][PATCH] ipc: Remove IPCMNI

From: Manfred Spraul
Date: Thu Mar 29 2018 - 04:47:54 EST


Hello together,

On 03/29/2018 04:14 AM, Davidlohr Bueso wrote:
Cc'ing mtk, Manfred and linux-api.

See below.

On Thu, 15 Mar 2018, Waiman Long wrote:

On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
Waiman Long <longman@xxxxxxxxxx> writes:

On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
The define IPCMNI was originally the size of a statically sized array in
the kernel and that has long since been removed. Therefore there is no
fundamental reason for IPCMNI.

The only remaining use IPCMNI serves is as a convoluted way to format
the ipc id to userspace. It does not appear that anything except for
the CHECKPOINT_RESTORE code even cares about this variety of assignment
and the CHECKPOINT_RESTORE code only cares about this weirdness because
it has to restore these peculiar ids.

My assumption is that if an array is recreated, it should get a different id.
ÂÂÂ a=semget(1234,,);
ÂÂÂ semctl(a,,IPC_RMID);
ÂÂÂ b=semget(1234,,);
now a!=b.

Rational: semop() calls only refer to the array by the id.
If there is a stale process in the system that tries to access the "old" array and the new array has the same id, then the locking gets corrupted.
Therefore make the assignment of ipc ids match the description in
Advanced Programming in the Unix Environment and assign the next id
until INT_MAX is hit then loop around to the lower ids.

Ok, sounds good.
That way we really cycle through INT_MAX, right now a==b would happen after 128k RMID calls.
This can be implemented trivially with the current code using idr_alloc_cyclic.

Is there a performance impact?
Right now, the idr tree is only large if there are lots of objects.
What happens if we have only 1 object, with id=INT_MAX-1?

semop() that do not sleep are fairly fast.
The same applies for msgsnd/msgrcv, if the message is small enough.

@Davidlohr:
Do you know if there are application that frequently call semop() and it doesn't have to sleep?
From the scalability that was pushed into the kernel, I assume that this exists.

I have myself only checked postgresql, and postgresql always sleeps.
(and this was long ago)
To make it possible to keep checkpoint/restore working I have renamed
the sysctls from xxx_next_id to xxx_nextid. That is enough change that
a smart CRIU implementation can see that what is exported has changed,
and act accordingly. New kernels will be able to restore the old id's.

This code still needs some real world testing to verify my assumptions.
And some work with the CRIU implementations to actually add the code
that deals with the new for of id assignment.

It means that all existing checkpoint/restore application will not work with a new kernel.
Everyone must first update the checkpoint/restore application, then update the kernel.

Is this acceptable?

--
ÂÂÂ Manfred