Re: [PATCH 6/6] selinux: improve symtab string hashing

From: Paul Moore
Date: Wed Sep 13 2023 - 13:46:26 EST


On Thu, Sep 7, 2023 at 1:44 PM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
> On Fri, Aug 18, 2023 at 11:12 AM Christian Göttsche
> <cgzones@xxxxxxxxxxxxxx> wrote:
> >
> > The number of buckets is calculated by performing a binary AND against
> > the mask of the hash table, which is one less than its size (which is a
> > power of two). This leads to all top bits being discarded, requiring
> > for short or similar inputs a hash function with a good avalanche
> > effect.
> >
> > Use djb2a:
> >
> > # current
> > common prefixes: 7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
> > classes: 134 entries and 100/256 buckets used, longest chain length 5, sum of chain length^2 234
> > roles: 15 entries and 6/16 buckets used, longest chain length 5, sum of chain length^2 57
> > types: 4448 entries and 3016/8192 buckets used, longest chain length 41, sum of chain length^2 14922
> > users: 7 entries and 3/8 buckets used, longest chain length 3, sum of chain length^2 17
> > bools: 306 entries and 221/512 buckets used, longest chain length 4, sum of chain length^2 524
> > levels: 1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
> > categories: 1024 entries and 400/1024 buckets used, longest chain length 4, sum of chain length^2 2740
> >
> > # patch
> > common prefixes: 7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
> > classes: 134 entries and 101/256 buckets used, longest chain length 3, sum of chain length^2 210
> > roles: 15 entries and 9/16 buckets used, longest chain length 3, sum of chain length^2 31
> > types: 4448 entries and 3459/8192 buckets used, longest chain length 5, sum of chain length^2 6778
> > users: 7 entries and 5/8 buckets used, longest chain length 3, sum of chain length^2 13
> > bools: 306 entries and 236/512 buckets used, longest chain length 5, sum of chain length^2 470
> > levels: 1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
> > categories: 1024 entries and 518/1024 buckets used, longest chain length 7, sum of chain length^2 2992
> >
> > Signed-off-by: Christian Göttsche <cgzones@xxxxxxxxxxxxxx>
> > ---
> > security/selinux/ss/symtab.c | 18 ++++++++----------
> > 1 file changed, 8 insertions(+), 10 deletions(-)
> >
> > diff --git a/security/selinux/ss/symtab.c b/security/selinux/ss/symtab.c
> > index 43d7f0319ccd..b6761b96eee4 100644
> > --- a/security/selinux/ss/symtab.c
> > +++ b/security/selinux/ss/symtab.c
> > @@ -11,16 +11,14 @@
> >
> > static unsigned int symhash(const void *key)
> > {
> > - const char *p, *keyp;
> > - unsigned int size;
> > - unsigned int val;
> > -
> > - val = 0;
> > - keyp = key;
> > - size = strlen(keyp);
> > - for (p = keyp; (p - keyp) < size; p++)
> > - val = (val << 4 | (val >> (8*sizeof(unsigned int)-4))) ^ (*p);
> > - return val;
> > + /* djb2a */
>
> Do we need/want something that specifies the author/license (I assume
> public domain) of this code?

It would be a good idea, yes. I spent some time looking around, and
while there are many references to djb2 (and the djb2a xor version), I
had a hard time finding an explicit license statement from DJB.
However, in the cdb v0.75 release there is a source file, cdb_hash.c,
which has effectively the same hash (same algorithm, different
implementation) with a "/* Public domain. */" statement at the top.
IANAL, but I think that is good enough.

Christian, can you update this patch comments with this info?

http://cr.yp.to/cdb.html

> > + unsigned int hash = 5381;
> > + unsigned char c;
> > +
> > + while ((c = *(const unsigned char *)key++))
> > + hash = ((hash << 5) + hash) ^ c;
> > +
> > + return hash;
> > }
> >
> > static int symcmp(const void *key1, const void *key2)
> > --
> > 2.40.1

--
paul-moore.com