Lumiera
The new emerging NLE for GNU/Linux

This page is for collecting know-how related to hash functions and hash tables.

The original STL was lacking proper support for hashtables, hash based associative arrays and hash calculation in general. To quite some developers, hash tables feel like some kind of impure data structure — unfortunately the properties of modern CPUs turned the balance significantly in favour of hash tables due to memory locality. Pointer based datastructures can’t be considered especially performant as they were in the good old times.

The tr1 extension and the new C++11 standard amended the problem by defining a framework for hash functions and hash tables. When sticking to some rules, custom written hash functions can be picked up automatically by the standard library and -containers.

Standard Hash Definitions

Hash values

hash values are unsigned integral numbers of type size_t

Basically this means that the range of hash values roughly matches the memory address space. But it also means that this range is platform dependant (32 or 64bit) and — given the usual hash calculation based on modulus (wrap around) — that generated hash values are nonportable.

Hash function

a hash function calculates a hash value for objects of its argument type. Thus, for every supported type, there is a dedicated hash function. Quite some hash functions are generated from function templates though.

Hash functor

a function object able to calculate hash values when invoked. The standard library and the corresponding boost libraries accept functors of type hash<TY> to calculate hash values for objects or values of type TY

Hash based containers

While the standard Set and Map types (including the Multiset and Multimap) are based on balanced binary trees, the new C++11 standard includes hash based variants (with name prefix unordered_). These hashtable based containers require a hash<KEY> functor to be able to derive the hash value of any encountered key value. Hash functors may be provided as additional type parameter to the container; if omitted, the compiler tries to find a (maybe custom defined) hash functor by ADL (see below)

C++11 versus Boost

The Boost library functional-hash provided the foundation for the definition now accepted into the new C++ standard. Yet the boost library provides some additional facilities not part of the standard. Thus we’re bound to choose

  • either including <functional> and using std::hash

  • or including <boost/functional-hash> and using boost::hash

The boost version additionally provides pre defined hash functors for STL containers holding custom types — and it provides an easy to use extension mechanism for writing hash functions for custom types. Effectively this means that, assuming usage of the boost-include, the actual implementation and the way it is picked up is different but compatible to the standard way.

Boost: hashing custom types

The extension mechanism used by the boost version is best explained by looking at the code

boost/functional/hash/extensions.hpp
    template <class T> struct hash
        : std::unary_function<T, std::size_t>
    {
        std::size_t operator()(T const& val) const
        {
            return hash_value(val);
        }
    }

So this templated standard implementation just invokes an unqualified function with the name hash_value(val) — when instantiating this template for your custom class or type, the compiler will search this function not only in the current scope, but also in the namespace defining your custom type T (this mechanism is known as “Argument Dependant Lookup”). Meaning that all we’d need to do is to define a free function or friend function named hash_value alongside with our custom data types (classes).

To further facilitate providing custom hash functions, boost defines a function boost::hash_combine(size_t seed, size_t hashValue), allowing to chain up the calculated hash values of the parts forming a composite data structure.

  • see Lumiera’s asset::Category for a simple usage example

  • our lib::Symbol datatype uses the standard implementation of a string hash function combining the individual character’s hashes.

LUID values

Lumiera’s uniform identifier values shouldn’t be confused with regular hash values. The purpose of LUID values is to use just plain random numbers as ID values. But, because of using such a incredibly large number space (128bit), we can just assume any collision between such random LUID to be so unlikely as to reasonably ignore this possibility altogether. Let’s say, the collision of random LUID values won’t ever happen, same as the meltdown of an atomic power plant, which, as we all know, won’t ever happen either.

Relation to hash values

When objects incorporate sich an unique LUID, this provides for a prime candidate to derive hash values as a side-effect of that design: Since incorporating an LUID typically means that this object has an distinguishable identity, all objects with the same LUID should be considered equivalent and thus hash to the same value. Consequently we can just use a size_t prefix of the LUID bitstring as hash value, without any further calculations.