Lightweight 8-byte hash function algorithm - C++

I need to extract an 8-byte digest from a variable-length string, so I'm looking for such an algorithm to implement in C/C++. It will be part of a digital signature procedure on a microcontroller, so it has to be:
writable in a few lines of code, since the firmware has to be kept as small as possible;
low in resource consumption, especially RAM (preferably less than 100 bytes);
strong enough that changing a single character at any point in the string would change the overall digest.
I took a look at existing algorithms such as CRC64, but they seem to be too heavy for my platform.

There is no chance to do a secure hash in 64 bits; even SHA-1 at 160 bits is considered theoretically broken. You should use SHA-256 if you really care about secure digital signing. If you don't care about security and just want a hash function that avoids non-adversarial collisions, the following is fine:
#include <cstdint>

constexpr uint64_t P1 = 7;
constexpr uint64_t P2 = 31;

uint64_t hash = P1;
for (const char* p = s; *p != 0; p++) {
    hash = hash * P2 + *p;
}

As AndrewTomazos-Fathomling said, it's impossible to do a secure hash in 64 bits, so if that's your intention then my advice is STOP, pick up a book, and read about cryptographically secure hashing.
If you don't plan on using this as a secure hash and you don't care about collisions or attacks, then the answer he gave you works just fine, and you can tweak the primes P1 and P2 as necessary. I will give you another alternative that allows you to do tagged hashing and mixes things up more.
// Disclaimer: I make no claims about the quality of this particular hash - it's
// certainly not a cryptographically secure hash, nor should it *ever* be
// construed as such.
unsigned long long quickhash64(const char *str, unsigned long long mix = 0)
{   // set 'mix' to some value other than zero if you want a tagged hash
    const unsigned long long mulp = 2654435789;

    mix ^= 104395301;
    while (*str)
        mix += (*str++ * mulp) ^ (mix >> 23);
    return mix ^ (mix << 37);
}

Here is a 64-bit adaptation of a 32-bit hash I found in my old source files:
static unsigned long long llhash(const char *str)
{
    unsigned long long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;  // hash * 33 + c
    return hash;
}
But hashing will always produce collisions eventually, since the input space is larger than 64 bits. Of course, some algorithms are better than others.
Edit:
I found the source of the 32-bit version: http://www.cse.yorku.ca/~oz/hash.html

I had the exact same requirement, and I settled on FNV-1a, after dismissing SipHash (implemented by Bloomberg here).
I found an FNV implementation here:
https://github.com/foonathan/string_id/blob/master/hash.hpp
which is:
#include <cstdint>

constexpr uint64_t fnv_basis = 14695981039346656037ull;
constexpr uint64_t fnv_prime = 1099511628211ull;

// FNV-1a 64 bit hash of null terminated buffer
// (the cast keeps bytes above 0x7F from sign-extending through the xor)
uint64_t fnv1a_hash(const char* str, uint64_t hash = fnv_basis)
{
    return *str ? fnv1a_hash(str + 1, (hash ^ static_cast<unsigned char>(*str)) * fnv_prime)
                : hash;
}
It loops using tail recursion, and the stop condition is the null byte (an iterative version is sketched below).
(Boost uses hash_range, which hash_combines each element in a chain, I believe.)
The license is zlib and the copyright is Jonathan Müller's. Though I'm not convinced a one-liner can be legally licensed if it implements research by other people (Fowler-Noll-Vo).
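On a microcontroller you may prefer a plain loop; not every embedded compiler performs tail-call elimination, and a long string would otherwise grow the stack. A minimal iterative equivalent of the same FNV-1a hash:

#include <cstdint>

// Iterative FNV-1a over a null-terminated string; same constants as above.
uint64_t fnv1a_hash_iter(const char* str)
{
    uint64_t hash = 14695981039346656037ull;       // FNV offset basis
    for (; *str; ++str) {
        hash ^= static_cast<unsigned char>(*str);  // fold in the next byte
        hash *= 1099511628211ull;                  // multiply by the FNV prime
    }
    return hash;
}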

Related

Right shift a binary string efficiently in C++

If I have a string that represents an integer in binary form such as
1101101
and I want to circularly right shift it to obtain
1110110
One way I can think of is to convert the string into an int and use the following (taken from Wikipedia):
#include <climits>      // CHAR_BIT
#include <cstddef>      // size_t
#include <type_traits>  // std::is_unsigned

// https://stackoverflow.com/a/776550/3770260
template <typename INT>
#if __cplusplus > 201100L // Apply constexpr to C++11 to ease optimization
constexpr
#endif // See also https://stackoverflow.com/a/7269693/3770260
INT rol(INT val, size_t len)
{
#if __cplusplus > 201100L && _wp_force_unsigned_rotate // Unsigned check makes sense from C++11 on
    static_assert(std::is_unsigned<INT>::value,
                  "Rotate Left only makes sense for unsigned types");
#endif
    return (val << len) | ((unsigned)val >> (-len & (sizeof(INT) * CHAR_BIT - 1)));
}
However, if the string consists of, say, 10^6 chars, this does not work, as the integer representation exceeds even the range of __int64.
In that case I can think of a solution that loops over the string:
// let str be a char string of length n
char temp = str[n - 1];
for (int i = n - 1; i > 0; i--)
{
    str[i] = str[i - 1];
}
str[0] = temp;
This solution runs in O(n) due to the loop over the length of the string, n. My question is: is there a much more efficient way to implement circular shifting for large binary strings?
EDIT
Both input and output are std::strings
You have to move memory one way or another, so your proposed solution is as fast as it gets.
You might also use standard std::string functions:
str.insert(str.begin(), str[n - 1]);
str.erase(str.end() - 1);
or memmove or memcpy (I don't actually recommend this; it's for the sake of argument):

char temp = str[n - 1];
memmove(str.data() + 1, str.data(), n - 1);
str[0] = temp;
Note that memmove may look faster, but it's essentially the same thing as your loop. It is moving bytes one by one, it's just encapsulated in a different function. This method might be faster for much larger data blocks, of size 1000 bytes or more, since the CPU is optimized to move large chunks of memory. But you won't be able to measure any difference for 10 or 20 bytes.
Moreover, the compiler will most likely run additional optimizations when it sees your for loop, it realizes that you are moving memory and chooses the best option.
The compiler is also good at dealing with std::string methods. These are common operations, and the compiler knows the best way to handle them.
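For completeness, the standard library has an algorithm for exactly this; a circular right shift by one over a std::string can be written with std::rotate on reverse iterators (still O(n) moves, it just pushes the loop into the library):

#include <algorithm>
#include <string>

// Circular right shift by one: rotating the reversed range left by one
// is the same as rotating the original range right by one.
void rotate_right_by_one(std::string& str)
{
    if (str.size() > 1)
        std::rotate(str.rbegin(), str.rbegin() + 1, str.rend());
}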

C hack for storing a bit that takes 1 bit space?

I have a long list of numbers between 0 and 67600. I want to store them using an array that is 67600 elements long. An element is set to 1 if the number is in the set and to 0 if it is not, i.e. I need only 1 bit of information to store the presence of each number. Is there any hack in C/C++ that helps me achieve this?
In C++ you can use std::vector<bool> if the size is dynamic (it's a special case of std::vector, see this) otherwise there is std::bitset (prefer std::bitset if possible.) There is also boost::dynamic_bitset if you need to set/change the size at runtime. You can find info on it here, it is pretty cool!
In C (and C++) you can implement this manually with bitwise operators. A good summary of common operations is here. One thing I want to mention: it's a good idea to use unsigned integers when you are doing bit operations, since << and >> are undefined when shifting negative integers.
You will need to allocate an array of some integral type like uint32_t. If you want to store N bits, it will take N/32 (rounded up) of these uint32_ts. Bit i is stored in the (i % 32)'th bit of the (i / 32)'th uint32_t. You may want to use a differently sized integral type depending on your architecture and other constraints. A sketch of this approach is given below.
Note: prefer using an existing implementation (e.g. as described in the first paragraph for C++; search for C solutions) over rolling your own, unless you specifically want to, in which case I suggest learning more about binary/bit manipulation from elsewhere before tackling this. This kind of thing has been done to death and there are "good" solutions.
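A minimal sketch of that layout (the names NUM_VALUES, set_bit, etc. are mine, purely illustrative; this compiles as both C and C++):

#include <stdint.h>

#define NUM_VALUES 67601  /* values 0..67600 inclusive */

static uint32_t bits[(NUM_VALUES + 31) / 32];  /* zero-initialized, ~8.3 KB */

static void set_bit(uint32_t i)   { bits[i / 32] |=  (uint32_t)1 << (i % 32); }
static void clear_bit(uint32_t i) { bits[i / 32] &= ~((uint32_t)1 << (i % 32)); }
static int  test_bit(uint32_t i)  { return (bits[i / 32] >> (i % 32)) & 1u; }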
There are a number of tricks that may consume only one bit per value, e.g. arrays of bitfields (applicable in C as well), but whether less space actually gets used is up to the compiler. See this link.
Please note that whatever you do, you will almost surely never be able to use exactly N bits to store N bits of information: your computer very likely can't allocate less than 8 bits, so if you want 7 bits you'll have to waste 1 bit, and if you want 9 you'll have to take 16 bits and waste 7 of them. Even if your computer (CPU + RAM, etc.) could "operate" on single bits, an allocator running under an OS with malloc/new could not sanely track data at such fine granularity, due to the overhead. That last qualification was pretty silly - you won't find an architecture in use that allows you to operate on less than 8 bits at a time, I imagine :)
You should use std::bitset.
std::bitset functions like an array of bool (actually like std::array, since it copies by value), but only uses 1 bit of storage for each element.
Another option is vector<bool>, which I don't recommend because:
It uses slower pointer indirection and heap memory to enable resizing, which you don't need.
That type is often maligned by standards-purists because it claims to be a standard container, but fails to adhere to the definition of a standard container*.
*For example, a standard-conforming function could expect &container.front() to produce a pointer to the first element of any container type, which fails with std::vector<bool>. Perhaps a nitpick for your usage case, but still worth knowing about.
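A quick sketch for this exact problem (67601 covers the values 0 through 67600 inclusive):

#include <bitset>
#include <cstdio>

std::bitset<67601> present;  // one bit per possible value, ~8.3 KB, zero-initialized

int main()
{
    present.set(42);     // mark 42 as present
    present.set(67600);  // highest valid value
    std::printf("%d %d\n", (int)present.test(67600), (int)present.test(0));  // prints: 1 0
}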
There is, in fact! std::vector<bool> has a specialization for this: http://en.cppreference.com/w/cpp/container/vector_bool
See the doc; it stores the bits as efficiently as possible.
Edit: as somebody else said, std::bitset is also available: http://en.cppreference.com/w/cpp/utility/bitset
If you want to write it in C, have an array of char that holds 67601 bits (67601/8 rounded up = 8451 bytes) and then turn on/off the appropriate bit for each value.
Others have given the right idea. Here's my own implementation of a bitsarr, or 'array' of bits. An unsigned char is one byte, so it's essentially an array of unsigned chars that stores information in individual bits. I added the option of storing TWO or FOUR bit values in addition to ONE bit values, because those both divide 8 (the size of a byte), and would be useful if you want to store a huge number of integers that will range from 0-3 or 0-15.
When setting and getting, the math is done in the functions, so you can just give it an index as if it were a normal array--it knows where to look.
Also, it's the user's responsibility to not pass a value to set that's too large, or it will screw up other values. It could be modified so that overflow loops back around to 0, but that would just make it more convoluted, so I decided to trust myself.
#include <stdio.h>
#include <stdlib.h>

#define BYTE 8

typedef enum {ONE=1, TWO=2, FOUR=4} numbits;

typedef struct bitsarr {
    unsigned char* buckets;
    numbits n;
} bitsarr;

bitsarr new_bitsarr(int size, numbits n)
{
    int b = sizeof(unsigned char)*BYTE;
    int numbuckets = (size*n + b - 1)/b;
    bitsarr ret;
    ret.buckets = malloc(sizeof(*ret.buckets)*numbuckets);  /* sizeof(*ret.buckets), not sizeof(ret.buckets): the latter is the pointer size and over-allocates */
    ret.n = n;
    return ret;
}

void bitsarr_delete(bitsarr xp)
{
    free(xp.buckets);
}

void bitsarr_set(bitsarr *xp, int index, int value)
{
    int buckdex, innerdex;
    buckdex = index/(BYTE/xp->n);
    innerdex = index%(BYTE/xp->n);
    xp->buckets[buckdex] = (value << innerdex*xp->n) | ((~(((1 << xp->n) - 1) << innerdex*xp->n)) & xp->buckets[buckdex]);

    //longer version
    /*unsigned int width, width_in_place, zeros, old, newbits, new;
    width = (1 << xp->n) - 1;
    width_in_place = width << innerdex*xp->n;
    zeros = ~width_in_place;
    old = xp->buckets[buckdex];
    old = old & zeros;
    newbits = value << innerdex*xp->n;
    new = newbits | old;
    xp->buckets[buckdex] = new; */
}

int bitsarr_get(bitsarr *xp, int index)
{
    int buckdex, innerdex;
    buckdex = index/(BYTE/xp->n);
    innerdex = index%(BYTE/xp->n);
    return ((((1 << xp->n) - 1) << innerdex*xp->n) & (xp->buckets[buckdex])) >> innerdex*xp->n;

    //longer version
    /*unsigned int width = (1 << xp->n) - 1;
    unsigned int width_in_place = width << innerdex*xp->n;
    unsigned int val = xp->buckets[buckdex];
    unsigned int retshifted = width_in_place & val;
    unsigned int ret = retshifted >> innerdex*xp->n;
    return ret; */
}

int main()
{
    bitsarr x = new_bitsarr(100, FOUR);
    for(int i = 0; i<16; i++)
        bitsarr_set(&x, i, i);
    for(int i = 0; i<16; i++)
        printf("%d\n", bitsarr_get(&x, i));
    for(int i = 0; i<16; i++)
        bitsarr_set(&x, i, 15-i);
    for(int i = 0; i<16; i++)
        printf("%d\n", bitsarr_get(&x, i));
    bitsarr_delete(x);
}

How to store double - endian independent

Despite the fact that big-endian computers are not very widely used, I want to store the double datatype in an endian-independent format.
For int, this is really simple, since bit shifts make it very convenient:
int number;
const int size = sizeof(number);
char bytes[size];

for (int i = 0; i < size; ++i)
    bytes[size - 1 - i] = (number >> 8 * i) & 0xFF;

This code snippet stores the number in big-endian format, regardless of the machine it runs on. What is the most elegant way to do this for double?
The best way for portability, while taking the storage format into account, is to serialize/deserialize the mantissa and the exponent separately. For that you can use the frexp()/ldexp() functions.
For example, to serialize:

int exp;
unsigned long long mant;

// note: frexp returns a fraction in [0.5, 1) for positive inputs, so as
// written this assumes number >= 0; a sign would have to be serialized too
mant = (unsigned long long)(ULLONG_MAX * frexp(number, &exp));
// then serialize exp and mant.

And then to deserialize:

// deserialize to exp and mant.
double result = ldexp((double)mant / ULLONG_MAX, exp);
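Putting both halves together, a self-contained round trip of this scheme might look as follows (assuming a non-negative input, per the note above; the last bit or two may differ because ULLONG_MAX is not exactly representable as a double):

#include <climits>
#include <cmath>
#include <cstdio>

int main()
{
    double number = 3.141592653589793;

    // serialize: split into an exponent and an integer mantissa
    int exp;
    unsigned long long mant =
        (unsigned long long)(ULLONG_MAX * std::frexp(number, &exp));
    // ... write exp and mant in any fixed byte order, then read them back ...

    // deserialize: rebuild the double
    double result = std::ldexp((double)mant / ULLONG_MAX, exp);
    std::printf("%.17g -> %.17g\n", number, result);
}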
The elegant thing to do is to limit the endianness problem to as small a scope as possible. That narrow scope is the I/O boundary between your program and the outside world. For example, the functions that send binary data to / receive binary data from some other application need to be aware of the endian problem, as do the functions that write binary data to / read binary data from some data file. Make those interfaces cognizant of the representation problem.
Make everything else blissfully ignorant of the problem. Use the local representation everywhere else. Represent a double precision floating point number as a double rather than an array of 8 bytes, represent a 32 bit integer as an int or int32_t rather than an array of 4 bytes, et cetera. Dealing with the endianness problem throughout your code is going to make your code bloated, error prone, and ugly.
The same approach works. Any numeric object, including double, is in the end several bytes that are interpreted in a specific order according to endianness. So if you reverse the order of the bytes, you get exactly the same value in the reversed endianness.
char *src_data;   // points at N doubles' worth of input bytes
char *dst_data;   // points at a destination buffer of the same size

for (i = 0; i < N * sizeof(double); i++)
    *dst_data++ = src_data[i ^ mask];

// where mask = 7 if native == little endian
//       mask = 0 if native == big endian

The elegance lies in mask, which also handles short and int types: it is sizeof(elem) - 1 if the source and target endianness differ, and 0 otherwise.
Not very portable and standards-violating, but something like this:

#include <array>
#include <cstddef>
#include <cstdint>

std::array<unsigned char, 8> serialize_double( double const* d )
{
    std::array<unsigned char, 8> retval;
    char const* begin = reinterpret_cast<char const*>(d);

    union
    {
        uint8_t  i8s[8];
        uint16_t i16s[4];
        uint32_t i32s[2];
        uint64_t i64s;
    } u;
    u.i64s = 0x0001020304050607ull;    // one byte order
    // u.i64s = 0x0706050403020100ull; // the other byte order

    for (size_t index = 0; index < 8; ++index)
    {
        retval[ u.i8s[index] ] = begin[index];
    }
    return retval;
}
This might handle a platform with 8-bit chars, 8-byte doubles, and any crazy-ass byte ordering (i.e., big-endian within words but little-endian between words for 64-bit values, for example).
Now, this doesn't cover the case where the endianness of doubles differs from that of 64-bit ints.
An easier approach might be to reinterpret your double as a 64-bit unsigned value, then output that as you would any other integer.
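A minimal sketch of that idea, assuming 64-bit IEEE doubles (memcpy is the well-defined way to reinterpret the bits; the shift loop is the same one the question uses for int):

#include <cstdint>
#include <cstring>

// Store a double as 8 big-endian bytes, via its bit pattern.
void double_to_be_bytes(double number, unsigned char bytes[8])
{
    uint64_t bits;
    std::memcpy(&bits, &number, sizeof bits);  // reinterpret, don't convert
    for (int i = 0; i < 8; ++i)
        bytes[7 - i] = (bits >> (8 * i)) & 0xFF;
}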
#include <algorithm>
#include <cstring>

void reverse_endian(double number, char (&bytes)[sizeof(double)])
{
    const int size = sizeof(number);
    memcpy(bytes, &number, size);
    for (int i = 0; i < size / 2; ++i)
        std::swap(bytes[i], bytes[size - i - 1]);
}

Appropriate hashing function to hash random binary strings

I have an array char data1[length], where length is a multiple of 8, i.e. length can be 8, 16, 24, ... The array contains binary data read from a file that is open in binary mode. I will keep reading from the file, and every time I read I will store the read value in a hash table. The distribution of this binary data is random. I would like to hash each array and store it in a hash table in order to be able to look up the data again later. What would be a good hashing function for this task? Thanks.
Please note that I am writing this in C++ and C, so a solution in either language would be great.
If the data you read is 8 bytes long and really randomly distributed, and your hash code needs to be 32 bits, what about this:

#include <stdint.h>

/* Read 4 bytes as a little-endian 32-bit value; the casts keep
   data[3] << 24 from overflowing a signed int. */
uint32_t get_uint32_le(const unsigned char *data)
{
    uint32_t value = 0;
    value |= (uint32_t)data[0] << 0;
    value |= (uint32_t)data[1] << 8;
    value |= (uint32_t)data[2] << 16;
    value |= (uint32_t)data[3] << 24;
    return value;
}

uint32_t hashcode(const unsigned char *data)
{
    uint32_t hash = 0;
    hash ^= get_uint32_le(data + 0);
    hash ^= get_uint32_le(data + 4);
    return hash;
}
If you need more speed, this code can probably be made a lot faster if you can guarantee that data is always properly aligned to be interpreted as a const uint32_t *.
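Alternatively, a memcpy-based load avoids the alignment requirement entirely, and most compilers turn it into a single move instruction anyway. A sketch (note this reads in native byte order, so the hash value differs between little- and big-endian machines, which is harmless for an in-memory table):

#include <stdint.h>
#include <string.h>

/* Load 4 bytes in native byte order; no alignment requirement. */
static uint32_t load_u32(const unsigned char *data)
{
    uint32_t value;
    memcpy(&value, data, sizeof value);
    return value;
}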
I have successfully used MurmurHash3 in one of my projects.
Pros:
It is fast. Very fast.
It supposedly has a low collision rate.
Cons:
It's not suitable for cryptography applications.
It's not standardized in any shape or form.
It's not portable to non-x86 platforms. However, it's small enough that you should be able to port it if you really need to - I was able to port it to Java, although that's not nearly the same thing.
It's a good possibility for use in e.g. a fast hash-table implementation...
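For reference, calling the reference implementation from Austin Appleby's smhasher project looks roughly like this (check the MurmurHash3.h you ship for the exact declaration; the seed value here is arbitrary):

#include <cstdint>
#include "MurmurHash3.h"  // from the smhasher project

uint32_t hash_block(const unsigned char* data, int len)
{
    uint32_t out;
    MurmurHash3_x86_32(data, len, /*seed=*/0x9747b28c, &out);
    return out;
}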

Is there a way to improve the speed or efficiency of this lookup? (C/C++)

I have a function I've written to convert from a 64-bit integer to a base 62 string. Originally, I achieved this like so:
const char* charset = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int charsetLength = strlen(charset);

std::string integerToKey(unsigned long long input)
{
    unsigned long long num = input;
    std::string key = "";

    while (num)
    {
        key += charset[num % charsetLength];
        num /= charsetLength;
    }
    return key;
}
However, this was too slow.
I improved the speed by providing an option to generate a lookup table. The table is about 624 strings in size, and is generated like so:
// Create the integer to key conversion lookup table
int lookupChars;
if(lookupDisabled)
lookupChars = 1;
else
largeLookup ? lookupChars = 4 : lookupChars = 2;
lookupSize = pow(charsetLength, lookupChars);
integerToKeyLookup = new char*[lookupSize];
for(unsigned long i = 0; i < lookupSize; i++)
{
unsigned long num = i;
int j = 0;
integerToKeyLookup[i] = new char[lookupChars];
while(num)
{
integerToKeyLookup[i][j] = charset[num % charsetLength];
num /= charsetLength;
j++;
}
// Null terminate the string
integerToKeyLookup[i][j] = '\0';
}
The actual conversion then looks like this:
std::string integerToKey(unsigned long long input)
{
    unsigned long long num = input;
    std::string key = "";

    while (num)
    {
        key += integerToKeyLookup[num % lookupSize];
        num /= lookupSize;
    }
    return key;
}
This improved speed by a large margin, but I still believe it can be improved. Memory usage on a 32-bit system is around 300 MB, and more than 400 MB on a 64-bit system. It seems like I should be able to reduce memory and/or improve speed, but I'm not sure how.
If anyone could help me figure out how this table could be further optimized, I'd greatly appreciate it.
Using some kind of string builder rather than repeated concatenation into 'key' would provide a significant speed boost.
You may want to reserve memory in advance for your string key. This may get you a decent performance gain, as well as a gain in memory utilization. Whenever you call the append operator on std::string, it may double the size of the internal buffer if it has to reallocate. This means each string may be taking up significantly more memory than is necessary to store the characters. You can avoid this by reserving memory for the string in advance.
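A sketch of that change against the question's function (charset and charsetLength as in the question; the 11 comes from the later observation that a 64-bit value needs at most 11 base-62 digits):

std::string integerToKey(unsigned long long num)
{
    std::string key;
    key.reserve(11);  // max base-62 digits of a 64-bit value; appends below never reallocate
    while (num)
    {
        key += charset[num % charsetLength];
        num /= charsetLength;
    }
    return key;  // digit order as in the question's code
}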
I agree with Rob Walker - you're concentrating on improving performance in the wrong area. The string is the slowest part.
I timed the code (your original is broken, by the way): your original (when fixed) took 44982140 cycles for 100000 lookups, and the following code takes about 13113670.
#include <utility>  // std::swap

const char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
#define CHARSET_LENGTH 62

// maximum size = 11 chars
void integerToKey(char result[13], unsigned long long input)
{
    char* p = result;
    while (input > 0)
    {
        *p++ = charset[input % CHARSET_LENGTH];
        input /= CHARSET_LENGTH;
    }
    // null termination
    *p = '\0';

    // need to reverse the output (digits come out least significant first)
    char* o = result;
    while (o + 1 < p)
        std::swap(*o++, *--p);
}
This is almost a textbook case of how not to do this. Concatenating strings in a loop is a bad idea, both because appending isn't particularly fast, and because you're constantly allocating memory.
Note: your question states that you're converting to base-62, but the code seems to have 63 symbols. Which are you trying to do?
Given a 64-bit integer, you can calculate that you won't need any more than 11 digits in the result, so using a static 12 character buffer will certainly help improve your speed. On the other hand, it's likely that your C++ library has a long-long equivalent to ultoa, which will be pretty optimal.
Edit: Here's something I whipped up. It allows you to specify any desired base as well:
#include <cassert>
#include <string>

std::string ullToString(unsigned long long v, int base = 64)
{
    assert(base < 65);
    assert(base > 1);

    static const char digits[] =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
    const int max_length = 65;
    static char buffer[max_length];
    buffer[max_length - 1] = 0;

    char *d = buffer + max_length - 1;
    do {
        d--;
        int remainder = v % base;
        v /= base;
        *d = digits[remainder];
    } while (v > 0);
    return d;
}
This only creates one std::string object, and doesn't move memory around unnecessarily. It currently doesn't zero-pad the output, but it's trivial to change it to do that to however many digits of output you want.
You don't need to copy input into num, because you pass it by value. You can also compute the length of charset at compile time; there's no need to compute it at runtime every single time you call the function.
But these are very minor performance issues. I think the most significant gain comes from avoiding the string concatenation in the loop. When you construct the key string, pass the string constructor the length of your result string, so that there is only one allocation. Then, when you write into the string in the loop, you will not reallocate.
You can make things even slightly more efficient if you take the target string as a reference parameter or even as two iterators like the standard algorithms do. But that is arguably a step too far.
By the way, what if the value passed in for input is zero? You won't even enter the loop; shouldn't key then be "0"?
I see the value passed in for input can't be negative, but just so we note: the C remainder operator isn't a modulo operator; for a negative operand it can return a negative result (e.g. -7 % 62 is -7, not 55).
Why not just use a base64 library? Is it really important that 63 encodes as '11' and not as a longer string?
size_t base64_encode(char* outbuffer, size_t maxoutbuflen,
                     const char* inbuffer, size_t inbuflen);

std::string integerToKey(unsigned long long input)
{
    char buffer[14];
    size_t len = base64_encode(buffer, sizeof buffer,
                               (const char*)&input, sizeof input);
    return std::string(buffer, len);
}
Yes, every string will end with an equals sign. If you don't want it to, strip off the equals sign. (Just remember to add it back if you need to decode the number.)
Of course, my real question is: why are you converting a fixed-width 8-byte value to a variable-length string at all, instead of using the 8 bytes directly as your "key"?
Footnote: I'm well aware of the endianness issues with this. He didn't say what the key will be used for, so I assume it isn't being used in network communication between machines of disparate endianness.
If you could add two more symbols so that you're converting to base 64, your modulus and division operations would turn into a bit mask and a shift. Much faster than a division.
If all you need is a short string key, converting to base-64 numbers would speed up things a lot, since div/mod 64 is very cheap (shift/mask).
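To make the shift/mask point concrete, here is a hedged sketch of the digit loop in base 64 (the alphabet and the least-significant-digit-first order mirror the question's code):

#include <string>

// Base-64 digit loop: % 64 becomes & 63 and / 64 becomes >> 6.
std::string integerToKey64(unsigned long long input)
{
    static const char charset64[] =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
    std::string key;
    key.reserve(11);  // 64 bits / 6 bits per digit = at most 11 digits
    do {
        key += charset64[input & 63];  // input % 64, as a mask
        input >>= 6;                   // input / 64, as a shift
    } while (input > 0);
    return key;
}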