How to hash a string to a number [duplicate] - c++

I'm working on a hash table in C and I'm testing a hash function for strings.
The first function I've tried adds up the ASCII codes and takes them modulo 100, but I got poor results with the first test data: 40 collisions for 130 words.
The final input data will contain 8,000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of each word in a .txt file.
Which is the best algorithm for hashing strings?
And how do I determine the size of the hash table?

I've had nice results with djb2 by Dan Bernstein.
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
int hash(char const *input) {
    int result = 0x55555555;

    while (*input) {
        result ^= *input++;
        result = rol(result, 5); // rol = rotate left, e.g. (x << 5) | ((unsigned)x >> 27)
    }
    return result;
}
Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
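For illustration, the reduction step for each choice might look like this (a minimal sketch; the concrete sizes 10007 and 16384 are my assumptions, not from the question):

unsigned index_prime(unsigned long hash) {
    return hash % 10007;        // 10007 is prime: safe with collision resolution like double hashing
}
unsigned index_pow2(unsigned long hash) {
    return hash & (16384 - 1);  // a simple bit-mask: valid only because 16384 is 2^14
}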

I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran a number of small hash functions on a list of 466K English words to see the number of collisions for each:
Hash function | Collisions | Time (words) | Time (file)
=================================================================
CRC32 | 23 (0.005%) | 112 ms | 38 ms
MurmurOAAT | 26 (0.006%) | 86 ms | 10 ms
FNV hash | 32 (0.007%) | 87 ms | 7 ms
Jenkins OAAT | 36 (0.008%) | 90 ms | 8 ms
DJB2 hash | 344 (0.074%) | 87 ms | 5 ms
K&R V2 | 356 (0.076%) | 86 ms | 5 ms
Coffin | 763 (0.164%) | 86 ms | 4 ms
x17 hash | 2242 (0.481%) | 87 ms | 7 ms
-----------------------------------------------------------------
MurmurHash3_x86_32 | 19 (0.004%) | 90 ms | 3 ms
I included times for both: hashing all words individually, and hashing the entire file of all English words once. I also included the more complex MurmurHash3_x86_32 in my test for reference.
Conclusion:
there is almost no point in using the popular DJB2 hash function for strings on x86-64 (or AArch64, for that matter): it has many more collisions than similar functions (MurmurOAAT, FNV and Jenkins OAAT) while having very similar throughput. Bernstein's DJB2 performs especially badly on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.
Test code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 2048
#define SEED    0x12345678

uint32_t DJB2_hash(const uint8_t *str)
{
    uint32_t hash = 5381;
    uint8_t c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}

uint32_t FNV(const void* key, int len, uint32_t h)
{
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    h ^= 2166136261UL;
    const uint8_t* data = (const uint8_t*)key;
    for (int i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619;
    }
    return h;
}

uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
    // One-byte-at-a-time hash based on Murmur's mix
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    for (; *str; ++str) {
        h ^= *str;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

uint32_t KR_v2_hash(const char *s)
{
    // Source: https://stackoverflow.com/a/45641002/5407270
    uint32_t hashval = 0;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval;
}

uint32_t Jenkins_one_at_a_time_hash(const char *str, size_t len)
{
    uint32_t hash, i;
    for (hash = i = 0; i < len; ++i)
    {
        hash += str[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

uint32_t crc32b(const uint8_t *str) {
    // Source: https://stackoverflow.com/a/21001712
    unsigned int byte, crc, mask;
    int i = 0, j;
    crc = 0xFFFFFFFF;
    while (str[i] != 0) {
        byte = str[i];
        crc = crc ^ byte;
        for (j = 7; j >= 0; j--) {
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (0xEDB88320 & mask);
        }
        i = i + 1;
    }
    return ~crc;
}
// 'static' so C99 inline semantics don't require a separate external definition
static inline uint32_t _rotl32(uint32_t x, int32_t bits)
{
    return x<<bits | x>>(32-bits); // C idiom: will be optimized to a single operation
}
uint32_t Coffin_hash(char const *input) {
    // Source: https://stackoverflow.com/a/7666668/5407270
    uint32_t result = 0x55555555;
    while (*input) {
        result ^= *input++;
        result = _rotl32(result, 5);
    }
    return result;
}

uint32_t x17(const void * key, int len, uint32_t h)
{
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    const uint8_t * data = (const uint8_t*)key;
    for (int i = 0; i < len; ++i)
    {
        h = 17 * h + (data[i] - ' ');
    }
    return h ^ (h >> 16);
}

uint32_t apply_hash(int hash, const char* line)
{
    switch (hash) {
        case 1: return crc32b((const uint8_t*)line);
        case 2: return MurmurOAAT_32(line, SEED);
        case 3: return FNV(line, strlen(line), SEED);
        case 4: return Jenkins_one_at_a_time_hash(line, strlen(line));
        case 5: return DJB2_hash((const uint8_t*)line);
        case 6: return KR_v2_hash(line);
        case 7: return Coffin_hash(line);
        case 8: return x17(line, strlen(line), SEED);
        default: break;
    }
    return 0;
}
int main(int argc, char* argv[])
{
    // Read arguments
    if (argc != 3) {
        fprintf(stderr, "usage: %s <hash 1-8> <word list>\n", argv[0]);
        return 1;
    }
    const int hash_choice = atoi(argv[1]);
    char const* const fn = argv[2];
    // Open file
    FILE* f = fopen(fn, "r");
    if (!f) {
        perror(fn);
        return 1;
    }
    // Read file line by line, calculate hash
    char line[MAXLINE];
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0'; // strip newline
        uint32_t hash = apply_hash(hash_choice, line);
        printf("%08x\n", hash);
    }
    fclose(f);
    return 0;
}
P.S. A more comprehensive review of the speed and quality of modern hash functions can be found in Reini Urban's (rurban) SMHasher repository. Note the "Quality problems" column in its table.

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for (hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

There are a number of existing hash table implementations for C, from the POSIX hcreate/hdestroy/hsearch functions to those in APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hash table or hash function; they've been heavily optimized for common use cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.

djb2 has 317 collisions for this 466k English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes).
My recommendation is to use MurmurHash if available; it is very fast because it consumes several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend Murmur's one-byte-at-a-time version:
inline uint32_t MurmurOAAT32(const char* key)
{
    uint32_t h(3323198485ul);
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

inline uint64_t MurmurOAAT64(const char* key)
{
    uint64_t h(525201411107845655ull);
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e9955bd1e995;
        h ^= h >> 47;
    }
    return h;
}
The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory is available, and it might even change, the practical optimum is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that makes your hash table faster, but at rapidly diminishing returns; making it smaller than that makes it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2 - sqrt(2) = 0.58... apparently.
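As a concrete sketch of that sizing rule (my illustration, assuming a power-of-two table so the index can later be computed with a mask):

#include <stddef.h>

size_t pick_capacity(size_t expected_elems) {
    size_t cap = 1;
    while (cap < 2 * expected_elems)  // target a load factor of about 0.5
        cap <<= 1;
    return cap;                       // e.g. pick_capacity(8000) == 16384
}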

djb2 is good
Though djb2, as presented on Stack Overflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:
One of the K&R hashes is terrible, one is probably pretty good:
Apparently a terrible hash algorithm, as presented in K&R 1st edition. This is simply a summation of all bytes in the string (source):
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 0; // same type as the return value
    int c;

    while ((c = *str++))
        hash += c;

    return hash;
}
Probably a pretty decent hash algorithm, as presented in K&R 2nd edition (verified by me on p. 144 of the book). NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash function. Also, I recommend making the return type and "hashval" unsigned long, or better yet uint32_t or uint64_t, instead of plain unsigned (int). This simple algorithm takes the byte order of the string into account by computing hashval = new_byte + 31*hashval for each byte:
unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;

    return hashval % HASHSIZE;
}
Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is that it does NOT take string character order into account: hash("ab") returns the same value as hash("ba"). Not so with the 2nd edition hash, which would (much better!) return two different values for those strings.
The GCC C++11 hash function used by std::unordered_map<> (a hash table template) and std::unordered_set<> (a hash set template) is excellent. GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
The implementations are in the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc" (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc). Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):
Code:
// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
    const size_t m = 0x5bd1e995;
    size_t hash = seed ^ len;
    const char* buf = static_cast<const char*>(ptr);

    // Mix 4 bytes at a time into the hash.
    while (len >= 4)
    {
        size_t k = unaligned_load(buf); // helper defined earlier in hash_bytes.cc
        k *= m;
        k ^= k >> 24;
        k *= m;
        hash *= m;
        hash ^= k;
        buf += 4;
        len -= 4;
    }

    // Handle the last few bytes of the input array.
    switch (len)
    {
        case 3:
            hash ^= static_cast<unsigned char>(buf[2]) << 16;
            [[gnu::fallthrough]];
        case 2:
            hash ^= static_cast<unsigned char>(buf[1]) << 8;
            [[gnu::fallthrough]];
        case 1:
            hash ^= static_cast<unsigned char>(buf[0]);
            hash *= m;
    };

    // Do a few final mixes of the hash.
    hash ^= hash >> 13;
    hash *= m;
    hash ^= hash >> 15;
    return hash;
}
MurmurHash3 by Austin Appleby is best! It's an improvement over even his GCC C++11 std::unordered_map<> hash used above.
Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
See also
Other hash functions to try out and test: http://www.cse.yorku.ca/~oz/hash.html. Hash algorithms mentioned there:
djb2
sdbm
lose lose (K&R 1st edition)

First, is 40 collisions for 130 words hashed into 0..99 actually bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.
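To put a number on that (a back-of-the-envelope sketch, my illustration): the expected number of collisions when throwing n keys uniformly at random into m buckets is n - m*(1 - (1 - 1/m)^n).

#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 130, m = 100;
    double expected = n - m * (1.0 - pow(1.0 - 1.0 / m, n));
    printf("%.1f\n", expected);  // prints ~57.1, so 40 observed collisions beats random
    return 0;
}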
A hash function with a good reputation is MurmurHash3.
Finally, regarding the size of the hash table: it really depends what kind of hash table you have in mind, especially whether buckets are extensible or one-slot. If buckets are extensible, there is again a choice: you choose the average bucket length for the memory/speed constraints that you have.

I have tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long, 64 chars in different order, hash value 32-bit. Code from here.
Hash function | collision rate | how many minutes to finish
==============================================================
MurmurHash3 | 6.?% | 4m15s
Jenkins One.. | 6.1% | 6m54s
Bob, 1st in link | 6.16% | 5m34s
SuperFastHash | 10% | 4m58s
bernstein | 20% | 14s only finish 1/20
one_at_a_time | 6.16% | 7m5s
crc | 6.16% | 7m56s
One strange thing is that almost all the hash functions have a 6% collision rate on my data.

One thing I've used with good results is the following (I don't know if it's been mentioned already because I can't remember its name).
You precompute a table T with a random number for each character of your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible; if you really do run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
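A minimal sketch of that scheme (my illustration; the technique is known in the literature as tabulation or Zobrist hashing, and rand() stands in for whatever generator you prefer):

#include <stdint.h>
#include <stdlib.h>

static uint32_t T[256];

void init_table(unsigned seed) {
    srand(seed);
    for (int i = 0; i < 256; ++i)
        T[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

uint32_t tab_hash(const unsigned char *key) {
    uint32_t h = 0;
    while (*key)
        h ^= T[*key++];  // XOR the table entry for each character
    return h;
}

One caveat: plain XOR is order-independent, so anagrams collide; classic tabulation hashing avoids this by using a separate table per character position.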

Related

How would you transpose a binary matrix?

I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason I'm doing it this way is that computing the product of such a matrix and an 8-bit vector then becomes really simple and efficient (just one bitwise AND and a parity computation per row), which is much better than calculating each bit individually.
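For concreteness, that product looks roughly like this (a sketch of the row-AND-plus-parity approach just described; the names are mine, and nrows is assumed to be at most 8):

#include <stdint.h>

uint8_t mat_vec_mul(const uint8_t *matrix, int nrows, uint8_t vec) {
    uint8_t out = 0;
    for (int r = 0; r < nrows; ++r) {
        uint8_t x = matrix[r] & vec;            // bitwise AND with the row
        x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;  // fold down to the parity bit
        out |= (uint8_t)((x & 1) << r);         // one result bit per row
    }
    return out;
}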
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.
I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing an uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
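A minimal sketch of those steps (my code, not mischasan's; the column/bit ordering is a convention, so flip the loop if your packing differs):

#include <emmintrin.h>
#include <stdint.h>

void transpose16x8(const uint8_t in[16], uint16_t out[8]) {
    __m128i x = _mm_loadu_si128((const __m128i*)in);  // pack 16 rows of one byte each
    for (int i = 8; i--; ) {
        out[i] = (uint16_t)_mm_movemask_epi8(x);      // MSB of each byte -> one 16-bit output row
        x = _mm_slli_epi64(x, 1);                     // bring the next column into the MSB position
    }
}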
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPUs (contrary to the Cortex-A) have no SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any answers that suggest treating an 8x8 block as a uint64_t. Most microcontrollers with a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply by such that the MSBs of the four bytes end up next to each other in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply by the magic number shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants, so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as for the first series of 4 bits.
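The basic step, gathering bit 7 of each byte, might look like this (my sketch of the trick described above, using the listed constants):

#include <stdint.h>

uint32_t gather_msbs(uint32_t x) {
    uint64_t masked = (uint64_t)(x & 0x80808080u); // keep bit 7 of each byte
    uint64_t prod   = masked * 0x02040810u;        // lands the four bits at positions 32..35
    return (uint32_t)(prod >> 32) & 0xFu;          // b0's MSB in bit 0, ..., b3's in bit 3
}

(The 64-bit product compiles to a single 32x32->64 multiply, e.g. UMULL on Cortex-M3/M4.)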
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.
My suggestion is that you don't do the transposition; rather, you add one bit of information to your matrix data indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposed matrix by a vector, that is the same as multiplying the vector by the matrix on the left (and then transposing). This is easy: just some XOR operations on your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comments you say that multiplication is exactly what you want to optimize.
Here is the text of Jay Foad's email to me regarding fast Boolean matrix transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
    return
          (word & 0x0100000000000000) >> 49 // top right corner
        | (word & 0x0201000000000000) >> 42
        | ...
        | (word & 0x4020100804020100) >> 7  // just above diagonal
        | (word & 0x8040201008040201)       // leading diagonal
        | (word & 0x0080402010080402) << 7  // just below diagonal
        | ...
        | (word & 0x0000000000008040) << 42
        | (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
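The elided masks can be filled in mechanically from the layout described above (row-major, MSB to LSB); a complete version of that sketch, my completion rather than Jay Foad's code, is:

#include <stdint.h>

uint64_t transpose8x8(uint64_t w) {
    return  ((w & 0x0100000000000000ULL) >> 49)  // top right corner
          | ((w & 0x0201000000000000ULL) >> 42)
          | ((w & 0x0402010000000000ULL) >> 35)
          | ((w & 0x0804020100000000ULL) >> 28)
          | ((w & 0x1008040201000000ULL) >> 21)
          | ((w & 0x2010080402010000ULL) >> 14)
          | ((w & 0x4020100804020100ULL) >> 7)   // just above the diagonal
          |  (w & 0x8040201008040201ULL)         // leading diagonal
          | ((w & 0x0080402010080402ULL) << 7)   // just below the diagonal
          | ((w & 0x0000804020100804ULL) << 14)
          | ((w & 0x0000008040201008ULL) << 21)
          | ((w & 0x0000000080402010ULL) << 28)
          | ((w & 0x0000000000804020ULL) << 35)
          | ((w & 0x0000000000008040ULL) << 42)
          | ((w & 0x0000000000000080ULL) << 49); // bottom left corner
}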
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
    return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle

transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
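On x86 with BMI2, that pseudocode maps directly onto the _pdep_u64 intrinsic (a sketch; requires a BMI2-capable CPU and, with GCC/Clang, -mbmi2):

#include <immintrin.h>
#include <stdint.h>

static uint64_t outer_shuffle(uint64_t w) {
    return _pdep_u64(w >> 32, 0xaaaaaaaaaaaaaaaaULL)          // high half into the odd bits
         | _pdep_u64(w & 0xffffffffULL, 0x5555555555555555ULL); // low half into the even bits
}

static uint64_t transpose8x8_pdep(uint64_t w) {
    return outer_shuffle(outer_shuffle(outer_shuffle(w))); // three outer shuffles = transpose
}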
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)
My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is that with the current definition of your matrix, the maximum size will be 8x8 bits. This fits into a uint64_t, so we can use that to our advantage, especially when using a 64-bit platform.
I have worked out a simple example using a lookup table, which you can find below and run using the http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>

using std::cout;
using std::endl;
using std::bitset;

/* Static lookup table */
static uint64_t lut[256];

/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
    for(int i=0; i < N; ++i){
        cout << bitset<8>(arr[i]) << endl;
    }
}

/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
    assert(N <= 8);

    uint64_t value = 0;
    for(int i=0; i < N; ++i){
        value = (value << 1) + lut[matrix[i]];
    }

    /* Ensure safe copy to prevent misalignment issues */
    /* Can be removed if input array can be treated as uint64_t directly */
    for(int i=0; i < 8; ++i){
        transposed[i] = (value >> (i * 8)) & 0xFF;
    }
}

/* Calculate lookup table */
void calculate_lut(void){
    /* For all byte values */
    for(uint64_t i = 0; i < 256; ++i){
        auto b = std::bitset<8>(i);
        auto v = std::bitset<64>(0);

        /* For all bits in current byte */
        for(int bit=0; bit < 8; ++bit){
            if(b.test(bit)){
                v.set((7 - bit) * 8);
            }
        }

        lut[i] = v.to_ullong();
    }
}

int main()
{
    calculate_lut();

    const uint8_t matrix[] = {
        0b01010101,
        0b00110011,
        0b00001111,
    };

    uint8_t transposed[8];

    transpose_bitmatrix(matrix, transposed);
    print_arr(transposed);

    return 0;
}
How it works
Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits, i.e. your "horizontal" representation, to a vertical one, divided over several bytes.
As mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We will use this to our advantage because now we can use a uint64_t to write the 8-byte array, but we can also use it to add, xor, etc., because we can perform basic arithmetic operations on a 64-bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide. To optimize processing, we first generate a 256-entry lookup table (one entry per byte value). Each entry is a uint64_t and contains a spread-out version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0101010100000100 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 } (bytes read from least to most significant)
Now for the calculation:
For the calculation we use the uint64_t, but keep in mind that under water it represents a uint8_t[8] array. We simply shift the current value (starting at 0) left by one, look up the entry for the current byte, and add it to the value.
The 'magic' here is that each byte of the uint64_t in the lookup table is either 1 or 0, so adding it only sets the least significant bit (of each byte). Shifting the uint64_t shifts each byte, as long as we make sure we do not do this more than 8 times! This way we can do operations on each byte individually.
Issues
As someone noted in the comments: Transpose(Transpose(M)) != M here, so if you need that, you need some additional work.
Performance can be improved by mapping uint64_t's directly instead of uint8_t[8] arrays, since that omits the "safe copy" that prevents alignment issues.
I have added a new answer instead of editing my original one to make this more visible (no comment rights, unfortunately).
In your own answer you add an additional requirement not present in the original question: it has to work on ARM Cortex-M.
I did come up with an alternative solution for ARM in my original answer but omitted it, as it was not part of the question and seemed off-topic (mostly because of the C++ tag).
ARM-specific solution (Cortex-M):
Some or most Cortex-M 3/4 parts have a bit-banding region, which can be used for exactly what you need: it expands bits into 32-bit fields, and the region can be used to perform atomic bit operations.
If you put your array in a bit-banded region, it will have an 'exploded' mirror in the bit-band region, where you can use move operations on the bits themselves. If you write a loop, the compiler will surely be able to unroll it and optimize it to just move operations.
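A sketch of what that looks like in code (my illustration; the 0x20000000/0x22000000 addresses are ARM's standard SRAM bit-band mapping on Cortex-M3/M4, but check your part's reference manual, and the arrays must actually be linked into the bit-banded SRAM region):

#include <stdint.h>

#define BITBAND_SRAM(addr, bit) \
    ((volatile uint32_t *)(0x22000000u + (((uint32_t)(addr) - 0x20000000u) * 32u) + ((bit) * 4u)))

void transpose8x8_bitband(uint8_t matrix[8], uint8_t transposed[8]) {
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)  // one aliased 32-bit word per bit
            *BITBAND_SRAM(&transposed[c], r) = *BITBAND_SRAM(&matrix[r], c);
}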
If you really want to, you can even set up a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the CPU :)
Perhaps this might still help you.
This is a bit late, but I just stumbled across this interchange today. If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141. They are quite efficient: a colleague of mine obtained a factor of about 10x speedup compared to naive coding, on an x86.
Here's what I posted on github (mischasan/sse2/ssebmx.src).
Changing INP() and OUT() to use induction variables saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>

#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]

void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
    int rr, cc, i, h;
    union { __m128i x; uint8_t b[16]; } tmp;

    // Do the main body in [16 x 8] blocks:
    for (rr = 0; rr <= nrows - 16; rr += 16)
        for (cc = 0; cc < ncols; cc += 8) {
            for (i = 0; i < 16; ++i)
                tmp.b[i] = INP(rr + i, cc);
            for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
                *(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
        }
    if (rr == nrows) return;

    // The remainder is a row of [8 x 16]* [8 x 8]?
    // Do the [8 x 16] blocks:
    for (cc = 0; cc <= ncols - 16; cc += 16) {
        for (i = 8; i--;)
            tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
            tmp.b[i + 8] = h >> 8;
        for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
            OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
            OUT(rr, cc + i + 8) = h >> 8;
    }
    if (cc == ncols) return;

    // Do the remaining [8 x 8] block:
    for (i = 8; i--;)
        tmp.b[i] = INP(rr + i, cc);
    for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
        OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.
Inspired by Robert's answer, polynomial multiplication in ARM NEON can be utilised to scatter the bits:
inline poly8x16_t mull_lo(poly8x16_t a) {
    auto b = vget_low_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b, b));
}

inline poly8x16_t mull_hi(poly8x16_t a) {
    auto b = vget_high_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b, b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid vmull2 instructions, heavily using ext q0,q0,8 in place of vget_high_p8.
An iterative approach would possibly be not only faster, but would also use fewer registers and can be SIMDified for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h a i|c k|e m|g o | byte 0
// i j|k l|m n|o p b j|d l|f n|h p | byte 1
// q r|s t|u v|w x q A|s C|u E|w G | byte 2
// A B|C D|E F|G H         r B|t D|v F|x H  | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
On ARM, each step can now be composed of 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full

Creating a hash / seed value from 2+ integers (fast)

I'm working on generating different types of Gradient Noise. One of the things that this noise requires is the generation of random vectors given a position vector.
This position vector could be anything from a single int, or a 2D position, 3D position, 4D position etc.
On top of this, an additional "seed" value is needed.
What's required is a hash of these n+1 integers into a unique integer with which I can seed a PRNG. It's important that the hash depends on exactly these values, as I need to be able to reproduce the same seed every time the same values are used.
So far I've tried an implementation of Fowler–Noll–Vo; but it was way too slow for my purposes.
I've also tried using successive calls to a pairing function:
int pairing_function(int x, int y)
{
    return(0.5*(x+y)*(x+y+1) + x);
}
I.e.:
int hash = pairing_function(pairing_function(x,y),seed);
But what seems to happen is that, with a large enough seed, the values overflow the size of an int (or even larger types).
What's a good method to achieve what I'm trying to do here? What's important is speed rather than any cryptographic concerns, as well as not returning numbers larger than my original data types.
I'm using C++, but so long as any code is readable I can nut it out.
It is strange that FNV would be way too slow, because it is just one XOR and one integer multiply per byte of data. From Wikipedia, it is "designed to be fast to compute".
If you want something really quick, you can try these implementations, where the multiplication is coded as shifts and additions:
Dan Bernstein's implementation:
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
sdbm implementation (hash(i) = hash(i - 1) * 65599 + str[i]) :
static unsigned long
sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}
References "Hash Functions" from cse.yorku.ca
It sounds like FNV you used might have been inefficient because of the way it was used. Here's (I think, I haven't tested it) the same thing in a way that could be trivially inlined.
inline uint32_t hash(uint32_t h, uint32_t x) {
    for (int i = 0; i < 4; i++) {
        h ^= x & 255;
        x >>= 8;
        h = (h << 24) + h * 0x193;
    }
    return h;
}
I think calling hash(hash(2166136261, seed), x) or hash(hash(hash(2166136261, seed), x), y) should give you the same result (assuming little-endian) as a library function.
However, to speed that up at the cost of hash quality, you might try a change like this:
inline uint32_t hash(uint32_t h, uint32_t x) {
    for (int i = 0; i < 2; i++) {
        h ^= x & 65535;
        x >>= 16;
        h = (h << 24) + h * 0x193;
    }
    return h;
}
or even:
inline uint32_t hash(uint32_t h, uint32_t x) {
    h ^= x;
    h = (h << 24) + h * 0x193;
    return h;
}
These changes weaken the low-order bits somewhat, so you'll want to follow standard practice and use the high-order bits preferentially. For example, if you need only 16 bits, shift the final result right by 16 rather than masking it with 0xffff.
The h = ... line will regularly wrap around, though; it relies on the standard mod-2^32 behaviour of unsigned arithmetic. If that's a problem, you'll want to replace that line with something different and perhaps accept fewer useful bits in your hash. Maybe h = (h >> 4) + (h & 0x7fffff) * 0x193; but that's just a random tweak and I haven't checked it for hash quality.
I will challenge you on:
"So far I've tried an implementation of Fowler–Noll–Vo; but it was way too slow for my purposes."
In some simple benchmarks I've done, the FNV hash is the fastest. I assume you have benchmarks for all the hashes you've tried?
For the benchmark I simply measured the time taken for 1 billion hashes of various algorithms in MSVC++ 2013, using two 32-bit unsigned ints as input:
FNV (32-bit) = 222M hashes/sec
Your pairing_function() = 175M hashes/sec
Simple Hash x + (y << 10) = 170M hashes/sec
Your hash() function using pairing_function() = 167M hashes/sec
Dan Bernstein = 101M hashes/sec
Obviously these are very basic benchmark results, and I wouldn't necessarily trust them all that much. I wouldn't be surprised to see some algorithms run faster or slower on different platforms and compilers.
Overall, though, while FNV is the fastest in this case, there is only a factor-of-two difference between the fastest and slowest. If this really makes a difference in your case, I would suggest taking another look at your problem to see whether it can be redesigned to not need the hash, or at least to reduce its dependence on hash speed.
Note: I changed your pairing function to:
int pairing_function(int x, int y)
{
    return (x+y)*(x+y+1)/2 + x;
}
for the above benchmarks. Using your version results in a conversion to/from double, which makes it 5x slower and your hash() function 8x slower.
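If the overflow from the question is the concern, one option (my sketch, not part of the benchmarked code) is to do the intermediate arithmetic in 64-bit unsigned types, which is never undefined behaviour and simply wraps on the final narrowing:

#include <stdint.h>

uint32_t pairing_function_u64(uint32_t x, uint32_t y)
{
    uint64_t s = (uint64_t)x + y;      // no intermediate signed-int overflow
    uint64_t z = s * (s + 1) / 2 + x;  // unsigned arithmetic wraps, never UB
    return (uint32_t)z;                // keep the low 32 bits as the seed
}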
Update
For the FNV hash I found a source online and modified it to work directly on 2 integers (assumes a 32-bit integer):
#define FNV_32_PRIME 16777619u

unsigned int FNVHash32(const int input1, const int input2)
{
    unsigned int hash = 2166136261u;

    const unsigned char* pBuf = (unsigned char *) &input1;
    for (int i = 0; i < 4; ++i)
    {
        hash *= FNV_32_PRIME;
        hash ^= *pBuf++;
    }

    pBuf = (unsigned char *) &input2;
    for (int i = 0; i < 4; ++i)
    {
        hash *= FNV_32_PRIME;
        hash ^= *pBuf++;
    }

    return hash;
}
Since FNV just works on bytes you can extend this to work with any number of integers or other data.

How to hash a 96-bit struct/number?

So I can't figure out how to do this in C++. I need to do a modulus operation and integer conversion on data that is 96 bits in length.
Example:
struct Hash96bit
{
    char x[12];
};

int main()
{
    Hash96bit n;
    // set n to something
    int size = 23;
    int result = n % size; // this is what I'd like to be able to write
}
Edit: I'm trying to have a 96-bit hash because I have 3 floats which, when combined, create a unique combination. I thought that would be best to use as the hash, because you don't really have to process it at all.
Edit: Okay... so at this point I might as well explain the bigger issue. I have a 3D world that I want to subdivide into sectors, so that groups of objects can be placed in sectors, which would make frustum culling and physics iterations take less time. So at the beginning, let's say you are at sector 0,0,0. Sure, we store them all in an array, cool, but what happens when we get far away from 0,0,0? We don't care about those sectors there anymore. So we use a hashmap, since memory isn't an issue and since we will be accessing data with sector values rather than handles. Now a sector is 3 floats; hashing that could easily be done with any number of algorithms. I thought it might be better if I could just say the 3 floats together are the key and go from there; I just needed a way to mod a 96-bit number to fit it in the data segment. Anyway, I think I'm just gonna take the bottom bits of each of these floats and use a 64-bit hash, unless anyone comes up with something brilliant. Thank you for the advice so far.
UPDATE: Having just read your second edit to the question, I'd recommend you use David's Jenkins approach (which I upvoted a while back)... just point it at the lowest byte in your struct of three floats.
Regarding "Anyway I think i'm just gonna take the bottom bits of each of these floats": again, the idea with a hash function used by a hash table is not just to map each bit in the input (much less some subset of them) to a bit in the hash output. You could easily end up with a lot of collisions that way, especially if the number of buckets is not a prime number. For example, if you take 21 bits from each float, and the number of buckets happens to be 1024 currently, then after % 1024 only 10 bits from one of the floats will be used, with no regard to the values of the other floats... hash(a,b,c) == hash(d,e,c) for all c. (It's actually a little worse than that: values like 5.5, 2.75 etc. will only use a couple of bits of the mantissa.)
Since you're insisting on this (though it's very likely not what you need, and a misnomer to boot):

struct Hash96bit
{
    union {
        float f[3];
        char x[12];
        uint32_t u[3];
    };

    Hash96bit(float a, float b, float c)
    {
        f[0] = a;
        f[1] = b;
        f[2] = c;
    }

    // the operator will support your "int result = n % size;" usage...
    // (uint128_t means a 128-bit type such as GCC/Clang's unsigned __int128)
    operator uint128_t() const
    {
        return u[0] * ((uint128_t)1 << 64) + // arbitrary ordering
               u[1] * ((uint128_t)1 << 32) +
               u[2];
    }
};
You can use the Jenkins hash.

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for (hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

How to manipulate and represent binary numbers in C++

I'm currently trying to build a lookup table for a Huffman tree using a pretty simple preorder traversal algorithm, but I'm getting stuck carrying out very basic bitwise operations. The pseudocode follows:
void preOrder(huffNode *node, int bit) //not sure how to represent bit
{
    if (node == NULL)
        return;

    (1) bit = bit + 0; //I basically want to add a 0 onto this number (01 would go to 010)
    preOrder(node->getLeft(), bit);

    (2) bit = bit - 0 + 1; //This should subtract the last 0 and add a 1 (010 would go to 011)
    preOrder(node->getRight(), bit);
}
I'm getting quite confused about how to carry out the operations defined on lines (1) and (2).
What data type does one use to represent and print binary numbers? In the above example I have the number represented as an int, but I'm pretty sure that's incorrect. Also, how do you add or subtract values? I understand how & and | style logic works, but I'm getting confused as to how one carries out these sorts of operations in code.
Could anyone post some very simple examples?
Here's some basic examples of binary operations. I've used mostly in-place operations here.
int bit = 0x02; // 0010
bit |= 1; // OR 0001 -> 0011
bit ^= 1; // XOR 0001 -> 0010
bit ^= 7; // XOR 0111 -> 0101
bit &= 14; // AND 1110 -> 0100
bit <<= 1; // LSHIFT 1 -> 1000
bit >>= 2; // RSHIFT 2 -> 0010
bit = ~bit; // COMPLEMENT -> 1101
If you want to print a binary number you need to do it yourself... Here's one slightly inefficient, but moderately readable, way to do it:
char bitstr[33] = {0};
for( int b = 0; b < 32; b++ ) {
    if( bit & (1 << (31-b)) )
        bitstr[b] = '1';
    else
        bitstr[b] = '0';
}
printf( "%s\n", bitstr );
[edit] If I wanted faster code, I might pre-generate (or hardcode) a lookup table with the 8-bit sequences for all numbers from 0-255.
// This turns a 32-bit integer into a binary string.
char lookup[256][9] = {
    "00000000",
    "00000001",
    "00000010",
    "00000011",
    // ... etc (you don't want to do this by hand)
    "11111111"
};

char * lolo = lookup[val & 0xff];
char * lohi = lookup[(val>>8) & 0xff];
char * hilo = lookup[(val>>16) & 0xff];
char * hihi = lookup[(val>>24) & 0xff];

// This part is maybe a bit lazy =)
char bitstr[33];
sprintf( bitstr, "%s%s%s%s", hihi, hilo, lohi, lolo );
Instead, you could do this:
char *bits = bitstr;
while( *hihi ) *bits++ = *hihi++;
while( *hilo ) *bits++ = *hilo++;
while( *lohi ) *bits++ = *lohi++;
while( *lolo ) *bits++ = *lolo++;
*bits = 0;
Or just unroll the whole thing. ;-)
char bitstr[33] = {
hihi[0], hihi[1], hihi[2], hihi[3], hihi[4], hihi[5], hihi[6], hihi[7],
hilo[0], hilo[1], hilo[2], hilo[3], hilo[4], hilo[5], hilo[6], hilo[7],
lohi[0], lohi[1], lohi[2], lohi[3], lohi[4], lohi[5], lohi[6], lohi[7],
lolo[0], lolo[1], lolo[2], lolo[3], lolo[4], lolo[5], lolo[6], lolo[7],
0 };
Of course, those 8 bytes in the lookup are the same length as a 64-bit integer... So what about this? Much faster than all that pointless meandering through character arrays.
char bitstr[33];
__int64 * intbits = (__int64*)bitstr;
intbits[0] = *(__int64*)lookup[(val >> 24) & 0xff];
intbits[1] = *(__int64*)lookup[(val >> 16) & 0xff];
intbits[2] = *(__int64*)lookup[(val >> 8) & 0xff];
intbits[3] = *(__int64*)lookup[val & 0xff];
bitstr[32] = 0;
Naturally, in the above code you would represent your lookup values as int64 instead of strings.
Anyway, just pointing out that you can write it however is appropriate for your purposes. If you need to optimize, things get fun, but for most practical applications such optimizations are negligible or pointless.
Unless your binary sequences will get longer than the number of bits in an int, you can just use an int.
To add a 0 to the end of the current representation of a, you can use
a << 1
To replace a 0 at the end of the current representation of a with a 1, you can use
a ^= 1
Note that to use an int in this way, you will also need to keep track of where in the int your bits start, so that if you have e.g., the value 0x0, you can know which of 0, 00, 000, ... it is.
Operations in your code:
(1) bit = bit << 1;
(2) bit = bit | 1;
However, you must also keep track of the length of the sequence (a sketch follows below).
If the length of an int is good enough for you, there's no reason not to use it. However, in the Huffman algorithm it really depends on the data. C++ programmers should use boost::dynamic_bitset for bit sequences of arbitrary length. It also supports the bit operations above: http://www.boost.org/doc/libs/1_42_0/libs/dynamic_bitset/dynamic_bitset.html
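Tying (1) and (2) together with the length bookkeeping, a sketch of the traversal (my illustration: getLeft()/getRight() are from the question; isLeaf(), symbol() and record() are hypothetical stand-ins for your tree and output table):

void preOrder(huffNode *node, unsigned code, int len)
{
    if (node == NULL)
        return;
    if (node->isLeaf()) {
        record(node->symbol(), code, len); // store the code and its bit length
        return;
    }
    preOrder(node->getLeft(),  code << 1,       len + 1); // append a 0
    preOrder(node->getRight(), (code << 1) | 1, len + 1); // append a 1
}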

How to convert 8 17-bit integers into 17 8-bit integers efficiently

Okay, I have the following problem: I have a set of 8 (unsigned) numbers that are all 17 bits (a.k.a. none of them are any bigger than 131071). Since 17-bit numbers are annoying to work with (keeping them in a 32-bit int is a waste of space), I would like to turn these into 17 8-bit numbers, like so:
If I have these 8 17-bit integers:
[25409, 23885, 24721, 23159, 25409, 23885, 24721, 23159]
I would turn them into a base 2 representation:
["00110001101000001", "00101110101001101", "00110000010010001", "00101101001110111", "00110001101000001", "00101110101001101", "00110000010010001", "00101101001110111"]
Then join that into one big string:
"0011000110100000100101110101001101001100000100100010010110100111011100110001101000001001011101010011010011000001001000100101101001110111"
Then split that into 17 strings, each with 8 chars:
["00110001", "10100000", "10010111", "01010011", "01001100", "00010010", "00100101", "10100111", "01110011", "00011010", "00001001", "01110101", "00110100", "11000001", "00100010", "01011010", "01110111"]
And, finally, convert the binary representations back into integers
[49, 160, 151, 83, 76, 18, 37, 167, 115, 26, 9, 117, 52, 193, 34, 90, 119]
This method works, but it's not very efficient. I am looking for something more efficient, preferably coded in C++, since that's the language I am working with. I just can't think of any way to do this more efficiently, and 17-bit numbers aren't exactly easy to work with (16-bit numbers would be much nicer).
Thanks in advance, xfbs
Store the lowest 16 bits of each number as-is (i.e. in two bytes). This leaves the most significant bit of each number. Since there are eight such numbers, simply combine the eight bits into one extra byte.
This will require exactly the same amount of memory as your method, but will involve a lot less bit twiddling.
P.S. Regardless of the storage method, you should be using bit-manipulation operators (<<, >>, &, | and so on) to do the job; there should not be any intermediate string-based representations involved. A sketch follows below.
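For illustration, the packing just described might look like this (my sketch; the names are illustrative):

#include <stdint.h>

void pack(const uint32_t in[8], uint8_t out[17]) {
    uint8_t msbs = 0;
    for (int i = 0; i < 8; ++i) {
        out[2*i]     = in[i] & 0xFF;                  // low byte of the low 16 bits
        out[2*i + 1] = (in[i] >> 8) & 0xFF;           // high byte of the low 16 bits
        msbs |= (uint8_t)(((in[i] >> 16) & 1u) << i); // collect the 17th bits
    }
    out[16] = msbs;                                   // one extra byte for the eight MSBs
}

void unpack(const uint8_t in[17], uint32_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = in[2*i]
               | ((uint32_t)in[2*i + 1] << 8)
               | (((uint32_t)(in[16] >> i) & 1u) << 16);
}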
Have a look at std::bitset<N>. May be you can stuff them into that?
Efficiently? Then don't use string conversions, bit fields, etc. Do the shifts yourself, as below. (Note that the arrays must be unsigned so that we don't encounter problems when shifting.)
uint32 A[8]; //Your input, unsigned int
ubyte B[17]; //Output, unsigned byte
B[0] = (ubyte)A[0];
B[1] = (ubyte)(A[0] >> 8);
B[2] = (ubyte)A[1];
B[3] = (ubyte)(A[1] >> 8);
.
:
And for the last one, we do what ajx said. We take the most significant bit of each number (shifting it 16 bits to the right leaves the 17th bit) and fill the bits of our output by shifting each of the most significant bits left by 0 to 7:
B[16] = (A[0] >> 16) | ((A[1] >> 16) << 1) | ((A[2] >> 16) << 2) | ((A[3] >> 16) << 3) | ... | ((A[7] >> 16) << 7);
Well, "efficient" was this. Other easier methods exist, too.
Though you say they are 17-bit numbers, they must be stored in an array of 32-bit integers, where only the less significant 17 bits are used. You can extract two bytes directly from the first (dst[0] = src[0] >> 9 is the first, dst[1] = (src[0] >> 1) & 0xff the second); then you "push" the first bit as the 18th bit of the second, so that
dst[2] = (src[0] & 1) << 7 | src[1] >> 10;
dst[3] = (src[1] >> 2) & 0xff;
if you generalize it, you will see that this "formula" may be applied
dst[2*i] = src[i] >> (9+i) | (src[i-1] & BITS(i)) << (8-i);
dst[2*i + 1] = (src[i] >> (i+1)) & 0xff;
and for the last one: dst[16] = src[7] & 0xff;.
The whole code could look like:

dst[0] = src[0] >> 9;
dst[1] = (src[0] >> 1) & 0xff;
for(i = 1; i < 8; i++)
{
    dst[2*i] = src[i] >> (9+i) | (src[i-1] & BITS(i)) << (8-i);
    dst[2*i + 1] = (src[i] >> (i+1)) & 0xff;
}
dst[16] = src[7] & 0xff;
Likely, by analysing the loops better, optimizations can be made so that we don't need to treat the boundary cases specially. The BITS macro creates a mask of N bits set to 1 (the least significant bits). Something like (to be checked for a better way, if any):
#define BITS(I) (~((~0)<<(I)))
Addendum: here I assumed src is e.g. int32_t and dst is int8_t or similar.
This is plain C; in C++ you could use a std::vector instead.
#define srcLength 8
#define destLength 17

int src[srcLength] = { 25409, 23885, 24721, 23159, 25409, 23885, 24721, 23159 };
unsigned char dest[destLength] = { 0 };

int srcElement = 0;
int bits = 0;
int i = 0;
int j = 0;
do {
    while( bits >= srcLength ) {
        dest[i++] = srcElement >> (bits - srcLength);
        srcElement = srcElement & ((1 << bits) - 1);
        bits -= srcLength;
    }
    if( j < srcLength ) {
        srcElement <<= destLength;
        bits += destLength;
        srcElement |= src[j++];
    }
} while (bits > 0);
Disclaimer: if you literally have seventeen integers (and not 100000 groups by 17), you should forget these optimizations as long as your program doesn't run veeery slowly.
I'd probably go about it this way. I don't want to deal with weird types when I'm doing my processing; maybe I need to store them in some funky format due to legacy problems, though. The hard-coded values should probably be based off the 17 value; I just didn't bother.
struct int_block {
    static const uint32 w = 17;
    static const uint32 m = 131071;

    int_block() : data(151, 0) {} // w * 8 + (sizeof(uint32) - w)

    uint32 get(size_t i) const {
        uint32 retval = *reinterpret_cast<const uint32 *>( &data[i*w] );
        retval &= m;
        return retval;
    }

    void set(size_t i, uint32 val) {
        uint32 prev = *reinterpret_cast<const uint32 *>( &data[i*w] );
        prev &= ~m;
        val |= prev;
        *reinterpret_cast<uint32 *>( &data[i*w] ) = val;
    }

    std::vector<char> data;
};

TEST(int_block_test) {
    int_block ib;
    for (uint32 i = 0; i < 8; i++)
        ib.set(i, i+25);
    for (uint32 i = 0; i < 8; i++)
        CHECK_EQUAL(i+25, ib.get(i));
}
You'd be able to break this by giving it bad values, but I'll leave that as an exercise for the reader. :))
Quite honestly, I think you'd be happier representing them as 32-bit integers and just writing conversion functions. But I suspect you don't have control over that.