int n_attrs = some_input_from_other_function(); // [2..5000]
vector<int> corr_indexes; // size = n_attrs * n_attrs
vector<char> selected; // size = n_attrs
vector<pair<int,int>> selectedPairs; // size = n_attrs / 2
// vector::reserve everything here
...
// optimize the code below
const int npairs = n_attrs * n_attrs;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
const int x = corr_indexes[i] / n_attrs;
const int y = corr_indexes[i] % n_attrs;
if (selected[x] || selected[y]) continue; // fit inside L1 cache
// below lines are called max 2500 times, so they're insignificant
selected[x] = true;
selected[y] = true;
selectedPairs.emplace_back(x, y);
if (selectedPairs.size() == n_attrs / 2) break;
}
I have a function that looks like this. The bottleneck is in
const int x = corr_indexes[i] / n_attrs;
const int y = corr_indexes[i] % n_attrs;
n_attrs is constant during the loop, so I am looking for a way to speed this loop up. Both corr_indexes[i] and n_attrs are positive and less than INT32_MAX. Edit: please note that n_attrs isn't a compile-time constant.
How can I optimize this loop? No extra library is allowed.
Also, is there any way to parallelize this loop (either CPU or GPU is okay; everything is already in GPU memory before this loop)?
I am restricting my comments to integer division, because to first order the modulo operation in C++ can be viewed and implemented as an integer division plus a back-multiply and subtraction. In some cases there are cheaper ways of computing the modulo directly, e.g. when computing modulo 2^n.
Integer division is pretty slow on most platforms, being based on either software emulation or an iterative hardware implementation. However, microbenchmarking of Apple's M1 has been widely reported to show blazingly fast integer division, presumably implemented with dedicated circuitry.
Ever since a seminal paper by Torbjörn Granlund and Peter Montgomery almost thirty years ago, it has been widely known how to replace integer division with a constant divisor by an integer multiply plus possibly a shift and/or other correction steps. This algorithm is often referred to as the magic-multiplier technique. It requires precomputation of some relevant parameters from the integer divisor for use in the multiply-based emulation sequence.
Torbjörn Granlund and Peter L. Montgomery, "Division by invariant integers using multiplication," ACM SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72 (online).
Currently, all major toolchains incorporate variants of the Granlund-Montgomery algorithm when dealing with integer divisors that are compile-time constants. The precomputation occurs at compile time inside the compiler, which then emits code using the computed parameters. Some toolchains may also use this algorithm for divisions by run-time constant divisors that are used repeatedly. For run-time constant divisors in loops, this could involve emitting a precomputation block prior to the loop to compute the necessary parameters, and then using those in the division emulation code inside the loop.
If one's toolchain does not optimize divisions with run-time constant divisor one can use the same approach manually as demonstrated by the code below. However, this is unlikely to achieve the same efficiency as a compiler-based solution, because not all machine operations used in the desired emulation sequence can be expressed efficiently at C++ level in a portable manner. This applies in particular to arithmetic right shifts and add-with-carry.
The code below demonstrates the principle of parameter precomputation and integer division emulation via multiplication. It is quite likely that more efficient implementations of both the parameter precomputation and the emulation could be found by investing more design time than I was willing to expend for this answer.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#define PORTABLE (1)
uint32_t ilog2 (uint32_t i)
{
uint32_t t = 0;
i = i >> 1;
while (i) {
i = i >> 1;
t++;
}
return (t);
}
/* Based on: Granlund, T.; Montgomery, P.L.: "Division by Invariant Integers
using Multiplication". SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72
*/
void prepare_magic (int32_t divisor, int32_t &multiplier, int32_t &add_mask, int32_t &sign_shift)
{
uint32_t divisoru, d, n, i, j, two_to_31 = uint32_t (1) << 31;
uint64_t m_lower, m_upper, k, msb, two_to_32 = uint64_t (1) << 32;
divisoru = uint32_t (divisor);
d = (divisor < 0) ? (0 - divisoru) : divisoru;
i = ilog2 (d);
j = two_to_31 % d;
msb = two_to_32 << i;
k = msb / (two_to_31 - j);
m_lower = msb / d;
m_upper = (msb + k) / d;
n = ilog2 (uint32_t (m_lower ^ m_upper));
n = (n > i) ? i : n;
m_upper = m_upper >> n;
i = i - n;
multiplier = int32_t (uint32_t (m_upper));
add_mask = (m_upper >> 31) ? (-1) : 0;
sign_shift = int32_t ((divisoru & two_to_31) | i);
}
int32_t arithmetic_right_shift (int32_t a, int32_t s)
{
uint32_t msb = uint32_t (1) << 31;
uint32_t ua = uint32_t (a);
ua = ua >> s;
msb = msb >> s;
return int32_t ((ua ^ msb) - msb);
}
int32_t magic_division (int32_t dividend, int32_t multiplier, int32_t add_mask, int32_t sign_shift)
{
int64_t prod = int64_t (dividend) * multiplier;
int32_t quot = (int32_t)(uint64_t (prod) >> 32);
quot = int32_t (uint32_t (quot) + (uint32_t (dividend) & uint32_t (add_mask)));
#if PORTABLE
const int32_t byte_mask = 0xff;
quot = arithmetic_right_shift (quot, sign_shift & byte_mask);
#else // PORTABLE
quot = quot >> sign_shift; // must mask shift count & use arithmetic right shift
#endif // PORTABLE
quot = int32_t (uint32_t (quot) + (uint32_t (dividend) >> 31));
if (sign_shift < 0) quot = -quot;
return quot;
}
int main (void)
{
int32_t multiplier;
int32_t add_mask;
int32_t sign_shift;
int32_t divisor;
for (divisor = -20; divisor <= 20; divisor++) {
/* avoid division by zero */
if (divisor == 0) {
divisor++;
continue;
}
printf ("divisor=%d\n", divisor);
prepare_magic (divisor, multiplier, add_mask, sign_shift);
printf ("multiplier=%d add_mask=%d sign_shift=%d\n",
multiplier, add_mask, sign_shift);
printf ("exhaustive test of dividends ... ");
uint32_t dividendu = 0;
do {
int32_t dividend = (int32_t)dividendu;
/* avoid overflow in signed integer division */
if ((divisor == (-1)) && (dividend == ((-2147483647)-1))) {
dividendu++;
continue;
}
int32_t res = magic_division (dividend, multiplier, add_mask, sign_shift);
int32_t ref = dividend / divisor;
if (res != ref) {
printf ("\nERR dividend=%d (%08x) divisor=%d res=%d ref=%d\n",
dividend, (uint32_t)dividend, divisor, res, ref);
return EXIT_FAILURE;
}
dividendu++;
} while (dividendu);
printf ("PASSED\n");
}
return EXIT_SUCCESS;
}
How can I optimize this loop?
This is a perfect use-case for libdivide. This library has been designed to speed up division by a run-time constant, using the strategy compilers apply at compile time. The library is header-only, so it does not create any run-time dependency. It also supports vectorization of divisions (i.e. using SIMD instructions), which is definitely worth using here to drastically speed up the computation, and which compilers cannot do without significantly changing the loop (and even then it would be less efficient, because of the run-time-defined divisor). Note that the licence of libdivide is very permissive (zlib), so you can easily include it in your project without strong constraints (you basically just need to mark it as modified if you change it).
If header-only libraries are not OK, then you need to reinvent the wheel. The idea is to transform a division by a constant into a sequence of shifts and multiplications. The very good answer of @njuffa specifies how to do that. You can also read the code of libdivide, which is highly optimized.
For small positive divisors and small positive dividends, there is no need for a long sequence of operation. You can cheat with a basic sequence:
uint64_t dividend = corr_indexes[i]; // Must not be too big
uint64_t divider = n_attrs;
uint64_t magic_factor = 4294967296 / n_attrs + 1; // Must be precomputed once
uint32_t result = (dividend * magic_factor) >> 32;
This method should be safe for uint16_t dividends/divisors, but it is not for much bigger values. In practice, it fails for dividend values above ~800,000. Bigger dividends require a more complex sequence, which is also generally slower.
is there any way to parallelize this loop
Only the division/modulus can be safely parallelized. There is a loop-carried dependency in the rest of the loop that prevents any parallelization (unless additional assumptions are made). Thus, the loop can be split into two parts: one that computes the divisions and stores the uint16_t results in a temporary array, and one that processes that array serially. The array must not be too big, since the computation would otherwise be memory-bound and the resulting parallel code could be slower than the current one. Thus, you need to operate on small chunks that fit in at least the L3 cache. If chunks are too small, thread synchronization can also become an issue. The best solution is certainly to use a rolling window of chunks. All of this is admittedly a bit tedious/tricky to implement.
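A serial sketch of that two-phase split (the chunked layout is what would later be parallelized or vectorized; the function name and chunk size are mine, and the container types follow the question):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Serial sketch of the two-phase split. Phase 1 (div/mod) has no
// loop-carried dependency, so it is the part that could run in parallel
// or with SIMD; phase 2 must stay serial. CHUNK keeps the temporary
// array cache-resident.
std::vector<std::pair<int,int>> select_pairs(const std::vector<int>& corr_indexes,
                                             std::vector<char>& selected,
                                             int n_attrs)
{
    std::vector<std::pair<int,int>> selectedPairs;
    const int npairs = n_attrs * n_attrs;
    const int CHUNK = 4096;
    std::vector<std::pair<int,int>> xy(CHUNK);   // temporary div/mod results
    for (int base = 0; base < npairs; base += CHUNK) {
        const int end = std::min(base + CHUNK, npairs);
        // Phase 1: independent divisions (parallelizable).
        for (int i = base; i < end; ++i)
            xy[i - base] = { corr_indexes[i] / n_attrs, corr_indexes[i] % n_attrs };
        // Phase 2: inherently serial selection.
        for (int i = base; i < end; ++i) {
            const int x = xy[i - base].first, y = xy[i - base].second;
            if (selected[x] || selected[y]) continue;
            selected[x] = selected[y] = true;
            selectedPairs.emplace_back(x, y);
            if ((int)selectedPairs.size() == n_attrs / 2) return selectedPairs;
        }
    }
    return selectedPairs;
}
```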
Note that SIMD instructions can be used for the division part (easy with libdivide). You still need to split the loop and use chunks, but the chunks do not need to be big since there is no synchronization overhead. Something like 64 integers should be enough.
Note that recent processors can compute divisions like this efficiently, especially for 32-bit integers (64-bit ones tend to be significantly more expensive). This is especially the case for Alder Lake, Zen 3 and M1 (P-core) processors. Note that both the modulus and the division are computed by one instruction on x86/x86-64 processors. Also note that while division has a pretty high latency, many processors can pipeline multiple divisions to get a reasonable throughput. For example, a 32-bit div instruction has a latency of 23-28 cycles on Skylake but a reciprocal throughput of 4-6. This is apparently not the case on Zen 1/Zen 2.
I would optimize the part after // optimize the code below by:
taking n_attrs
generating a function string like this:
void dynamicFunction(MyType & selectedPairs, Foo & selected)
{
const int npairs = ## * ##;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
const int x = corr_indexes[i] / ##;
const int y = corr_indexes[i] % ##;
if (selected[x] || selected[y]) continue; // fit inside L1 cache
// below lines are called max 2500 times, so they're insignificant
selected[x] = true;
selected[y] = true;
selectedPairs.emplace_back(x, y);
if (selectedPairs.size() == ## / 2)
break;
}
}
replacing all ## with value of n_attrs
compiling it, generating a DLL
linking and calling the function
So that n_attrs is a compile-time constant value for the DLL, and the compiler can automatically apply most of its optimizations to that value, like:
computing n & (x-1) instead of n % x when x is a power-of-2 value
shifting and multiplying instead of dividing
maybe other optimizations too, like unrolling the loop with precalculated indices for x and y (since x is known)
Some integer math operations in tight-loops are easier to SIMDify/vectorize by compiler when more of the parts are known in compile-time.
If your CPU is AMD, you can even try magic floating-point operations in place of unknown/unknown division to get vectorization.
By caching all (or big percentage of) values of n_attrs, you can get rid of latencies of:
string generation
compiling
file(DLL) reading (assuming some object-oriented wrapping of DLLs)
If the part to be optimized will run on a GPU, there is a high probability that the CUDA/OpenCL implementation already performs the integer division by means of floating-point operations (to keep the SIMD path occupied instead of serializing on integer division), or handles it directly as SIMD integer operations, so you may be able to use it as-is on the GPU. The exception is std::vector, which is not supported by all C++ CUDA compilers (and not in OpenCL kernels); those host-environment-related parts could be computed after the kernel (with the emplace_back parts excluded, or exchanged for a struct that works on the GPU) has executed.
So the actual best solution in my case.
Instead of representing index = row * n_cols + col, do index = (row << 16) | col for 32-bit indices, or index = (row << 32) | col for 64-bit. Unpacking is then just shifts and masks: in the 64-bit case row = index >> 32 and col = index & 0xFFFFFFFF, and in the 32-bit case row = index >> 16 and col = index & 0xFFFF. Or even better, just uint16_t* pairs = reinterpret_cast<uint16_t*>(index_array); then pairs[i], pairs[i+1] for each even i form a pair.
This is assuming the number of rows/columns is less than 2^16 (or 2^32).
I'm still keeping the top answer because it still answers the case where division has to be used.
I'm working on a hash table in the C language and I'm testing a hash function for strings.
The first function I tried adds up the ASCII codes and uses modulo (% 100), but I got poor results with the first test of data: 40 collisions for 130 words.
The final input data will contain 8,000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of each word in a .txt file.
Which is the best algorithm for hashing strings?
And how do I determine the size of the hash table?
I've had nice results with djb2 by Dan Bernstein.
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
// rol is assumed to be a rotate-left helper along these lines:
static int rol(int value, int n) {
    unsigned v = (unsigned)value;
    return (int)((v << n) | (v >> (32 - n)));
}

int hash(char const *input) {
    int result = 0x55555555;
    while (*input) {
        result ^= *input++;
        result = rol(result, 5);
    }
    return result;   // the original snippet was missing this return
}
Edit: Also note that 10,000 slots is rarely a good choice for a hash table size. You usually want one of two things: either a prime number as the size (required to ensure correctness with some types of collision resolution) or a power of 2 (so reducing a hash value to the correct range can be done with a simple bit-mask).
I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran different little hashing functions on a list of 466K English words to see the number of collisions for each:
Hash function | Collisions | Time (words) | Time (file)
=================================================================
CRC32 | 23 (0.005%) | 112 ms | 38 ms
MurmurOAAT | 26 (0.006%) | 86 ms | 10 ms
FNV hash | 32 (0.007%) | 87 ms | 7 ms
Jenkins OAAT | 36 (0.008%) | 90 ms | 8 ms
DJB2 hash | 344 (0.074%) | 87 ms | 5 ms
K&R V2 | 356 (0.076%) | 86 ms | 5 ms
Coffin | 763 (0.164%) | 86 ms | 4 ms
x17 hash | 2242 (0.481%) | 87 ms | 7 ms
-----------------------------------------------------------------
MurmurHash3_x86_32 | 19 (0.004%) | 90 ms | 3 ms
I included time for both: hashing all words individually and hashing the entire file of all English words once. I also included a more complex MurmurHash3_x86_32 into my test for reference.
Conclusion:
there is almost no point in using the popular DJB2 hash function for strings on Intel x86-64 (or AArch64, for that matter) architectures, because it has many more collisions than similar functions (MurmurOAAT, FNV and Jenkins OAAT) while having very similar throughput. Bernstein's DJB2 performs especially badly on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.
Test code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#define MAXLINE 2048
#define SEED 0x12345678
uint32_t DJB2_hash(const uint8_t *str)
{
uint32_t hash = 5381;
uint8_t c;
while ((c = *str++))
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
uint32_t FNV(const void* key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
h ^= 2166136261UL;
const uint8_t* data = (const uint8_t*)key;
for(int i = 0; i < len; i++)
{
h ^= data[i];
h *= 16777619;
}
return h;
}
uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
// One-byte-at-a-time hash based on Murmur's mix
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
for (; *str; ++str) {
h ^= *str;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint32_t KR_v2_hash(const char *s)
{
// Source: https://stackoverflow.com/a/45641002/5407270
uint32_t hashval = 0;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval;
}
uint32_t Jenkins_one_at_a_time_hash(const char *str, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += str[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
uint32_t crc32b(const uint8_t *str) {
// Source: https://stackoverflow.com/a/21001712
unsigned int byte, crc, mask;
int i = 0, j;
crc = 0xFFFFFFFF;
while (str[i] != 0) {
byte = str[i];
crc = crc ^ byte;
for (j = 7; j >= 0; j--) {
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
i = i + 1;
}
return ~crc;
}
static inline uint32_t _rotl32(uint32_t x, int32_t bits) /* static: avoids a link error if not inlined */
{
return x<<bits | x>>(32-bits); // C idiom: will be optimized to a single operation
}
uint32_t Coffin_hash(char const *input) {
// Source: https://stackoverflow.com/a/7666668/5407270
uint32_t result = 0x55555555;
while (*input) {
result ^= *input++;
result = _rotl32(result, 5);
}
return result;
}
uint32_t x17(const void * key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
const uint8_t * data = (const uint8_t*)key;
for (int i = 0; i < len; ++i)
{
h = 17 * h + (data[i] - ' ');
}
return h ^ (h >> 16);
}
uint32_t apply_hash(int hash, const char* line)
{
switch (hash) {
case 1: return crc32b((const uint8_t*)line);
case 2: return MurmurOAAT_32(line, SEED);
case 3: return FNV(line, strlen(line), SEED);
case 4: return Jenkins_one_at_a_time_hash(line, strlen(line));
case 5: return DJB2_hash((const uint8_t*)line);
case 6: return KR_v2_hash(line);
case 7: return Coffin_hash(line);
case 8: return x17(line, strlen(line), SEED);
default: break;
}
return 0;
}
int main(int argc, char* argv[])
{
// Read arguments
if (argc != 3) { fprintf(stderr, "usage: %s hash_id file\n", argv[0]); return 1; }
const int hash_choice = atoi(argv[1]);
char const* const fn = argv[2];
// Read file
FILE* f = fopen(fn, "r");
if (!f) { perror(fn); return 1; }
// Read file line by line, calculate hash
char line[MAXLINE];
while (fgets(line, sizeof(line), f)) {
line[strcspn(line, "\n")] = '\0'; // strip newline
uint32_t hash = apply_hash(hash_choice, line);
printf("%08x\n", hash);
}
fclose(f);
return 0;
}
P.S. A more comprehensive review of the speed and quality of modern hash functions can be found in the SMHasher repository of Reini Urban (rurban). Notice the "Quality problems" column in the table.
Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for (hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}
There are a number of existing hashtable implementations for C, from the C standard library hcreate/hdestroy/hsearch, to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
djb2 has 317 collisions for this 466k english dictionary while MurmurHash has none for 64 bit hashes, and 21 for 32 bit hashes (around 25 is to be expected for 466k random 32 bit hashes).
My recommendation is to use MurmurHash if available; it is very fast because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend using Murmur's one-byte-at-a-time version:
uint32_t inline MurmurOAAT32(const char* key)
{
    uint32_t h(3323198485ul);
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

uint64_t inline MurmurOAAT64(const char* key)
{
    uint64_t h(525201411107845655ull);
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e9955bd1e995;
        h ^= h >> 47;
    }
    return h;
}
The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we don't usually know, or want to look up, how much memory we have available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster, but at rapidly diminishing returns; making it smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2 - sqrt(2) = 0.58..., apparently.
djb2 is good
Though djb2, as presented on Stack Overflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too. One of the K&R hashes is terrible, and one is probably pretty good:
An apparently terrible hash algorithm, as presented in K&R 1st edition: it is simply a summation of all bytes in the string (source):
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 0;  /* was 'unsigned int', mismatching the return type */
    int c;

    while ((c = *str++))
        hash += c;

    return hash;
}
A probably pretty decent hash algorithm, as presented in K&R 2nd edition (verified by me on p. 144 of the book). NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend making the return type and "hashval" unsigned long, or even better uint32_t or uint64_t, instead of plain unsigned (int). This simple algorithm takes the byte order of the string into account by computing hashval = new_byte + 31*hashval for each byte in the string:
unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;

    return hashval % HASHSIZE;
}
Note that it's clear from the two algorithms that one reason the 1st-edition hash is so terrible is that it does NOT take string character order into consideration, so hash("ab") would return the same value as hash("ba"). This is not so with the 2nd-edition hash, however, which would (much better!) return two different values for those strings.
The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.
The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.
This is a partial answer to the question of which hash functions the GCC C++11 library uses, stating that GCC uses an implementation of "MurmurHashUnaligned2" by Austin Appleby (http://murmurhash.googlepages.com/).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):
Code:
// Implementation of Murmur hash for 32-bit size_t.
// (unaligned_load is a small helper defined earlier in the same file;
// it reads a size_t's worth of bytes from buf, alignment-safely.)
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
const size_t m = 0x5bd1e995;
size_t hash = seed ^ len;
const char* buf = static_cast<const char*>(ptr);
// Mix 4 bytes at a time into the hash.
while (len >= 4)
{
size_t k = unaligned_load(buf);
k *= m;
k ^= k >> 24;
k *= m;
hash *= m;
hash ^= k;
buf += 4;
len -= 4;
}
// Handle the last few bytes of the input array.
switch (len)
{
case 3:
hash ^= static_cast<unsigned char>(buf[2]) << 16;
[[gnu::fallthrough]];
case 2:
hash ^= static_cast<unsigned char>(buf[1]) << 8;
[[gnu::fallthrough]];
case 1:
hash ^= static_cast<unsigned char>(buf[0]);
hash *= m;
};
// Do a few final mixes of the hash.
hash ^= hash >> 13;
hash *= m;
hash ^= hash >> 15;
return hash;
}
MurmurHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 std::unordered_map<> hash used above.
Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
See also
Other hash table algorithms to try out and test: http://www.cse.yorku.ca/~oz/hash.html. Hash algorithms mentioned there:
djb2
sdbm
lose lose (K&R 1st edition)
First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.
A hash function with a good reputation is MurmurHash3.
Finally, regarding the size of the hash table, it really depends on what kind of hash table you have in mind, especially whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
I have tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long, consisting of 64 characters in a different order, with 32-bit hash values. Code from here.
Hash function | collision rate | how many minutes to finish
==============================================================
MurmurHash3 | 6.?% | 4m15s
Jenkins One.. | 6.1% | 6m54s
Bob, 1st in link | 6.16% | 5m34s
SuperFastHash | 10% | 4m58s
bernstein | 20% | 14s only finish 1/20
one_at_a_time | 6.16% | 7m5s
crc | 6.16% | 7m56s
One strange thing is that almost all the hash functions have a 6% collision rate for my data.
One thing I've used with good results is the following (I don't know if it's mentioned already because I can't remember its name).
You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible. If you really do run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
As we know, if n is not a perfect square, then sqrt(n) is not an integer. Since I need only the integer part, I feel that calling sqrt(n) wouldn't be that fast, as it takes time to calculate the fractional part as well.
So my question is,
Can we get only the integer part of sqrt(n) without calculating the actual value of sqrt(n)? Can the algorithm be faster than sqrt(n) (as defined in <math.h> or <cmath>)?
If possible, you can write the code in asm block also.
I would try the Fast Inverse Square Root trick.
It's a way to get a very good approximation of 1/sqrt(n) without any branching, based on some bit-twiddling, so it's not portable (notably between 32-bit and 64-bit platforms).
Once you get it, you just need to invert the result and take the integer part.
There might be faster tricks, of course, since this one is a bit of a roundabout route.
EDIT: let's do it!
First a little helper:
// benchmark.h
#include <sys/time.h>
template <typename Func>
double benchmark(Func f, size_t iterations)
{
f();
timeval a, b;
gettimeofday(&a, 0);
for (; iterations --> 0;)
{
f();
}
gettimeofday(&b, 0);
return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
(a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}
Then the main body:
#include <iostream>
#include <cmath>
#include <cassert>   // for the assert below
#include <cstdlib>   // for atoi
#include "benchmark.h"
class Sqrt
{
public:
Sqrt(int n): _number(n) {}
int operator()() const
{
double d = _number;
return static_cast<int>(std::sqrt(d) + 0.5);
}
private:
int _number;
};
// http://www.codecodex.com/wiki/Calculate_an_integer_square_root
class IntSqrt
{
public:
IntSqrt(int n): _number(n) {}
int operator()() const
{
int remainder = _number;
if (remainder < 0) { return 0; }
int place = 1 <<(sizeof(int)*8 -2);
while (place > remainder) { place /= 4; }
int root = 0;
while (place)
{
if (remainder >= root + place)
{
remainder -= root + place;
root += place*2;
}
root /= 2;
place /= 4;
}
return root;
}
private:
int _number;
};
// http://en.wikipedia.org/wiki/Fast_inverse_square_root
class FastSqrt
{
public:
FastSqrt(int n): _number(n) {}
int operator()() const
{
float number = _number;
float x2 = number * 0.5F;
float y = number;
int i = *(int*)&y; // note: 'long' here over-reads the 4-byte float on LP64 platforms
//i = 0x5fe6ec85e7de30da - (i >> 1); // (the analogous 64-bit/double constant)
i = 0x5f3759df - (i >> 1);
y = *(float*)&i;
y = y * (1.5F - (x2*y*y));
y = y * (1.5F - (x2*y*y)); // let's be precise
return static_cast<int>(1/y + 0.5f);
}
private:
int _number;
};
int main(int argc, char* argv[])
{
if (argc != 3) {
std::cerr << "Usage: %prog integer iterations\n";
return 1;
}
int n = atoi(argv[1]);
int it = atoi(argv[2]);
assert(Sqrt(n)() == IntSqrt(n)() &&
Sqrt(n)() == FastSqrt(n)() && "Different Roots!");
std::cout << "sqrt(" << n << ") = " << Sqrt(n)() << "\n";
double time = benchmark(Sqrt(n), it);
double intTime = benchmark(IntSqrt(n), it);
double fastTime = benchmark(FastSqrt(n), it);
std::cout << "Number iterations: " << it << "\n"
"Sqrt computation : " << time << "\n"
"Int computation : " << intTime << "\n"
"Fast computation : " << fastTime << "\n";
return 0;
}
And the results:
sqrt(82) = 9
Number iterations: 4096
Sqrt computation : 56
Int computation : 217
Fast computation : 119
// Note had to tweak the program here as Int here returns -1 :/
sqrt(2147483647) = 46341 // real answer sqrt(2 147 483 647) = 46 340.95
Number iterations: 4096
Sqrt computation : 57
Int computation : 313
Fast computation : 119
Where, as expected, the Fast computation performs much better than the Int computation.
Oh, and by the way, sqrt is faster :)
Edit: this answer is foolish - use (int) sqrt(i)
After profiling with proper settings (-march=native -m64 -O3) the above was a lot faster.
Alright, a bit old question, but the "fastest" answer has not been given yet. The fastest (I think) is the Binary Square Root algorithm, explained fully in this Embedded.com article.
It basically comes down to this:
unsigned short isqrt(unsigned long a) {
unsigned long rem = 0;
int root = 0;
int i;
for (i = 0; i < 16; i++) {
root <<= 1;
rem <<= 2;
rem += a >> 30;
a <<= 2;
if (root < rem) {
root++;
rem -= root;
root++;
}
}
return (unsigned short) (root >> 1);
}
On my machine (Q6600, Ubuntu 10.10) I profiled this by taking the square root of the numbers 1-100000000. Using isqrt(i) took 2750 ms. Using (unsigned short) sqrt((float) i) took 3600 ms. This was done using g++ -O3. Using the -ffast-math compile option the times were 2100 ms and 3100 ms respectively. Note this is without using even a single line of assembler, so it could probably still be much faster.
The above code works for both C and C++ and with minor syntax changes also for Java.
What works even better for a limited range is a binary search. On my machine this blows the version above out of the water by a factor of 4. Sadly, it's very limited in range:
#include <stdint.h>
const uint16_t squares[] = {
0, 1, 4, 9,
16, 25, 36, 49,
64, 81, 100, 121,
144, 169, 196, 225,
256, 289, 324, 361,
400, 441, 484, 529,
576, 625, 676, 729,
784, 841, 900, 961,
1024, 1089, 1156, 1225,
1296, 1369, 1444, 1521,
1600, 1681, 1764, 1849,
1936, 2025, 2116, 2209,
2304, 2401, 2500, 2601,
2704, 2809, 2916, 3025,
3136, 3249, 3364, 3481,
3600, 3721, 3844, 3969,
4096, 4225, 4356, 4489,
4624, 4761, 4900, 5041,
5184, 5329, 5476, 5625,
5776, 5929, 6084, 6241,
6400, 6561, 6724, 6889,
7056, 7225, 7396, 7569,
7744, 7921, 8100, 8281,
8464, 8649, 8836, 9025,
9216, 9409, 9604, 9801,
10000, 10201, 10404, 10609,
10816, 11025, 11236, 11449,
11664, 11881, 12100, 12321,
12544, 12769, 12996, 13225,
13456, 13689, 13924, 14161,
14400, 14641, 14884, 15129,
15376, 15625, 15876, 16129,
16384, 16641, 16900, 17161,
17424, 17689, 17956, 18225,
18496, 18769, 19044, 19321,
19600, 19881, 20164, 20449,
20736, 21025, 21316, 21609,
21904, 22201, 22500, 22801,
23104, 23409, 23716, 24025,
24336, 24649, 24964, 25281,
25600, 25921, 26244, 26569,
26896, 27225, 27556, 27889,
28224, 28561, 28900, 29241,
29584, 29929, 30276, 30625,
30976, 31329, 31684, 32041,
32400, 32761, 33124, 33489,
33856, 34225, 34596, 34969,
35344, 35721, 36100, 36481,
36864, 37249, 37636, 38025,
38416, 38809, 39204, 39601,
40000, 40401, 40804, 41209,
41616, 42025, 42436, 42849,
43264, 43681, 44100, 44521,
44944, 45369, 45796, 46225,
46656, 47089, 47524, 47961,
48400, 48841, 49284, 49729,
50176, 50625, 51076, 51529,
51984, 52441, 52900, 53361,
53824, 54289, 54756, 55225,
55696, 56169, 56644, 57121,
57600, 58081, 58564, 59049,
59536, 60025, 60516, 61009,
61504, 62001, 62500, 63001,
63504, 64009, 64516, 65025
};
inline int isqrt(uint16_t x) {
const uint16_t *p = squares;
if (p[128] <= x) p += 128;
if (p[ 64] <= x) p += 64;
if (p[ 32] <= x) p += 32;
if (p[ 16] <= x) p += 16;
if (p[ 8] <= x) p += 8;
if (p[ 4] <= x) p += 4;
if (p[ 2] <= x) p += 2;
if (p[ 1] <= x) p += 1;
return p - squares;
}
A 32 bit version can be downloaded here: https://gist.github.com/3481770
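If hand-maintaining the 256-entry table is a concern, it can also be generated at startup. This sketch (the names sq_tab, init_squares, and isqrt16 are mine, not from the answer) keeps the branchless descent unchanged:

```cpp
#include <cstdint>

// Build the 256-entry table of squares at startup instead of hardcoding it.
static uint16_t sq_tab[256];

void init_squares() {
    for (uint32_t i = 0; i < 256; i++)
        sq_tab[i] = (uint16_t)(i * i);
}

// Branchless binary search over the table, same as the version above:
// finds the largest index whose square is <= x.
int isqrt16(uint16_t x) {
    const uint16_t *p = sq_tab;
    if (p[128] <= x) p += 128;
    if (p[ 64] <= x) p += 64;
    if (p[ 32] <= x) p += 32;
    if (p[ 16] <= x) p += 16;
    if (p[  8] <= x) p += 8;
    if (p[  4] <= x) p += 4;
    if (p[  2] <= x) p += 2;
    if (p[  1] <= x) p += 1;
    return (int)(p - sq_tab);
}
```

The one-time fill costs 256 multiplications, which should be negligible next to keeping the table in source form.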
While I suspect you can find a plenty of options by searching for "fast integer square root", here are some potentially-new ideas that might work well (each independent, or maybe you can combine them):
Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it. The resulting index in the array is the square root.
Convert the number to floating point and break it into mantissa and exponent. Halve the exponent and multiply the mantissa by some magic factor (your job to find it). This should be able to give you a very close approximation. Include a final step to adjust it if it's not exact (or use it as a starting point for the binary search above).
If you don't mind an approximation, how about this integer sqrt function I cobbled together.
int sqrti(int x)
{
    union { float f; int x; } v;

    // convert to float
    v.f = (float)x;

    // fast approx sqrt
    // assumes float is in IEEE 754 single-precision format
    // assumes int is 32 bits
    // b = exponent bias (127)
    // m = number of mantissa bits (23)
    v.x -= 1 << 23; // subtract 2^m
    v.x >>= 1;      // divide by 2
    v.x += 1 << 29; // add ((b + 1) / 2) * 2^m

    // convert back to int
    return (int)v.f;
}
It uses the algorithm described in this Wikipedia article.
On my machine it's almost twice as fast as sqrt :)
To do integer sqrt you can use this specialization of Newton's method:
def isqrt(N):
    a = 1
    b = N
    while abs(a - b) > 1:
        b = N / a
        a = (a + b) / 2
    return a
Basically, for any guess x the sqrt lies in the range (x ... N/x), so we just bisect that interval on every loop for the new guess. It's sort of like binary search, but it converges much faster.
This converges in O(log log N), which is very fast. It also doesn't use floating point at all, and it will work well for arbitrary-precision integers too.
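Rendered as real C++, the pseudocode needs two small additions: a guard for N < 2 (otherwise N / a can divide by zero for N = 0) and a final correction, since the loop can stop one above the floor (for N = 8 it settles at a = 3). The function name isqrt_newton is mine:

```cpp
// Newton/bisection integer sqrt per the pseudocode above, with a
// division-by-zero guard and an off-by-one correction at the end.
unsigned isqrt_newton(unsigned N) {
    if (N < 2) return N;
    unsigned a = 1, b = N;
    while ((a > b ? a - b : b - a) > 1) { // |a - b| > 1, without signed overflow
        b = N / a;
        a = (a + b) / 2;
    }
    // a is now floor(sqrt(N)) or ceil(sqrt(N)); step down in the latter case
    return (unsigned long long)a * a > N ? a - 1 : a;
}
```

For N = 8 the loop ends at a = 3 and the correction returns 2, the true floor.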
This is so short that it gets inlined 99% of the time:
static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"   // clear xmm0 for cvtsi2ss
        "cvtsi2ss %1, %%xmm0\n\t"   // convert num to float, put it in xmm0
        "sqrtss %%xmm0, %%xmm0\n\t" // square root of xmm0
        "cvttss2si %%xmm0, %0"      // truncating float-to-int
        : "=r"(i) : "r"(num) : "%xmm0"); // i: result, num: input, xmm0: scratch register
    return i;
}
Why clean xmm0? Documentation of cvtsi2ss
The destination operand is an XMM register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords are left unchanged.
GCC Intrinsic version (runs only on GCC):
#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v4sf xmm0 = {0, 0, 0, 0};
    xmm0 = __builtin_ia32_cvtsi2ss(xmm0, num);
    xmm0 = __builtin_ia32_sqrtss(xmm0);
    return __builtin_ia32_cvttss2si(xmm0);
}
Intel Intrinsic version (tested on GCC, Clang, ICC):
#include <xmmintrin.h>
int sqrtn2(int num) {
    register __m128 xmm0 = _mm_setzero_ps();
    xmm0 = _mm_cvt_si2ss(xmm0, num);
    xmm0 = _mm_sqrt_ss(xmm0);
    return _mm_cvtt_ss2si(xmm0);
}
All of the above require only SSE 1 (not even SSE 2).
Note: This is exactly how GCC calculates (int) sqrt((float) num) with -Ofast. If you want higher accuracy for larger inputs, we can calculate (int) sqrt((double) num) instead (as noted by Gumby The Green in the comments):
static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"   // clear xmm0 for cvtsi2sd
        "cvtsi2sd %1, %%xmm0\n\t"   // convert num to double
        "sqrtsd %%xmm0, %%xmm0\n\t" // double-precision square root
        "cvttsd2si %%xmm0, %0"      // truncating double-to-int
        : "=r"(i) : "r"(num) : "%xmm0");
    return i;
}
or
#include <emmintrin.h> // the double-precision conversions are SSE 2

int sqrtn2(int num) {
    register __v2df xmm0 = {0, 0};
    xmm0 = __builtin_ia32_cvtsi2sd(xmm0, num);
    xmm0 = __builtin_ia32_sqrtsd(xmm0);
    return __builtin_ia32_cvttsd2si(xmm0);
}
The following solution computes the integer part, meaning floor(sqrt(x)) exactly, with no rounding errors.
Problems With Other Approaches
using float or double is neither portable nor precise enough
orlp's isqrt gives insane results like isqrt(100) = 15
approaches based on huge lookup tables are not practical beyond 32 bits
using a fast inverse sqrt is very imprecise, you're better off using sqrtf
Newton's approach requires expensive integer division and a good initial guess
My Approach
Mine is based on the bit-guessing approach proposed on Wikipedia. Unfortunately the pseudo-code provided on Wikipedia has some errors so I had to make some adjustments:
// C++20 also provides std::bit_width in its <bit> header
#include <type_traits> // std::enable_if_t, std::is_unsigned_v

unsigned char bit_width(unsigned long long x) {
    return x == 0 ? 1 : 64 - __builtin_clzll(x); // returning 1 for 0 keeps the loop below well-defined
}

template <typename Int, std::enable_if_t<std::is_unsigned_v<Int>, int> = 0>
Int sqrt(const Int n) {
    unsigned char shift = bit_width(n);
    shift += shift & 1; // round up to next multiple of 2
    Int result = 0;
    do {
        shift -= 2;
        result <<= 1; // make space for the next guessed bit
        result |= 1;  // guess that the next bit is 1
        result ^= result * result > (n >> shift); // revert if guess too high
    } while (shift != 0);
    return result;
}
bit_width can be evaluated in constant time and the loop will iterate at most ceil(bit_width / 2) times. So even for a 64-bit integer, this will be at worst 32 iterations of basic arithmetic and bitwise operations.
The compiled output is only around 20 instructions.
Performance
I have benchmarked my methods against float-based ones by generating inputs uniformly. Note that in the real world most inputs would be much closer to zero than to std::numeric_limits<...>::max().
for uint32_t this performs about 25x worse than using std::sqrt(float)
for uint64_t this performs about 30x worse than using std::sqrt(double)
Accuracy
This method is always perfectly accurate, unlike approaches using floating point math.
Using sqrtf can provide incorrect rounding in the [2^28, 2^32) range. For example, sqrtf(0xffffffff) = 65536, when the square root is actually 65535.99999.
Double precision doesn't work consistently for the [2^60, 2^64) range. For example, sqrt(0x3fff...) = 2147483648, when the square root is actually 2147483647.999999.
The only thing that covers all 64-bit integers is x86 extended-precision long double, simply because its 64-bit mantissa can fit an entire 64-bit integer.
Conclusion
As I said, this is the only solution that handles all inputs correctly, avoids integer division, and doesn't require lookup tables.
In summary, if you need a method that is independent of precision and doesn't require gigantic lookup tables, this is your only option.
It might be especially useful in a constexpr context where performance isn't critical and where it could be much more important to get a 100% accurate result.
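Picking up the constexpr remark: here is a self-contained constexpr sketch of the same bit-guessing loop. The names bit_width64 and isqrt_bits are mine, and the portable while-loop bit width avoids __builtin_clzll so the whole thing is constant-evaluable on any C++14 compiler:

```cpp
#include <cstdint>

// Portable bit width; returns 0 for 0, which the while-loop below handles.
constexpr int bit_width64(uint64_t x) {
    int w = 0;
    while (x) { w++; x >>= 1; }
    return w;
}

// constexpr variant of the bit-guessing sqrt above, fixed to uint64_t.
constexpr uint64_t isqrt_bits(uint64_t n) {
    int shift = bit_width64(n);
    shift += shift & 1; // round up to an even bit count
    uint64_t result = 0;
    while (shift > 0) {
        shift -= 2;
        result <<= 1; // make space for the next guessed bit
        result |= 1;  // guess that the next bit is 1
        if (result * result > (n >> shift))
            result ^= 1; // revert if the guess was too high
    }
    return result;
}

// The whole computation can happen at compile time:
static_assert(isqrt_bits(0) == 0, "");
static_assert(isqrt_bits(99) == 9, "");
static_assert(isqrt_bits(100) == 10, "");
```

The static_asserts demonstrate the compile-time use; at runtime the function behaves identically.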
Alternative Approach Using Newton's Method
Newton's method can be quite fast when starting with a good guess. For our guess, we will round down to the next power of 2 and compute the square root in constant time. For any number 2^x, we can obtain the square root as 2^(x/2).
template <typename Int, std::enable_if_t<std::is_unsigned_v<Int>, int> = 0>
Int sqrt_guess(const Int n)
{
    Int log2floor = bit_width(n) - 1;
    // sqrt(x) is equivalent to pow(2, x / 2), i.e. x >> 1
    // pow(2, x) is equivalent to 1 << x
    return Int(1) << (log2floor >> 1); // Int(1) so the shift happens in the wide type
}
Note that this is not exactly 2^(x/2), because we lost some precision during the right shift. Instead it is 2^floor(x/2).
Also note that sqrt_guess(0) = 1 which is actually necessary to avoid division by zero in the first iteration:
template <typename Int, std::enable_if_t<std::is_unsigned_v<Int>, int> = 0>
Int sqrt_newton(const Int n)
{
    Int a = sqrt_guess(n);
    Int b = n;
    // compute the unsigned difference |a - b|
    while (std::max(a, b) - std::min(a, b) > 1) {
        b = n / a;
        a = (a + b) / 2;
    }
    // a is now either floor(sqrt(n)) or ceil(sqrt(n));
    // we decrement in the latter case
    // this is overflow-safe as long as we start with a lower-bound guess
    return a - (a * a > n);
}
This alternative approach performs roughly on par with the first proposal, but is usually a few percent faster. However, it relies heavily on efficient hardware division, and results can vary heavily between platforms.
The use of sqrt_guess makes a huge difference. It is roughly five times faster than using 1 as the initial guess.
In many cases an exact integer sqrt value is not even needed; a good approximation of it is enough. (For example, this often happens in DSP optimization, when a 32-bit signal should be compressed to 16 bits, or 16-bit to 8-bit, without losing much precision around zero.)
I've found this useful equation:
k = ceil(MSB(n)/2)                          // MSB(n) is the most significant bit of "n"
sqrt(n) ~= 2^(k-2) + (2^(k-1)) * n / 2^(2*k) // all multiplications and divisions here are DSP-friendly, as they are by powers of two
This equation generates a smooth curve (n, sqrt(n)); its values are not very different from the real sqrt(n), so it can be useful when approximate accuracy is enough.
Why has nobody suggested the quickest method?
If:
the range of numbers is limited
memory consumption is not crucial
application launch time is not critical
then create an int[MAX_X] array filled (on launch) with sqrt(x) (you don't even need the sqrt() function to fill it).
All these conditions fit my program quite well.
Particularly, an int[10000000] array is going to consume 40MB.
What are your thoughts on this?
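The table can indeed be filled without calling sqrt(): bump the running root whenever the next perfect square is reached. A sketch of that fill (the name make_sqrt_table and the uint16_t element type are my choices, not from the answer):

```cpp
#include <cstdint>
#include <vector>

// Fill tab[x] = floor(sqrt(x)) for x in [0, max_x) with no sqrt() calls:
// the root increases exactly when x reaches the next perfect square.
std::vector<uint16_t> make_sqrt_table(uint32_t max_x) {
    std::vector<uint16_t> tab(max_x);
    uint32_t root = 0;
    uint32_t next_square = 1; // always (root + 1)^2
    for (uint32_t x = 0; x < max_x; x++) {
        if (x == next_square) {
            root++;
            next_square = (root + 1) * (root + 1);
        }
        tab[x] = (uint16_t)root;
    }
    return tab;
}
```

Lookups afterwards are a single indexed load, which is hard to beat if the range and memory budget allow it.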
On my computer with gcc, with -ffast-math, converting a 32-bit integer to float and using sqrtf takes 1.2 s per 10^9 ops (without -ffast-math it takes 3.54 s).
The following algorithm uses 0.87 s per 10^9 ops at the expense of some accuracy: errors can be as large as -7 or +1, although the RMS error is only 0.79:
uint16_t SQRTTAB[65536];

inline uint16_t approxsqrt(uint32_t x) {
    const uint32_t m1 = 0xff000000;
    const uint32_t m2 = 0x00ff0000;
    if (x & m1) {        // top byte set: index with the top 16 bits
        return SQRTTAB[x >> 16];
    } else if (x & m2) { // second byte set: index with bits 8-23, rescale
        return SQRTTAB[x >> 8] >> 4;
    } else {             // small values: index directly, rescale
        return SQRTTAB[x] >> 8;
    }
}
The table is constructed using:
void maketable() {
    for (int x = 0; x < 65536; x++) {
        double v = x / 65535.0;
        v = sqrt(v);
        int y = int(v * 65535.0 + 0.999); // round up slightly to bias the error
        SQRTTAB[x] = y;
    }
}
I found that refining the bisection using further if statements does improve accuracy, but it also slows things down to the point that sqrtf is faster, at least with -ffast-math.
Or just do a binary search; I can't write a simpler version, imo:
uint16_t sqrti(uint32_t num)
{
    uint16_t ret = 0;
    for (int32_t i = 15; i >= 0; i--)
    {
        // temp must be 32-bit: temp * temp can reach 65535^2, which would
        // overflow (undefined behavior) if done in plain int after promotion
        uint32_t temp = ret | (1u << i);
        if (temp * temp <= num)
        {
            ret = (uint16_t)temp;
        }
    }
    return ret;
}
If you need performance when computing square roots, I guess you will compute a lot of them.
Then why not cache the answers? I don't know the range of N in your case, nor whether you will compute the square root of the same integer many times, but if so, you can cache the result each time your method is called (an array would be the most efficient, if the range is not too large).
This is an addition for those in need of a precise square root for very large integers. The trick is to leverage the fast floating-point square root of modern processors and to fix up the round-off errors.
#include <cmath> // std::sqrt
#include <bit>   // std::bit_width (C++20)

template <typename T>
T preciseIntegerSqrt(T n)
{
    if (sizeof(T) <= 4)
    {
        return std::sqrt((double)n);
    }
    else if (sizeof(T) <= 8)
    {
        T r = std::sqrt((double)n);
        return r - (r * r - 1 >= n); // correct a possible round-up
    }
    else
    {
        if (n == 0) return 0;
        T r = 0;
        for (T b = T(1) << ((std::bit_width(n) - 1) / 2); b != 0; b >>= 1)
        {
            T const k = (b + 2 * r) * b;
            r |= (n >= k) * b;
            n -= (n >= k) * k;
        }
        return r;
    }
}
Explanation: Integers of up to 32 bits do not need a correction, since they can be represented precisely as double-precision floating-point numbers. 64-bit integers get along with a very cheap correction. For the general case, refer to Jan Schultke's excellent answer. The code provided here is very slightly faster than that one (10% on my machine; may vary with integer type and hardware).