I have a hashmap containing about half a million entries. The key is a string built by concatenating 5 different inputs. The domain of each individual input is small, but the combination of the 5 inputs produces this huge map (500K items). Now I am thinking of optimizing this structure.
My idea is to hash the key (the combination of 5 inputs) by hashing each individual input, combining those 5 hashes into one single hash (a 32- or 64-bit int), and then looking up that hash.
My question is: is there a known data structure that handles this situation well, and is the optimization worth doing? I want to optimize both memory and run-time.
I am using C++ and std::unordered_map; the key is the combined string from the 5 inputs and the output is random - I didn't find any relation between the inputs and the outputs (random or serial).
125 458 699 sadsadasd 5 => 56.
125 458 699 sadsadasd 3 => 57.
125 458 699 sadsadasd 4 => 58.
125 458 699 sadsadasd 5 => 25.
125 458 699 gsdfsds 3 => 89.
The domain of each of the inputs is small (the 4th input has 2K different values, while the other inputs can only have about 20 different values each).
You could use GNU gperf to generate a perfect hash function for your keys.
It seems to me that there is no way to reduce the size of your keys that still allows reliable lookups. Hashing the 5 inputs into 1 integer is a lossy, one-way mapping: two different keys can end up with the same combined hash, which prevents reliable lookups.
The way round this would be to keep a translation table, but that's actually more overhead because each distinct tuple of inputs would require the storage for 2 hashes and the tuple.
I think you're best off using a std::tuple<int, int, int, std::string, int> as the key type in a single map.
If you use a std::map<tuple<>, data_type> you won't need to provide a hashing function. If you stay with the unordered_map, you'll need to provide one since std::tuple has no default hash<> specialisation.
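If you do stay with unordered_map, a hash for that tuple can be built by combining the std::hash of each field. A minimal sketch, where the combine step follows the well-known boost::hash_combine recipe and the Key/KeyHash names are my own:

#include <cstddef>
#include <functional>
#include <string>
#include <tuple>
#include <unordered_map>

using Key = std::tuple<int, int, int, std::string, int>;

struct KeyHash {
    static void combine(std::size_t& seed, std::size_t h) {
        // boost::hash_combine-style mixing
        seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    std::size_t operator()(const Key& k) const {
        std::size_t seed = std::hash<int>()(std::get<0>(k));
        combine(seed, std::hash<int>()(std::get<1>(k)));
        combine(seed, std::hash<int>()(std::get<2>(k)));
        combine(seed, std::hash<std::string>()(std::get<3>(k)));
        combine(seed, std::hash<int>()(std::get<4>(k)));
        return seed;
    }
};

// usage: no string concatenation, and equality still compares the real fields
std::unordered_map<Key, int, KeyHash> table;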
Related
I have 100 values in my configuration file.
Each value is built of 2 chars, for example 90, AA, 04, TR or FE.
I want to generate a hash code for each value and store them in an array of 100 elements, so that each value from the configuration is saved at its hash-code index in the array.
The question:
How do I create a hash code from 2 chars such that the hash code is limited to the range 0 to 99?
What you need in your specific case (mapping a fixed set of 2-byte sequences to consecutive numbers) is called perfect hashing.
While you could implement it yourself, there's an open-source tool called gperf which can generate the code for you:
There are options for generating C or C++ code, for emitting switch statements or nested ifs instead of a hash table, and for tuning the algorithm employed by gperf.
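Since the whole key space is only two bytes, a direct-indexed table over all 65536 possible byte pairs is itself a trivially perfect hash. Here is a minimal sketch of building such a mapping at start-up, in case the 100 codes are only known when the configuration is loaded (the function and variable names are my own):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack the two characters of a code like "AA" or "04" into one 16-bit index.
inline std::uint16_t pack2(const std::string& code) {
    return static_cast<std::uint16_t>(
        (static_cast<unsigned char>(code[0]) << 8) |
         static_cast<unsigned char>(code[1]));
}

// Map each known 2-char code to its position 0..99; -1 marks unknown codes.
std::vector<int> build_index(const std::vector<std::string>& codes) {
    std::vector<int> index(65536, -1);
    for (std::size_t i = 0; i < codes.size(); ++i)
        index[pack2(codes[i])] = static_cast<int>(i);
    return index;
}

Lookup is then a single array access; what gperf buys you on top of this is doing the work at compile time and producing a much smaller table.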
So my professor just assigned this homework assignment. I know my fair share of hashing techniques, but I have absolutely no idea how to go about not losing a lot of points due to collisions, because 1 million strings will practically force collisions in my hash table.
What should I focus on?
Creating a really good re-hashing technique to detect when a collision occurs and appropriately re-hash
Focus on how to convert the strings into unique integers so as to avoid collisions, using some kind of prime-number-based modulus.
Or maybe I'm just misunderstanding the assignment completely. How would you guys go about solving this? Any ideas would be really helpful.
The task is to create a hash function with zero collisions. TonyD just calculated the expected number of collisions to be 116. According to the grading scheme, you will get zero points for a hash function with 116 collisions.
The professor gave a hint to use unordered_map, which doesn't help with designing hash functions. It may be a trick question...
How would you design a function which returns a repeatable, unique number for 1 million inputs?
Your teacher's asking you to hash 1 million strings and you have 2^32 = 4,294,967,296 distinct 32-bit integer values available.
With 20-character random strings, there are massively more possible strings than hash values, so you can't map specific strings onto specific hash values in a way that limits the collision potential (if you had <= 2^32 potential strings - say because the string length was shorter, or the values each character was allowed to take were restricted - you'd have a chance at a perfect hash function: a formula mapping each string to a known distinct number).
So, you're basically left having to map strings to hash values randomly but repeatably. The "Birthday Paradox" then kicks in, meaning you must expect quite a lot of collisions. How many? Well, this answer provides the formula - for m buckets (2^32) and n inserts (1,000,000):
expected collisions = n - m * (1 - ((m-1)/m)^n)
= 1,000,000 - 2^32 * (1 - ((2^32 - 1) / 2^32) ^ 1,000,000)
= 1,000,000 - 2^32 * (1 - 0.99976719645926983712557804052625)
~= 1,000,000 - 999883.6
~= 116.4
Put another way, the very best possible hash function would on average - for random string inputs - still have 116 collisions.
Your teacher says:
final score for you is max{0, 200 – 5*T}
So, there's no point doing the assignment: you're more likely to have a 24 carat gold meteor land in your front garden than get a positive score.
That said, if you want to achieve the lowest number of collisions for the class, a lowish-performance (not particularly cache friendly) but minimal collision option is simply to have an array of random data...
uint32_t data[20][256] = { ... };
Download some genuinely random data from an Internet site to populate it. Discard any duplicate numbers (in C++, you can use a std::set<> to find them). Index by character position (0..19), then by character value, generating your hash by XORing the values.
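This scheme is essentially tabulation hashing. A minimal sketch - for brevity the table is filled from std::random_device here rather than from downloaded random data, and the function names are mine:

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

std::uint32_t data[20][256];            // one random word per (position, byte value)

void fill_data() {                      // call once at start-up
    std::random_device rd;
    for (auto& row : data)
        for (auto& cell : row)
            cell = rd();
}

std::uint32_t tabulation_hash(const std::string& s) {
    std::uint32_t h = 0;
    for (std::size_t i = 0; i < s.size() && i < 20; ++i)
        h ^= data[i][static_cast<unsigned char>(s[i])];
    return h;
}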
Illustration of collisions
If unconvinced by the information above, you can generate a million random 32-bit values - as if they were hashes of distinct strings - and see how often the hash values repeat. Any given run should produce output not too far from the 116 collision average calculated above.
#include <iostream>
#include <map>
#include <random>

int main()
{
    std::random_device rd;
    std::map<unsigned, int> count;
    for (int i = 0; i < 1000000; ++i)
        ++count[rd()];

    std::map<int, int> histogram;
    for (auto& c : count)
        ++histogram[c.second];

    for (auto& h : histogram)
        std::cout << h.second << " hash values generated by " << h.first << " key(s)\n";
}
A few runs produced output...
$ ./poc
999752 hash values generated by 1 key(s)
124 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999796 hash values generated by 1 key(s)
102 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999784 hash values generated by 1 key(s)
108 hash values generated by 2 key(s)
$ ./poc
999744 hash values generated by 1 key(s)
128 hash values generated by 2 key(s)
My question concerns the trade-off between execution speed and memory usage when designing a class that will be instantiated thousands or millions of times and used differently in different contexts.
So I have a class that contains a bunch of numerical properties (stored in int and double). A simple example would be
class MyObject
{
public:
    double property1;
    double property2;
    ...
    double property14;
    int property15;
    int property16;
    ...
    int property25;

    MyObject();
    ~MyObject();
};
This class is used by different programs that instantiate
std::vector<MyObject> SetOfMyObjects;
that may contain as many as a few million elements. The thing is that, depending on the context, some or many properties may remain unused (we do not need to compute them in that given context), implying that memory for millions of useless ints and doubles is allocated. As I said, which properties are useful and which are useless depends on the context, and I would like to avoid writing a different class for each specific context.
So I was thinking about using std::maps to allocate memory only for the properties I actually use. For example
class MyObject
{
public:
    std::map<std::string, double> properties_double;
    std::map<std::string, int> properties_int;

    MyObject();
    ~MyObject();
};
such that if "property1" has to be computed, it would be stored as
MyObject myobject;
myobject.properties_double["property1"] = the_value;
Obviously, I would define proper "set" and "get" methods.
I understand that accessing elements in a std::map goes as the logarithm of its size, but since the number of properties is quite small (about 25), I suppose that this should not slow down the execution of the code too much.
Am I overthinking this? Do you think using std::map is a good idea? Any suggestions from more seasoned programmers would be appreciated.
I don't think this is your best option; for 25 elements, you will not benefit that much from using a map in terms of lookup performance. Also, it depends on what kinds of properties you are going to have: if it is a fixed set of properties, as in your example, string lookup would be a waste of memory and CPU cycles. You could go for an enum of all properties, or just an integer, and use a sequential container for the properties each element has. For such a small number of possible properties, lookup time will be lower than a map's because of cache friendliness and integer comparisons, and memory usage will be lower too. For such a small set of properties this solution is marginally better.
Then there is the problem that an int is usually half the size of a double, and they are different types, so it is not directly possible to store both in a single container. But you could reserve enough space for a double in each element and either use a union or simply read/write an int from/to the address of the double if the property "index" is larger than 14.
So you can have something as simple as:
struct Property {
    int type;                 // which property this entry holds
    union {
        int d_int;            // used for types 15 - 25
        double d_double;      // used for types 1 - 14
    };
};

class MyObject {
    std::vector<Property> properties;
};
And for types 1 to 14 you read the d_double field, for types 15 to 25 the d_int field.
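For example, a getter can be a short linear scan over the (at most ~25) entries. A sketch of the idea - the helper name is mine, and it takes the property vector directly (with the struct restated) so the snippet stands alone:

#include <vector>

struct Property {
    int type;                          // 1 - 14 => d_double, 15 - 25 => d_int
    union { int d_int; double d_double; };
};

// Returns true and fills `out` if a property of the given type is present.
bool get_double(const std::vector<Property>& properties, int type, double& out) {
    for (const Property& p : properties) {
        if (p.type == type) {
            out = p.d_double;
            return true;
        }
    }
    return false;                      // property not set on this object
}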
BENCHMARKS!!!
Out of curiosity I did some testing: creating 250k objects, each with 5 int and 5 double properties, using a vector, a map, and a hash for the properties. I measured memory usage and the time taken to set and get the properties, ran each test 3 times in a row to see the impact of caching, and calculated a checksum for the getters to verify consistency. Here are the results:
vector   | iteration | memory usage MB | time msec | checksum
setting        0              32             54
setting        1              32             13
setting        2              32             13
getting        0              32             77          3750000
getting        1              32             77          3750000
getting        2              32             77          3750000

map      | iteration | memory usage MB | time msec | checksum
setting        0             132            872
setting        1             132            800
setting        2             132            800
getting        0             132            800          3750000
getting        1             132            799          3750000
getting        2             132            799          3750000

hash     | iteration | memory usage MB | time msec | checksum
setting        0             155            797
setting        1             155            702
setting        2             155            702
getting        0             155            705          3750000
getting        1             155            705          3750000
getting        2             155            706          3750000
As expected, the vector solution is by far the fastest and most efficient; although it is the most affected by a cold cache, even running cold it is way faster than the map or hash implementation.
On a cold run, the vector implementation is 16.15 times faster than map and 14.75 times faster than hash. On a warm run it is even faster - 61 times faster and 54 times faster respectively.
As for memory usage, the vector solution is far more efficient as well, using over 4 times less memory than the map solution and almost 5 times less than the hash solution.
As I said, it is marginally better.
To clarify, the "cold run" is not only the first run but also the one inserting the actual values into the properties, so it is fairly illustrative of the insert overhead. None of the containers used preallocation; they all used their default expansion policies. As for the memory usage, it may not reflect actual memory usage 100% accurately, since I use the entire working set for the executable and there is usually some preallocation taking place at the OS level as well; it will most likely be more conservative as the working set increases. Last but not least, the map and hash solutions are implemented using a string lookup, as the OP originally intended, which is why they are so inefficient. Using integers as keys in the map and hash produces far more competitive results:
vector   | iteration | memory usage MB | time msec | checksum
setting        0              32             55
setting        1              32             13
setting        2              32             13
getting        0              32             77          3750000
getting        1              32             77          3750000
getting        2              32             77          3750000

map      | iteration | memory usage MB | time msec | checksum
setting        0              47             95
setting        1              47             11
setting        2              47             11
getting        0              47             12          3750000
getting        1              47             12          3750000
getting        2              47             12          3750000

hash     | iteration | memory usage MB | time msec | checksum
setting        0              68             98
setting        1              68             19
setting        2              68             19
getting        0              68             21          3750000
getting        1              68             21          3750000
getting        2              68             21          3750000
Memory usage is much lower for the hash and map, while still higher than for the vector. But in terms of performance the tables are turned: while the vector solution still wins at inserts, at reading and writing the map solution takes the trophy. So there's the trade-off.
As for how much memory is saved compared to having all the properties as object members, by just a rough calculation, it would take about 80 MB of RAM to have 250k such objects in a sequential container. So you save like 50 MB for the vector solution and almost nothing for the hash solution. And it goes without saying - direct member access would be much faster.
TL;DR: it's not worth it.
From carpenters we get: measure twice, cut once. Apply it.
Your 25 ints and doubles will occupy, on an x86_64 processor:
14 double: 112 bytes (14 * 8)
11 int: 44 bytes (11 * 4)
for a total of 156 bytes.
A std::pair<std::string, double> will, on most implementations, consume:
24 bytes for the string
8 bytes for the double
and a node in the std::map<std::string, double> will add at least 3 pointers (1 parent, 2 children) and a red-black flag for another 24 bytes.
That's at least 56 bytes per property.
Even with a 0-overhead allocator, any time you store 3 elements or more in this map you use more than 156 bytes...
A compressed (type, property) pair will occupy:
8 bytes for the property (double is the worst case)
8 bytes for the type (you can choose a smaller type, but alignment kicks in)
for a total of 16 bytes per pair. Much better than map.
Stored in a vector, this will mean:
24 bytes of overhead for the vector
16 bytes per property
Even with a 0-overhead allocator, any time you store 9 elements or more in this vector you use more than 156 bytes.
You know the solution: split that object.
You're looking up objects by name that you know will be there. So look them up by name.
I understand that accessing elements in a std::map goes as the logarithm of its size, but since the number of properties is quite small (about 25), I suppose that this should not slow down the execution of the code too much.
You will slow your program down by more than one order of magnitude. A map lookup may be O(log N), but it's really O(log N) * C, and C will be huge compared to direct access of properties (thousands of times slower).
implying that the memory for millions of useless int and double is allocated
A std::string is at least 24 bytes on all the implementations I can think of - assuming you keep the names of the properties short (google 'short string optimisation' for details).
Unless 60% of your properties are unpopulated, there is no saving from using a map keyed by string at all.
With so many objects, and a small map object in each, you may hit another problem - memory fragmentation. It could be useful to have a std::vector with std::pair<key, value> in it instead and do the lookup yourself (a binary search should be sufficient, but it depends on your situation; it could be cheaper to do a linear lookup and not sort the vector at all). For the property key I would use an enum instead of a string, unless the latter is dictated by an interface (which you did not show).
Just an idea (not compiled/tested):
#include <map>
#include <utility>

struct property_type
{
    enum { kind_int, kind_double } k;
    union { int i; double d; };
};

enum prop : unsigned char { height, width, /*...*/ };

typedef std::map< std::pair< int /*data index*/, prop /*property index*/ >, property_type > map_type;

class data_type
{
    map_type m;
public:
    double& get_double( int i, prop p )
    {
        // invariants...
        return m[ std::pair<int, prop>( i, p ) ].d;
    }
};
Millions of ints and doubles is still only hundreds of megabytes of data. On a modern computer that may not be a huge issue.
The map route looks like it will be a waste of time, but there is an alternative you could use that saves memory while retaining decent performance characteristics: store the details in a separate vector and store an index into this vector (or -1 for unassigned) in your main data type. Unfortunately, your description doesn't really indicate what the property usage actually looks like, but I'm going to guess you can sub-divide into properties that are needed for every node and properties that are always, or usually, set together. Let's say you subdivide into four sets: A, B, C and D. The As are needed for every node, whereas B, C and D are rarely set, but the elements of each set are typically modified together. Then modify the struct you're storing like so:
struct myData {
    int A1;
    double A2;
    int B_lookup = -1;
    int C_lookup = -1;
    int D_lookup = -1;
};

struct myData_B {
    int B1;
    double B2;
    // etc.
};

// and for C and D
and then store 4 vectors in your main class. When a property among the Bs is accessed, you add a new myData_B to the vector of Bs (actually a deque might be a better choice, retaining fast access but without the same memory fragmentation issues) and set the B_lookup value in the original myData to the index of the new myData_B. And the same for the Cs and Ds.
Whether this is worth doing depends on how few of the properties you actually access and how you access them together, but you should be able to modify the idea to your tastes.
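A sketch of what the lazy side-storage access could look like, using the deque suggested above; the class and function names are mine, and myData / myData_B are as defined earlier:

#include <cstddef>
#include <deque>
#include <vector>

// Assumes the myData / myData_B definitions shown above.
class Container {
public:
    std::vector<myData> nodes;      // the As plus the lookup indices
    std::deque<myData_B> b_storage; // side storage for the rarely used Bs

    // Returns the B-block of node i, allocating it on first access.
    myData_B& get_B(std::size_t i) {
        myData& d = nodes[i];
        if (d.B_lookup < 0) {
            d.B_lookup = static_cast<int>(b_storage.size());
            b_storage.push_back(myData_B());
        }
        return b_storage[d.B_lookup];
    }
};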
I have a use case where a set of strings will be searched for a particular string, s. The percent of hits or positive matches for these searches will be very high. Let's say 99%+ of the time, s will be in the set.
I'm using boost::unordered_set right now, and even with its very fast hash algorithm, it takes about 600 ms on a VM (around 40 ms on good hardware) to search the set 500,000 times. Yeah, that's pretty good, but unacceptable for what I'm working on.
So, is there any sort of data structure optimized for a high percentage of hits? I cannot precompute the hashes for the strings coming in, so I think I'm looking at a complexity of O(average string length) for a hash set like boost::unordered_set. I looked at tries; these would probably perform well in the opposite case, where hits are rare, but not really any better than hash sets.
edit: some other details with my particular use case:
the number of strings in the set is around 5,000. The longest string is probably no more than 200 chars. Search gets called again and again with the same strings, but they are coming in from an outside system and I cannot predict what the next string will be. The exact match rate is actually 99.975%.
edit2: I did some of my own benchmarking
I collected 5,000 of the strings that occur in the real system. I created two scenarios.
1) I loop over the list of known strings and do a search for each of them in the container. I do this for 500,000 searches ("hits").
2) I loop through a set of strings known not to be in the container, for 500,000 searches ("misses").
(Note - I'm interested in hashing the data in reverse because eyeballing my data, I noticed that there are a lot of common prefixes and the suffixes differ - at least that is what it looks like.)
Tests were done on a VirtualBox CentOS 5.6 VM running on a MacBook host.
                                                                            hits (ms)   misses (ms)
boost::unordered_set with default hash and no reserved size:                  591.15        441.39
tr1::unordered_set with default hash:                                         191.09        143.80
boost::unordered_set with a reserve size set:                                 579.31        431.54
boost::unordered_set w/custom hash (hash on the last 15 chars + str size):    357.34        812.13
boost::unordered_set w/custom hash (hash on the last 25 chars + str size):    362.60        795.33
trie:                                                                        1809.34         58.11
trie with reversed insertion/search:                                         2806.26        311.14
In my tests, where there are a lot of matches, the tr1 set is the best. Where there are a lot of misses, the Trie wins big.
My test loop looks like this, where function_set is the container being tested, loaded with the 5,000 strings, and functions is a vector of either all the strings in the container or a bunch of strings that are not in the container:
while (searched < kTotalSearches) {
    for (std::vector<std::string>::const_iterator i = functions.begin(); i != functions.end(); ++i) {
        function_set.count(*i);
        searched++;
        if (searched == kTotalSearches)
            break;
    }
}
std::cout << searched << " searches." << std::endl;
I'm pretty sure that a trie is what you are looking for. You are guaranteed not to go down a number of nodes greater than the length of your string. Once you've reached a leaf, there might be some linear search if there are collisions for this particular node; it depends on how you build it. Since you are using a set, I would assume that this is not a problem.
The unordered_set will have a complexity of at worst O(n), but n in this case is the number of nodes that you have (500k) and not the number of characters you are searching for (probably less than 500k).
After edit:
Maybe what you really need is a cache of the results after your search algo succeeded.
This question piqued my curiosity so I did a few tests to satisfy myself with the following results. A few general notes:
The usual caveats about benchmarking apply (don't trust my numbers, do your own benchmarks with your specific use case and data, etc...).
Tests were done using MSVS C++ 2010 (speed optimized, release build).
Benchmarks were run using 10 million loops to improve timing accuracy.
Names were generated by randomly concatenating 20 different string fragments into strings ranging from 4 to 65 characters in length.
Names included only letters and some tests (trie) were case-insensitive for simplicity, though there's no reason the methods can't be extended to include other characters.
Tests try to match the 99.975% hit rate given in the question.
Test Descriptions
Basic description of the tests run with the relevant details:
String Iteration -- Simply iterates through the function name for a baseline time comparison.
Map -- std::unordered_map<std::string, int>
Set -- std::unordered_set<std::string>
BoostSet -- boost::unordered_set<std::string>, v1.47.0
CharMap -- std::unordered_map<const char*, int>
CharSet -- std::unordered_set<const char*>
FastMap -- Simply a std::unordered_map<> using a custom FNV-1a hash algorithm.
FastSet -- Simply a std::unordered_set<> using a custom FNV-1a hash algorithm.
CustomMap -- A basic hash map I wrote myself years ago.
Trie -- A standard trie downloaded from Google code.
CustomTrie -- A bare-bones trie I wrote myself.
BinarySearch -- Using std::binary_search() on a sorted std::vector<std::string>.
SortArrayMap -- An attempt to use a size_t VectorIndex[26][26][26][26][26] array to index into a sorted array.
PerfectMap -- A std::unordered_map<> using a perfect hash from gperf.
PerfectWordSet -- Using the gperf-generated in_word_set() function directly.
PerfectWordSetFunc -- Same as PerfectWordSet but called in a function instead of inline.
PerfectWordSetThread -- Same as PerfectWordSet but work is split into N threads (standard Window threads). No synchronization is used except for waiting for the threads to finish.
Test Results (Mostly Hits)
Results sorted from slowest to fastest (for the case of mostly hits, ~99.975%):
Trie -- 9100 ms
SortArrayMap -- 6600 ms
PerfectWordSetFunc -- 4050 ms
CustomTrie -- 3470 ms
BinarySearch -- 3420 ms
CustomMap -- 2700 ms
CharSet -- 1300 ms
CharMap -- 1300 ms
BoostSet -- 1200 ms
FastSet -- 970 ms
FastMap -- 930 ms
Original Poster -- 800 ms (estimated)
Set -- 730 ms
Map -- 690 ms
PerfectMap -- 650 ms
PerfectWordSet -- 500 ms
PerfectWordSetThread(1) -- 500 ms
StringIteration -- 350 ms
PerfectWordSetThread(2) -- 260 ms
PerfectWordSetThread(4) -- 150 ms
PerfectWordSetThread(32) -- 125 ms
PerfectWordSetThread(8) -- 120 ms
PerfectWordSetThread(16) -- 110 ms
Test Results (Mostly Misses)
Results sorted from slowest to fastest (for the case of mostly misses, ~0.1% hits):
BinarySearch -- ? (took too long)
SortArrayMap -- 8050 ms
Trie -- 3200 ms
CustomMap -- 1700 ms
BoostSet -- 920 ms
CustomTrie -- 850 ms
FastMap -- 590 ms
FastSet -- 580 ms
CharSet -- 550 ms
CharMap -- 550 ms
StringIteration -- 350 ms
Set -- 330 ms
Map -- 330 ms
PerfectMap -- 280 ms
PerfectWordSet -- 140 ms
PerfectWordSetThread(1) -- 130 ms
PerfectWordSetThread(2) -- 75 ms
PerfectWordSetThread(4) -- 45 ms
PerfectWordSetThread(32) -- 45 ms
PerfectWordSetThread(8) -- 40 ms
PerfectWordSetThread(16) -- 35 ms
Discussion
My first guess was that a trie would be a good fit for this sort of thing, but from the results the opposite actually appears to be true. Thinking about it some more, this makes sense, and it is for the same reasons one should not use a linked list.
I assume you may be familiar with the table of latencies that every programmer should know. In your case you have 500k lookups executing in 40ms, or 80ns/lookup. At that scale you easily lose if you have to access anything not already in the L1/L2 cache. A trie is really bad for this as you have an indirect and probably non-local memory access for every character. Given the size of the trie in this case I couldn't figure any way of getting the entire trie to fit in cache to improve performance (though it may be possible). I still think that even if you did get the trie to fit entirely in L2 cache you would lose with all the indirection required.
The std::unordered_ containers actually do a very good job of things out of the box. In fact, in trying to speed them up I actually made them slower (in the poorly named FastMap and FastSet trials).
Same thing with trying to switch from std::string to const char * (about twice as slow).
The boost::unordered_set<> was twice as slow as the std::unordered_set<> and I don't know if that is because I just used the built-in hash function, was using a slightly old version of boost, or something else. Have you tried std::unordered_set<> yourself?
By using gperf you can easily create a perfect hash function if your set of strings is known at compile time. You could probably create a perfect hash at runtime as well, depending on how often new strings are added to the map. This gets you a 23% speed increase over the standard map implementation.
The PerfectWordSetThread tests simply use the perfect hash and split the work up into 1-32 threads. This problem is perfectly parallel (at least the benchmark is), so you get almost a 5x performance boost in the 16-thread case. This works out to only 6.3 ms per 500k lookups, or 13 ns per lookup... a mere 50 cycles on a 4 GHz processor.
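The benchmark used native Windows threads, but the partitioning itself is easy to sketch with std::thread. This is my own scaffolding rather than the benchmark code, and the lookup is passed in as a callable (e.g. a lambda wrapping gperf's generated function):

#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Count how many queries satisfy `lookup`, splitting the work across `nthreads` workers.
template <typename Lookup>
std::size_t parallel_count(const std::vector<std::string>& queries,
                           unsigned nthreads, Lookup lookup) {
    std::vector<std::size_t> counts(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // each worker reads a disjoint, strided slice: no locking needed
            for (std::size_t i = t; i < queries.size(); i += nthreads)
                if (lookup(queries[i]))
                    ++counts[t];
        });
    }
    for (std::thread& w : workers) w.join();

    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total;
}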
The StringIteration case really points out how difficult it is going to be to get much faster. Just iterating the string being found takes 350 ms, or 70% of the time compared to the 500 ms map case. Even if you could perfectly guess each string you would still need this 350 ms (for 10 million lookups) to actually compare and verify the match.
Edit: Another thing that illustrates how tight things are is the difference between the PerfectWordSetFunc at 4050 ms and PerfectWordSet at 500 ms. The only difference between the two is that one is called in a function and one is called inline. Calling it as a function reduces the speed by a factor of 8. In basic pseudo-code this is just:
bool IsInPerfectWordSet (string Match)
{
    return in_word_set(Match);
}

// Inline benchmark: PerfectWordSet
for i = 1 to 10,000,000
{
    if (in_word_set(SomeString)) ++MatchCount;
}

// Function call benchmark: PerfectWordSetFunc
for i = 1 to 10,000,000
{
    if (IsInPerfectWordSet(SomeString)) ++MatchCount;
}
This really highlights the difference in performance that inline code/functions can make. You also have to be careful in making sure what you are measuring in a benchmark. Sometimes you would want to include the function call overhead, and sometimes not.
Can You Get Faster?
I've learned to never say "no" to this question, but at some point the effort may not be worth it. If you can split the lookups into threads and use a perfect, or near-perfect, hash function you should be able to approach 100 million lookup matches per second (probably more on a machine with multiple physical processors).
A couple ideas I don't have the knowledge to attempt:
Assembly optimization using SSE
Use the GPU for additional throughput
Change your design so you don't need fast lookups
Take a moment to consider #3....the fastest code is that which never needs to run. If you can reduce the number of lookups, or reduce the need for an extremely high throughput, you won't need to spend time micro-optimizing the ultimate lookup function.
If the set of strings is fixed at compile time (e.g. it is a dictionary of known human words), you could perhaps use a perfect hash algorithm and generate it with gperf.
Otherwise, you might perhaps use an array of 26 hash tables, indexed by the first letter of the word to hash.
BTW, using a sorted array of these strings with dichotomic (binary-search) access might be faster (since log2 5000 is about 13), or a std::map or a std::set.
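The sorted-array variant is only a few lines with the standard library; a small sketch (the function name is mine):

#include <algorithm>
#include <string>
#include <vector>

// `names` must be sorted once up front with std::sort(names.begin(), names.end()).
bool contains(const std::vector<std::string>& names, const std::string& s) {
    // about log2(5000) ~ 13 string comparisons per query
    return std::binary_search(names.begin(), names.end(), s);
}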
Lastly, you might define your own hashing function: perhaps, in your particular case, hashing only the first 16 bytes could be enough!
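A custom hash can be dropped into the set's Hash template parameter. Here is a sketch that applies FNV-1a to at most the first 16 bytes plus the length; the functor name and the choice of constants are mine, and given the common prefixes you observed, hashing a suffix the same way might distribute better:

#include <cstddef>
#include <string>
#include <unordered_set>

struct Prefix16Hash {
    std::size_t operator()(const std::string& s) const {
        // FNV-1a over at most the first 16 bytes, then mix in the length
        std::size_t h = 0xcbf29ce484222325ULL;
        std::size_t n = s.size() < 16 ? s.size() : 16;
        for (std::size_t i = 0; i < n; ++i) {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 0x100000001b3ULL;
        }
        return h ^ s.size();
    }
};

std::unordered_set<std::string, Prefix16Hash> names;  // equality still compares full strings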
If the set of strings is fixed, you could consider generating a dichotomic search over it (e.g. code a script that generates a function with 5000 tests, of which only about log2 5000 ≈ 13 get executed).
Also, even if the set of strings is slightly variable (e.g. it changes from one program run to the next but stays constant during a single run), you might even consider generating the function (by emitting C++ code, then compiling it) on the fly and dlopen-ing it.
You really should benchmark and try several solutions! It probably is more an engineering issue than an algorithmic one.
I'm writing an interpreter for a language.
There is a problem: I want to create a dictionary type where you can put a value of any type in by index - any type meaning simple (int, float, string) or complex (list, array, dictionary) of simple types, or complex of complex types, and so on. That is the same as in the Python language.
What algorithm of hash-function should I use?
For strings there are many example hash functions - the simplest: the sum of all characters multiplied by 31, taken modulo HASH_SIZE (a prime number).
But for DIFFERENT TYPES, I think, it must be a more complicated algorithm.
I found SHA-256, but I don't know how to use its unsigned char[32] result for addressing a hash table - such a table would be far bigger than the RAM in a computer.
Thank you.
There are hash tables in C++11, the newest C++ standard: std::unordered_map and std::unordered_set.
EDIT:
Since every type has a different distribution, usually every type has its own hash function. This is how it's done in Java (the .hashCode() method inherited from Object), in C#, in C++11 and in many other implementations.
EDIT2:
Typical hash function does two things:
1.) Create a representation of the object as a natural number (this is what .hashCode() in Java does).
For example - string "CAT" can be transformed to:
67 * 256^2 + 65 * 256^1 + 84 = 4407636
2.) Map this number to position in array.
One way to do this is:
integer_part(fractional_part(k*4407636)*m)
where k is a constant (Donald Knuth, in The Art of Computer Programming, recommends (sqrt(5)+1)/2), m is the size of your hash table, and fractional_part and integer_part (obviously) give the fractional and integer parts of a real number.
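As a concrete sketch of that formula (using the constant quoted above; std::modf does the integer/fractional split, and the function name is mine):

#include <cmath>
#include <cstddef>

// Map a non-negative integer key k to a bucket index in [0, m)
// using the multiplicative method described above.
std::size_t bucket(unsigned long long k, std::size_t m) {
    const double A = (std::sqrt(5.0) + 1.0) / 2.0;    // the constant quoted above
    double intpart;
    double frac = std::modf(static_cast<double>(k) * A, &intpart);   // fractional part
    return static_cast<std::size_t>(frac * static_cast<double>(m));  // integer part of frac * m
}

// e.g. bucket(4407636, HASH_SIZE) gives the array position for "CAT"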
In your hash table implementation you need to handle collisions, especially when there are many more possible keys than the size of your hash table.
EDIT3:
I read more on the subject, and it looks like
67 * 256^2 + 65 * 256^1 + 84 = 4407636
is a really bad way to compute a hash code, because "somethingAAAAAABC" and "AAAAAABC" give exactly the same hash code (with fixed-width integer arithmetic, only the last few characters end up influencing the result).
Well, a common approach is to define the hash function as a method belonging to the type.
That way you can call different algorithms for different types through a common API.
That, of course, entails defining wrapper classes for every basic "C type" that you want to use in your interpreter.
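A minimal sketch of that layout - all class names here are my own illustration, not from any particular interpreter:

#include <cstddef>
#include <functional>
#include <string>

// Common API: every value the interpreter can store knows how to hash itself.
struct Value {
    virtual ~Value() {}
    virtual std::size_t hash() const = 0;
    virtual bool equals(const Value& other) const = 0;
};

struct IntValue : Value {
    int v;
    explicit IntValue(int x) : v(x) {}
    std::size_t hash() const override { return std::hash<int>()(v); }
    bool equals(const Value& o) const override {
        const IntValue* p = dynamic_cast<const IntValue*>(&o);
        return p && p->v == v;
    }
};

struct StringValue : Value {
    std::string v;
    explicit StringValue(const std::string& x) : v(x) {}
    std::size_t hash() const override { return std::hash<std::string>()(v); }
    bool equals(const Value& o) const override {
        const StringValue* p = dynamic_cast<const StringValue*>(&o);
        return p && p->v == v;
    }
};

A composite value (list, dictionary) would implement hash() by combining the hashes of its elements, and the dictionary itself would call hash() and equals() only through the Value interface.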