I have very long strings that I need to compare for equality. Since comparing them char by char is very time consuming, I'd like to create a hash for each string.
I'd like the generated hash code to be unique (or the chance that two strings generate the same hash to be very small). I think creating an int from a string as the hash is not strong enough to rule out two different strings having the same hash code, so I am looking for a string hash code.
Am I right in the above assumption?
To clarify, assume that I have a string of, say, 1K length and I create a hash code of 10 chars; then comparing hash codes is about 100 times faster than comparing the strings.
The question I have is: how do I create such a hash code in C++?
I am developing on Windows using Visual Studio 2012.
To be useful in this case, the hash code must be quick to calculate. Using anything larger than the largest words supported by the hardware (usually 64 bits) may be counterproductive. Still, you can give it a try. I've found the following to work fairly well:
#include <string>

unsigned long long
hash( std::string const& s )
{
    unsigned long long results = 12345; // anything but 0 is probably OK.
    for ( auto current = s.begin(); current != s.end(); ++ current ) {
        results = 127 * results + static_cast<unsigned char>( *current );
    }
    return results;
}
Using a hash like this will probably not be advantageous, however, unless most of the comparisons are with strings that aren't equal, but have long common initial sequences. Remember that if the hashes are equal, you still have to compare the strings, and that comparison only has to go until the first characters which aren't equal. (In fact, most of the comparison functions I've seen start by comparing length, and only compare characters if the strings are of equal length.)
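As a minimal sketch of that flow (assuming the hash() function above, and that the caller has already computed and cached both hash values):

#include <string>

bool equalWithCachedHashes( std::string const& a, unsigned long long hashA,
                            std::string const& b, unsigned long long hashB )
{
    if ( a.size() != b.size() ) return false;  // cheapest rejection first
    if ( hashA != hashB )       return false;  // rejects most unequal strings
    return a == b;                             // equal hashes: still must compare
}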
There are a lot of hashing algorithms available which you may use.
If you want to implement one yourself, a simple one could be to take the ASCII value of each character, map it to a 1-based alphabet index (i.e. a = 1, b = 2, ...), and multiply it by the character's 1-based position in the string. Keep adding these values and store the sum as the hash value for that string.
For example, the hash value for abc would be:
HASH("abc") = 1*1 + 2*2 + 3*3 = 14;
The probability of collision lowers as the string length increases (considering your strings will be lengthy).
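A rough sketch of that scheme, assuming lower-case ASCII letters as in the example (the function name is mine):

#include <cstddef>
#include <string>

unsigned long long simplePositionalHash( std::string const& s )
{
    unsigned long long h = 0;
    for ( std::size_t i = 0; i < s.size(); ++i ) {
        unsigned long long letter = static_cast<unsigned char>( s[i] ) - 'a' + 1;  // a = 1, b = 2, ...
        h += letter * ( i + 1 );                                                   // weight by 1-based position
    }
    return h;  // simplePositionalHash("abc") == 1*1 + 2*2 + 3*3 == 14
}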
There are many known hash algorithms available. For example MD5, SHA1, etc. You should not need to implement your own algorithm but use one of the available ones. Use the search engine of your choice to find implementations like this one.
It really depends what your hard requirements are. If you have hard requirements such as "search may never take more than so and so long", then it's possible that no solution is applicable. If your intent is simply to speed up a large number of searches, then a simple, short hash will do fine.
While it is generally true that hashing a 1000-character string to an integer (a single 32-bit or 64-bit number) can, and eventually will, produce collisions, this is not something to be concerned about.
A 10-character hash will also produce collisions. This is a necessary consequence of the fact that 1000 > 10: for every possible 10-character hash value there exist vastly more 1000-character strings¹.
The important question is whether you will actually see collisions, how often you will see them, and whether it matters at all. Whether (or how likely) you see a collision depends not on the length of the strings, but on the number of distinct strings.
If you hash 77,100 strings (longer than 4 characters) using a 32-bit hash, there is roughly a 50% chance that at least one collision occurs somewhere in the set. At 25,000 strings, the likelihood is only somewhere around 5-6%. At 1,000 strings, it is a small fraction of a percent.
Note that when I say "50% at 77,100 strings", this does not mean that your chance of actually encountering a collision in a lookup is that high. It is merely the chance that some two strings in the set share a hash value. Unless that is the case for a large fraction of your strings, the chance of actually hitting one is again a lot lower.
Which means no more and no less than that for most usage cases it simply doesn't matter. Unless you want to hash hundreds of thousands of strings, stop worrying now and use a 32-bit hash.
Otherwise, unless you want to hash billions of strings, stop worrying here and use a 64-bit hash.
Thing is, you must be prepared to handle collisions in any case because as long as you have 2 strings, the likelihood for a collision is never exactly zero. Even hashing only 2 or 3 1000-character strings into a 500-byte hash could in principle have a collision (very unlikely but possible).
Which means you must do a string comparison if the hash matches in either case, no matter how long (or how good or bad) your hash is.
If collisions don't happen every time, they're entirely irrelevant. If you have a lot of collisions in your table and encounter one, say, on 1 in 10,000 lookups (which is a lot!), it has no practical impact. Yes, you will have to do a useless string comparison once in 10,000 lookups, but the other 9,999 work by comparing a single integer alone. Unless you have a hard realtime requirement, the measurable impact is exactly zero.
Even if you totally screw up and encounter a collision on every 5th search (a pretty disastrous case; it would mean that roughly 800 million string pairs collide, which is only possible at all with at least 1.6 billion strings), this still means that 4 out of 5 searches don't hit a collision, so you still discard 80% of non-matches without doing a comparison.
On the other hand, generating a 10-character hash is cumbersome and slow, and you are likely to create a hash function that has more collisions (because of bad design) than a readily existing 32- or 64-bit hash.
Cryptographic hash functions are certainly better, but they run slower than their non-cryptographic counterparts too, and the storage required for their 16- or 32-byte hash values is much larger as well (at virtually no benefit to most people). This is a space/time tradeoff.
Personally, I would just use something like djb2, which can be implemented in 3 lines of C code, works well, and runs very fast. There exist of course many other hash functions that you could use, but I like djb2 for its simplicity.
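For reference, djb2 as it is commonly published (seed 5381, multiplier 33); only the std::string wrapper is mine:

#include <string>

unsigned long djb2( std::string const& s )
{
    unsigned long hash = 5381;
    for ( unsigned char c : s ) {
        hash = ( ( hash << 5 ) + hash ) + c;  // hash * 33 + c
    }
    return hash;
}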
Funnily, after reading James Kanze's answer, the code posted there seems to be a variation of djb2, only with a different seed and a different multiplier (djb2 uses 5381 and 33, respectively).
In the same answer, the remark about comparing string lengths first is a good tip as well. It's noteworthy that you can consider a string's length a form of "hash function", too (albeit a rather weak one, but one that often comes "for free").
¹ However, strings are not some "random binary garbage" as the hash is. They are structured, low-entropy data. In that respect, the comparison does not really hold true.
Well, I would first compare string lengths. If they match, then I'd start comparing using an algorithm that tests character equality at random positions, and stop at the first difference.
The random positions would be obtained from a stringLength-sized vector filled with the indices 0 to stringLength-1 in random order. I haven't measured this method, though; it's just an idea (a sketch follows below). But it would spare you the concerns about hash collisions while still reducing comparison times.
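A minimal sketch of the idea (the function name is mine; here the index permutation is rebuilt on every call, whereas in practice you would build it once per length and reuse it):

#include <algorithm>
#include <numeric>
#include <random>
#include <string>
#include <vector>

bool equalsRandomOrder( const std::string& a, const std::string& b )
{
    if ( a.length() != b.length() ) return false;

    std::vector<std::size_t> order( a.length() );
    std::iota( order.begin(), order.end(), 0 );  // 0, 1, ..., length-1
    std::shuffle( order.begin(), order.end(), std::mt19937{ std::random_device{}() } );

    for ( std::size_t i : order ) {
        if ( a[i] != b[i] ) return false;        // stop at the first difference
    }
    return true;
}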
Related
I was wondering: unordered_set uses hashing, so should it be faster for integers than for strings? The same would apply to unordered_map. I found no definite answer on the web. It would be great if someone could clarify this.
Is there any difference in performance of unordered_set (C++) in the case of strings vs in the case of integers?
There can be. The language specification doesn't have guarantees one way or the other.
You can verify whether this is the case for your program on your target system by measuring the performance.
If you're considering whether to use string itself as the key, or a separately hashed string (i.e. integer), then technically separate hashing is more expensive since the integer would be hashed again. That said, hashing an integer is trivial (I think it may be the identity function), so this might have no noticeable effect.
Separate hashing + storing the integer does have a potential advantage: you can pre-hash the string keys once and reuse the hashed integer key, while a map with string keys requires the key to be re-hashed on every lookup (a sketch follows below). Whether this is useful in your case depends on what you're going to do with the map.
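A small sketch of that idea, with the obvious caveat that distinct strings can collide on the integer, so the original string must still be stored and compared whenever correctness matters:

#include <functional>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::size_t, std::string> table;

    // Hash the string key once up front.
    std::string key = "some long key";
    std::size_t hashedKey = std::hash<std::string>{}( key );
    table.emplace( hashedKey, key );

    // Later lookups reuse hashedKey; the string is never re-hashed.
    // Note: two different strings may share a std::size_t hash, so the
    // stored string still has to be compared when exactness matters.
    bool found = table.count( hashedKey ) > 0;
    (void) found;
}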
The abstract answer would be "it depends on the implementation and other details", like the sizes of the keys and containers. The standard doesn't specify anything about strings vs ints, so you should not expect a globally valid answer.
On popular platforms like gcc/clang on x86/x86_64 your guess seems about right. In my experience, replacing string keys in maps with ints or pointers has given substantial performance wins.
Still, there might be specific circumstances where strings beat ints even on the platforms mentioned.
Hashing functions aren't specified by the C++ Standard.
That said, GCC, Clang and Visual C++ all use an identity hash for integers, meaning the std::hash<> specialised for integer types returns its input. Visual C++ uses power-of-2 bucket counts, while GCC uses prime numbers. Consequently, certain inputs are extremely collision-prone on Visual C++, e.g. pointers to objects with N-byte alignment (where N is largish) that have been converted to numbers will all collide at buckets 0, N, 2N, 3N etc., with all the buckets in between storing no data. On the other hand, if the integers are random enough that they happen to distribute well across the buckets without excessive collisions (which is much more likely with GCC's prime bucket count), then an identity hash saves the CPU time that would otherwise be spent processing them further.
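As a quick illustration of the identity behaviour (not guaranteed by the standard; it simply happens to hold on the implementations mentioned):

#include <functional>
#include <iostream>

int main()
{
    // Typically prints 12345 on GCC, Clang and Visual C++, i.e. the input itself.
    std::cout << std::hash<int>{}( 12345 ) << '\n';
}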
GCC uses MURMUR32 hashing for strings, while Visual C++ does some simple xoring and shifting on about 10 characters roughly evenly sampled along the string (so GCC is slower but has massively better hash quality, particularly for things like same-length directory/filename paths with common prefixes and just an incrementing code at the end, where Visual C++ may only incorporate a single differing character into the hash). Compared to a string storing its text inside itself (a technique known as Short String Optimisation or SSO) or an integer, any string storing longer text in dynamically allocated memory ("heap") will require at least one extra cache line to be read to reach the text (and on modern x86 architectures, an extra cache miss may be needed for each 64-byte chunk of the string accessed during hashing).
It is possible to create an object to store a string and a hash - calculated once - but that's not exactly what the question asks about, and after finding a match on hash value you still need to compare the entire string content to be certain of a match.
In conclusion, if you use the default identity hashing with collision-prone keys on Visual C++, it may be slower than using strings (if the strings hash with fewer collisions, which is far from certain). But in most cases, using integer keys will be faster.
If you really care, always benchmark on your own system and dataset.
I have a list of seed strings, about 100 predefined strings. All strings contain only ASCII characters.
std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};
My app constantly receives a lot of strings which can contain any characters. I need to check each received string and decide whether it contains any of the seeds or not. The comparison must be case insensitive.
I need the fastest possible algorithm to test a received string.
Right now my app uses this algorithm:
std::wstring testedStr;
for ( auto & seed : seeds )
{
    if ( boost::icontains( testedStr, seed ) )
    {
        return true;
    }
}
return false;
It works well, but I'm not sure that this is the most efficient way.
How is it possible to implement the algorithm in order to achieve better performance?
This is a Windows app. App receives valid std::wstring strings.
Update
For this task I implemented the Aho-Corasick algorithm. If someone could review my code it would be great - I do not have much experience with such algorithms. Link to implementation: gist.github.com
If there is a finite number of matching strings, this means that you can construct a tree such that, read from root to leaves, similar strings will occupy similar branches.
This is also known as a trie, or Radix Tree.
For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:
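Sketched in plain text (with complete words noted in parentheses), that trie would look roughly like this:

(root)
  c
    a
      t          (cat)
    o
      a
        c
          h      (coach)
      n          (con)
        c
          h      (conch)
  d
    a
      r
        k        (dark)
      d          (dad)
      n
        k        (dank)
    o            (do)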
A search for one of the words walks the tree, starting from the root. Reaching a node that marks the end of a word corresponds to a match with a seed. At each step, the next character in the string must match one of the current node's children; if it does not, you can terminate the search (e.g. you would not consider any words starting with "g", or any words beginning with "cu").
There are various algorithms for constructing the tree, searching it, and modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one, since I don't know the best algorithm for it.
Conceptually, an algorithm you might use to search the tree is related to the idea behind radix sort: there is a fixed set of categories or values that a character in a string might take on at a given position.
This lets you check one word against your word-list. Since you're looking for this word-list as sub-strings of your input string, there's going to be more to it than this.
Edit: As other answers have mentioned, the Aho-Corasick algorithm for string matching is a sophisticated algorithm for performing string matching, consisting of a trie with additional links for taking "shortcuts" through the tree and having a different search pattern to accompany this. (As an interesting note, Alfred Aho is also a contributor to the popular compiler textbook, Compilers: Principles, Techniques, and Tools, as well as the algorithms textbook, The Design and Analysis of Computer Algorithms. He is also a former member of Bell Labs. There does not seem to be much public information about Margaret J. Corasick.)
You can use the Aho-Corasick algorithm (a sketch follows after the list of advantages below).
It builds a trie/automaton in which some vertices are marked as terminal; reaching such a vertex means the scanned string contains one of the seeds.
It is built in O(sum of dictionary word lengths) and gives the answer in O(test string length).
Advantages:
It is specifically designed to work with several dictionary words, and the check time doesn't depend on the number of words (if we ignore cases where the automaton doesn't fit in memory, etc.).
The algorithm is not hard to implement (compared to suffix structures, at least).
You can make it case insensitive by lower-casing each symbol if it is ASCII (non-ASCII chars don't match anyway).
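A minimal sketch over lower-case ASCII letters, assuming std::string input (the question's seeds are ASCII, so wide strings can be narrowed or handled analogously); all names are illustrative:

#include <array>
#include <cctype>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick
{
    struct Node
    {
        std::array<int, 26> next;   // goto links per letter, -1 if absent
        int  fail = 0;              // failure link
        bool terminal = false;      // a dictionary word ends here (or at a suffix of this state)
        Node() { next.fill( -1 ); }
    };

    std::vector<Node> nodes = std::vector<Node>( 1 );  // node 0 is the root

    void add( const std::string& word )                 // insert one seed
    {
        int v = 0;
        for ( char ch : word ) {
            int c = std::tolower( static_cast<unsigned char>( ch ) ) - 'a';
            if ( nodes[v].next[c] == -1 ) {
                nodes[v].next[c] = static_cast<int>( nodes.size() );
                nodes.emplace_back();
            }
            v = nodes[v].next[c];
        }
        nodes[v].terminal = true;
    }

    void build()                                         // BFS to fill failure links
    {
        std::queue<int> q;
        for ( int c = 0; c < 26; ++c ) {
            int u = nodes[0].next[c];
            if ( u == -1 ) nodes[0].next[c] = 0;
            else { nodes[u].fail = 0; q.push( u ); }
        }
        while ( !q.empty() ) {
            int v = q.front(); q.pop();
            nodes[v].terminal = nodes[v].terminal || nodes[nodes[v].fail].terminal;
            for ( int c = 0; c < 26; ++c ) {
                int u = nodes[v].next[c];
                if ( u == -1 ) nodes[v].next[c] = nodes[nodes[v].fail].next[c];
                else { nodes[u].fail = nodes[nodes[v].fail].next[c]; q.push( u ); }
            }
        }
    }

    bool containsAny( const std::string& text ) const    // does text contain any seed?
    {
        int v = 0;
        for ( char ch : text ) {
            if ( !std::isalpha( static_cast<unsigned char>( ch ) ) ) { v = 0; continue; }
            v = nodes[v].next[ std::tolower( static_cast<unsigned char>( ch ) ) - 'a' ];
            if ( nodes[v].terminal ) return true;
        }
        return false;
    }
};

// Usage:
//   AhoCorasick ac;
//   for ( const char* s : { "google", "yahoo", "stackoverflow" } ) ac.add( s );
//   ac.build();
//   bool hit = ac.containsAny( "Please visit StackOverflow" );   // true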
You could try a pre-existing regex utility. It may be slower than a hand-rolled algorithm, but regex is about matching multiple alternatives, so it is likely to already be several times faster than a hashmap or a simple comparison against all strings. I believe regex implementations may already use the Aho-Corasick algorithm mentioned by RiaD, so basically you will have at your disposal a well-tested and fast implementation.
If you have C++11, you already have a standard regex library:
#include <regex>
#include <string>

int main()
{
    std::string input_string = "please visit StackOverflow";        // example input
    std::regex self_regex( "google|yahoo|stackoverflow",
                           std::regex::icase );                     // case-insensitive, as required
    bool found = std::regex_search( input_string, self_regex );     // substring search
    return found ? 0 : 1;
}
I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)
One of the faster ways is to use a suffix tree https://en.wikipedia.org/wiki/Suffix_tree, but this approach has a huge disadvantage: it is a complicated data structure that is difficult to construct. Ukkonen's algorithm allows building the tree from a string in linear time: https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm
Edit: As Matthieu M. pointed out, the OP asked if a string contains a keyword. My answer only works when the string equals the keyword or if you can split the string e.g. by the space character.
Especially with a high number of possible candidates that are known at compile time, a perfect hash function generated with a tool like gperf is worth a try. The main principle is that you feed the generator your seed strings and it generates a function containing a hash function that has no collisions for the seed values. At runtime you give the function a string; it calculates the hash and then checks whether the string is the single possible candidate corresponding to that hash value.
The runtime cost is hashing the string and then comparing against the single possible candidate (O(1) with respect to the number of seeds; hashing and comparing are linear in the string length).
To make the comparison case insensitive, you just apply tolower to the seeds and to your string.
Because the number of strings is not big (~100), you can use the following algorithm:
Calculate the maximum length of the words you have. Let it be N.
Create an array of checksums: int checks[N];.
Let the checksum be the sum of all characters in the search phrase. You can calculate such a checksum for each word in your list (which is known at compile time) and create a std::map<int, std::vector<std::wstring>>, where the int is the checksum of a string and the vector contains all your strings with that checksum.
Create an array of such maps, one for each length (up to N); this can also be done at compile time.
Now move a pointer over the big string. When the pointer reaches character X, add the value of the char X to all the integers in checks, and for each of them (indices 1 to N) subtract the value of character (X-K), where K is the index of that integer in the checks array. That way you always have the correct rolling checksum for every length stored in the checks array.
After that, look up in the map whether there exist strings with that (length, checksum) pair, and if so, compare them.
It should give false positives (where the checksum and length are equal but the phrase is not) very rarely.
So, let's say R is the length of the big string. Then looping over it takes O(R).
At each step you perform N additions of a small number (a char value) and N subtractions of a small number, which is very fast. At each step you have to access the counters in the checks array, which is O(1) because it is one contiguous memory block.
At each step you also have to find the map in the array of maps, which is also O(1) because it too is one contiguous memory block.
Inside the map you have to search for strings with the matching checksum, which takes O(log F), where F is the size of the map; it will usually contain no more than 2-3 strings, so in general we can pretend this is also O(1).
You can also check in advance whether any of your words share a checksum; if none do (which is quite likely with just 100 words), you can drop the maps entirely and store plain pairs instead.
So, finally, this should give O(R) with quite a small constant.
The way the checksum is calculated can be changed, but this one is quite simple and completely fast, with really rare false positives. A rough sketch follows below.
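A rough sketch of the idea (all names are mine; case folding and the compile-time table construction are left out for brevity, and a single map keyed by (length, checksum) stands in for the array of maps):

#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

bool containsAnySeed( const std::string& text, const std::vector<std::string>& seeds )
{
    // (length, checksum) -> seeds with that signature; built once up front.
    std::map<std::pair<std::size_t, long>, std::vector<std::string>> table;
    std::size_t maxLen = 0;
    for ( const std::string& s : seeds ) {
        long sum = 0;
        for ( unsigned char c : s ) sum += c;
        table[{ s.size(), sum }].push_back( s );
        maxLen = std::max( maxLen, s.size() );
    }

    std::vector<long> checks( maxLen + 1, 0 );    // checks[K] = sum of the last K characters
    for ( std::size_t i = 0; i < text.size(); ++i ) {
        for ( std::size_t k = maxLen; k >= 1; --k ) {
            checks[k] += static_cast<unsigned char>( text[i] );
            if ( i >= k ) checks[k] -= static_cast<unsigned char>( text[i - k] );
            if ( i + 1 < k ) continue;            // window of length k not full yet
            auto it = table.find( { k, checks[k] } );
            if ( it == table.end() ) continue;
            for ( const std::string& s : it->second ) {
                if ( text.compare( i + 1 - k, k, s ) == 0 ) return true;  // confirm the hit
            }
        }
    }
    return false;
}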
As a variant on DarioOO’s answer, you could get a possibly faster implementation of a regular expression match, by coding a lex parser for your strings. Though normally used together with yacc, this is a case where lex on its own does the job, and lex parsers are usually very efficient.
This approach might fall down if all your strings are long, as then an algorithm such as Aho-Corasick, Commentz-Walter or Rabin-Karp would probably offer significant improvements, and I doubt that lex implementations use any such algorithm.
This approach is harder if you have to be able to change the set of strings without recompiling, but since flex is open source you could cannibalise its code.
This answer determines if two strings are permutations by comparing their contents. If they contain the same number of each character, they are obviously permutations. This is accomplished in O(N) time.
I don't like the answer though because it reinvents what is_permutation is designed to do. That said, is_permutation has a complexity of:
At most O(N²) applications of the predicate, or exactly N if the sequences are already equal, where N = std::distance(first1, last1).
So I cannot advocate the use of is_permutation where it is orders of magnitude slower than a hand-spun algorithm. But surely the implementer of the standard would not miss such an obvious improvement? So why is is_permutation O(N²)?
is_permutation works on almost any data type. The algorithm in your link works for data types with a small number of values only.
It's the same reason why std::sort is O(N log N) but counting sort is O(N).
It was I who wrote that answer.
When the string's value_type is char, the number of elements required in a lookup table is 256. For a two-byte encoding, 65536. For a four-byte encoding, the lookup table would have just over 4 billion entries, at a likely size of 16 GB! And most of it would be unused.
So the first thing is to recognize that even if we restrict the types to char and wchar_t, it may still be untenable. Likewise if we want to do is_permutation on sequences of type int.
We could have a specialization of std::is_permutation<> for integral types of size 1 or 2 bytes. But this is somewhat reminiscent of std::vector<bool> which not everyone thinks was a good idea in retrospect.
We could also use a lookup table based on std::map<T, size_t>, but this is likely to be allocation-heavy so it might not be a performance win (or at least, not always). It might be worth implementing one for a detailed comparison though.
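For illustration, a sketch of such a map-based counting check (generic, requiring only operator< on the value type; this is not a proposed standard implementation):

#include <iterator>
#include <map>

template <typename Iter>
bool isPermutationByCounting( Iter first1, Iter last1, Iter first2, Iter last2 )
{
    std::map<typename std::iterator_traits<Iter>::value_type, long> counts;
    for ( Iter it = first1; it != last1; ++it ) ++counts[*it];   // count the first range
    for ( Iter it = first2; it != last2; ++it ) --counts[*it];   // subtract the second range
    for ( const auto& entry : counts )
        if ( entry.second != 0 ) return false;                   // the multisets differ
    return true;
}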
In summary, I don't fault the C++ standard for not including a high-performance version of is_permutation for char. First because in the real world I'm not sure it's the most common use of the template, and second because the STL is not the be-all and end-all of algorithms, especially where domain knowledge can be used to accelerate computation for special cases.
If it turns out that is_permutation for char is quite common in the wild, C++ library implementors would be within their rights to provide a specialization for it.
The answer you cite works on chars. It assumes they are 8 bit (not necessarily the case) and so there are only 256 possibilities for each value, and that you can cheaply go from each value to a numeric index to use for a lookup table of counts (for char in this case, the value and the index are the same thing!)
It generates a count of how many times each char value occurs in each string; then, if these distributions are the same for both strings, the strings are permutations of each other.
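A sketch of that counting approach, restricted to 8-bit char as the cited answer assumes:

#include <array>
#include <string>

bool isPermutation( const std::string& a, const std::string& b )
{
    if ( a.size() != b.size() ) return false;
    std::array<long, 256> counts{};              // one counter per possible char value
    for ( unsigned char c : a ) ++counts[c];
    for ( unsigned char c : b ) --counts[c];
    for ( long n : counts )
        if ( n != 0 ) return false;              // some character count differs
    return true;                                 // same character distribution
}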
What is the time complexity?
you have to walk each character of each string, so M+N steps for two inputs of lengths M and N
each of these steps involves incrementing a count in a fixed-size table at an index given by the char, so it is constant time
So the overall time complexity is O(N+M): linear, as you describe.
Now, std::is_permutation makes no such assumptions about its input. It doesn't know that there are only 256 possibilities, or indeed that they are bounded at all. It doesn't know how to go from an input value to a number it can use as an index, never mind how to do that in constant time. The only thing it knows is how to compare two values for equality, because the caller supplies that information.
So, the time complexity:
we know it has to consider each element of each input at some point
we know that, for each element it hasn't seen before (I'll leave discussion of how that's determined and why that doesn't impact the big O complexity as an exercise), it's not able to turn the element into any kind of index or key for a table of counts, so it has no way of counting how many occurrences of that element exist which is better than a linear walk through both inputs to see how many elements match
so the complexity is going to be quadratic at best.
If it were just ASCII characters, I would just use an array of bool of size 256. But Unicode has so many characters.
1. Wikipedia says Unicode has more than 110,000 characters. So bool[110000] might not be a good idea?
2. Let's say the characters are coming in a stream and I just want to stop whenever a duplicate is detected. How do I do this?
3. Since the set is so big, I was thinking of a hash table. But how do I tell when a collision happens, since I do not want to continue once one is detected? Is there a way to do this in the STL implementation of a hash table?
It should be efficient in terms of speed and memory utilization.
Your Options
There are several possible solutions:
1. bool[0x110000] (note that this is a hex constant, not a decimal constant as in the question)
2. vector<bool> with a size of 0x110000
3. sorted vector<uint32_t> or list<uint32_t> containing every encountered codepoint
4. map<uint32_t, bool> or unordered_map<uint32_t, bool> containing a mapping of codepoints to whether they have been encountered
5. set<uint32_t> or unordered_set<uint32_t> containing every encountered codepoint
6. A custom container, e.g. a bloom filter, which provides high-density probabilistic storage for exactly this kind of problem
Analysis
Now, let's perform a basic analysis of the 6 variants:
1. Requires exactly 0x110000 bytes = 1.0625 MiB, plus whatever overhead a single allocation adds. Both setting and testing are extremely fast.
2. While this would seem to be pretty much the same solution, it only requires roughly 1/8 of the memory, since it stores the bools in one bit each instead of one byte each. Both setting and testing are extremely fast; performance relative to the first solution may be better or worse, depending on things like CPU cache size, memory performance and of course your test data.
3. While potentially taking the least amount of memory (4 bytes per encountered codepoint, so it requires less memory as long as the input stream contains at most 0x110000 / 8 / 4 = 34816 distinct codepoints), performance for this solution will be abysmal: testing takes O(log(n)) for the vector (binary search) and O(n) for the list (binary search requires random access), while inserting takes O(n) for the vector (all following elements must be moved) and O(1) for the list (assuming that you kept the result of your failed test). This means that a test + insert cycle takes O(n), and therefore your total runtime will be O(n^2)...
4. Without a lot of discussion it should be obvious that we do not need the bool value, but would rather just test for existence, leading to solution 5.
5. Both sets are fairly similar in performance; set is usually implemented with a binary tree and unordered_set with a hash map. This means that both are fairly inefficient memory-wise: both contain additional overhead (non-leaf nodes in trees, and the actual tables containing the hashes in hash maps), meaning they will probably take 8-16 bytes per entry. Testing and inserting are O(log(n)) for the set and O(1) amortized for the unordered_set. To answer the comment: testing whether uint32_t const x is contained in unordered_set<uint32_t> data is done like so: if(data.count(x)) or if(data.find(x) != data.end()).
6. The major drawback here is the significant amount of work that the developer has to invest. Also, the bloom filter that was given as an example is a probabilistic data structure, meaning false positives are possible (false negatives are not, in this specific case).
Conclusion
Since your test data is not actual textual data, using a set of either kind is actually very memory inefficient (with high probability). Since this is also the only possibility to achieve better performance than the naive solutions 1 and 2, it will in all probability be significantly slower as well.
Taking into consideration the fact that it is easier to handle and that you seem to be fairly conscious of memory consumption, the final verdict is that vector<bool> seems the most appropriate solution.
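A small sketch of that solution for the streaming case (class and method names are mine):

#include <cstdint>
#include <vector>

class DuplicateDetector
{
public:
    DuplicateDetector() : seen( 0x110000, false ) {}

    // Feed one codepoint (must be < 0x110000); returns true when it has
    // already been seen, so the caller can stop right away.
    bool push( std::uint32_t codepoint )
    {
        if ( seen[codepoint] ) return true;
        seen[codepoint] = true;
        return false;
    }

private:
    std::vector<bool> seen;
};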
Notes
Unicode is intended to represent text. This means that your test case and any analysis following from this is probably highly flawed.
Also, while it is simple to detect duplicate code points, the idea of a character is far more ambiguous, and may require some kind of normalization (e.g. "ä" can be either one codepoint or an "a" and a diacritical mark, or even be based on the cyrillic "a").
I'm writing a program right now which produces four unsigned 32-bit integers as output from a certain function. I'm wanting to hash these four integers, so I can compare the output of this function to future outputs.
I'm having trouble writing a decent hashing function though. When I originally wrote this code, I threw in a simple addition of each of the four integers, which I knew would not suffice. I've tried several other techniques, such as shifting and adding, to no avail. I get a hash, but it's of poor quality, and the function generates a ton of collisions.
The hash output can be either a 32-bit or 64-bit integer. The function in question generates many billions of hashes, so collisions are a real problem here, and I'm willing to use a larger variable to ensure that there are as few collisions as possible.
Can anyone help me figure out how to write a quality hash function?
Why don't you store the four integers in a suitable data structure and compare them all? The benefit of hashing them in this case appears dubious to me, unless storage is a problem.
If storage is the issue, you can use one of the hash functions analyzed here.
Here's a fairly reasonable hash function from 4 integers to 1 integer:
unsigned int hash4( unsigned int const in[4] )   // the name and signature are illustrative
{
    unsigned int hash = in[0];
    hash *= 37;
    hash += in[1];
    hash *= 37;
    hash += in[2];
    hash *= 37;
    hash += in[3];
    return hash;
}
With uniformly-distributed input it gives uniformly-distributed output. All bits of the input participate in the output, and every input value (although not every input bit) can affect every output bit. Chances are it's faster than the function which produces the output, in which case there are no performance concerns.
There are other hashes with other characteristics, but accumulate-with-multiplication-by-prime is a good start until proven otherwise. You could try accumulating with xor instead of addition if you like. Either way, it's easy to generate collisions (for example {1, 0, a, b} collides with {0, 37, a, b} for all a, b), so you might want to pick a prime which you think has nothing to do with any plausible implementation bug in your function. So if your function has a lot of modulo-37 arithmetic in it, maybe use 1000003 instead.
Because hashing can generate collisions, you have to keep the keys in memory anyway in order to detect those collisions. Hash maps and other standard data structures already do this in their internal bookkeeping.
As the key is so small, just use the key directly rather than hashing. This will be faster and will ensure no collisions.
I fully agree with Vinko - just compare them all. If you still want a good hashing function, you need to analyse the distribution of your 4 unsigned integers. Then you have to craft your hashing function in such a way that the result is evenly distributed over the whole range of the 32-bit hash value.
A simple example: let's just assume that most of the time, each of the four values is in the range from 0 to 255. Then you could easily blend the lower 8 bits of each value into your hash. Most of the time you'd find the result directly; just sometimes (when one value is larger) you'd have a collision.
To sum it up: without information about how the 4 values are distributed, we can't help you with a good hashing function.
Why a hash? It seems like a std::set or std::multiset would be better suited to store this kind of output. All you'd need to do is wrap the four integers up in a struct and write a simple comparison function.
Try using CRC or FNV. FNV is nice because it is fast and has a defined method of folding bits to get "smaller" hash values (i.e. 12-bit / 24-bit / etc).
Also the benefit of generating a 64-bit hash from a 128-bit (4 X 32-bit) number is a bit questionable because as other people have suggested, you could just use the original value as a key in a set. You really want the number of bits in the hash to represent the number of values you originally have. For example, if your dataset has 100,000 4X32-bit values, you probably want a 17-bit or 18-bit hash value, not a 64-bit hash.
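For reference, a 32-bit FNV-1a sketch over raw bytes (the 64-bit variant uses the offset basis 14695981039346656037 and the prime 1099511628211); the wrapper name is mine:

#include <cstddef>
#include <cstdint>

std::uint32_t fnv1a32( const void* data, std::size_t length )
{
    const unsigned char* bytes = static_cast<const unsigned char*>( data );
    std::uint32_t hash = 2166136261u;            // FNV offset basis
    for ( std::size_t i = 0; i < length; ++i ) {
        hash ^= bytes[i];
        hash *= 16777619u;                       // FNV prime
    }
    return hash;
}

// Usage for the question's four integers:
//   std::uint32_t values[4] = { a, b, c, d };
//   std::uint32_t h = fnv1a32( values, sizeof values );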
It might be a bit of overkill, but consider Boost.Hash. It generates very simple code and good values.
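For instance, combining the four integers with boost::hash_combine (the header is boost/functional/hash.hpp in classic Boost; newer releases also provide it via boost/container_hash/hash.hpp):

#include <cstdint>
#include <boost/functional/hash.hpp>

std::size_t hashFour( std::uint32_t a, std::uint32_t b, std::uint32_t c, std::uint32_t d )
{
    std::size_t seed = 0;
    boost::hash_combine( seed, a );   // mixes each value into the running seed
    boost::hash_combine( seed, b );
    boost::hash_combine( seed, c );
    boost::hash_combine( seed, d );
    return seed;
}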