Qt - search string in 200k-word dictionary - C++

So, I have this text file (generated with Aspell) with 200,000 words in it. It's going to be used for a Scrabble game, to check whether a word is legit. That means that, most likely, quite a lot of the checks will be for words that are not in there, and I was wondering what the most efficient approach would be.
Checking the text file line by line would take up to 200,000 iterations per check, so that'd be my last choice.
Getting all the words into a QList and using QList::contains() (or QList::indexOf(), since I think I'm on Qt 4.8). I don't know about the efficiency of that though, and it's going to use quite a lot of memory.
Using a hash table. I honestly am not sure how that works, so if anyone could tell me whether Qt provides suitable data types for it, I can do some research.
Are there any other, more efficient methods? Currently leaning towards the QList option, as it seems the easiest to implement :)

You can use std::unordered_set; it performs lookups via a hash table.
Qt has its own implementation of it, QSet.
Do not use QList or the line-by-line file traversal, as both are orders of magnitude slower than a single hash-table lookup.
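For illustration, a minimal sketch of the hash-set approach using std::unordered_set (with Qt types, QSet<QString> and QSet::contains() work the same way); the file name here is just a placeholder:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    // Load the word list once at startup ("words.txt" is a placeholder path).
    std::unordered_set<std::string> dictionary;
    std::ifstream in("words.txt");
    for (std::string word; std::getline(in, word); )
        dictionary.insert(word);

    // Each check is an average O(1) hash-table lookup.
    const std::string candidate = "quixotic";
    std::cout << candidate
              << (dictionary.count(candidate) ? " is" : " is not")
              << " in the dictionary\n";
}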

Assuming the hash function is good, a hash table will definitely be the fastest method: a lookup is essentially just computing the hash of the string, and since typical English words are only around 5 characters long, that computation is cheap.
There is an example in the QHash section of this page for how to hash a string: http://doc.qt.digia.com/qq/qq19-containers.html

Sort the list (a one-time operation: save it sorted, or sort it when your program starts) and use a binary search. Looking up any word among 200,000 items takes at most about 18 comparisons (log2 of 200,000 is roughly 17.6), and the first four or so comparisons typically only have to look at a single character.
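A rough sketch of this sort-once-then-binary-search approach, assuming the whole list fits in a std::vector<std::string> (the file name is a placeholder):

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    // Read the word list once ("words.txt" is a placeholder path).
    std::vector<std::string> words;
    std::ifstream in("words.txt");
    for (std::string w; std::getline(in, w); )
        words.push_back(w);

    // One-time sort; alternatively store the file already sorted and skip this.
    std::sort(words.begin(), words.end());

    // Each lookup costs about log2(200,000), i.e. roughly 18 string comparisons.
    bool legal = std::binary_search(words.begin(), words.end(), std::string("zephyr"));
    (void)legal;   // use the result in real code
}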

Related

Chained hash table keys with universal hashing: does it need a rehash?

I am implementing a chained hash table using a vector<list>. I resized my vector to a prime number, let's say 5. To choose the key I am using universal hashing.
My question is, do I need to rehash my vector? I mean, this code will always generate a key in the range 0 to 4, because it depends on the size of my hash table. That causes collisions, of course, but new strings will simply be added to the lists at each position in the vector... so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?
Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.
do I need to rehash my vector?
Your container could continue to function without rehashing, but searching, insertion and erasure would perform more and more like a plain list and less like a hash table: for example, after inserting 10,000 elements you can expect each list in your vector to hold roughly 2,000 elements, and you may have to search all 2,000 of them to see whether a value you are about to insert is a duplicate, to find a value to erase, or simply to return an iterator to it. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash-table implementation. Your non-resizing implementation is still O(N).
Is this a mistake?
Yes, a fundamental one.
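To make the answers above concrete, here is a minimal sketch of what rehashing a vector-of-lists table involves; std::hash stands in for the universal hash family, and the bucket index is assumed to be hash(key) % table.size():

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

using Table = std::vector<std::list<std::string>>;

// Rebuild the table with more buckets. Every key is re-inserted into the
// bucket dictated by the *new* size; this is the step that cannot be skipped,
// otherwise lookups would compute a bucket index the element is not in.
void rehash(Table& table, std::size_t newBucketCount)
{
    Table bigger(newBucketCount);
    for (auto& bucket : table)
        for (auto& key : bucket)
            bigger[std::hash<std::string>{}(key) % newBucketCount].push_back(std::move(key));
    table.swap(bigger);
}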

Building a set of unique lines in a file efficiently, without storing actual lines in the set

Recently I was trying to solve the following issue:
I have a very large file, containing long lines, and I need to find and print out all the unique lines in it.
I don't want to use a map or set that stores the actual lines, as the file is very big and the lines are long, so this would result in O(N) space complexity with poor constants (where N is the number of lines). Preferably, I would rather build a set storing pointers to the unique lines in the file. Clearly, such a pointer (8 bytes on a 64-bit machine, I believe) is generally much smaller than a line held in memory (1 byte per character, I believe). Although the space complexity is still O(N), the constants are much better. With this implementation the file never needs to be fully loaded into memory.
Now, let's say I go through the file line by line, checking each line for uniqueness. To see whether a line is already in the set, I could compare it against all lines pointed to by the set so far, character by character. This gives O(N^2 * L) complexity, with L the average length of a line. When you don't care about storing the full lines in the set, O(N * L) complexity can be achieved thanks to hashing. Now, when using a set of pointers instead (to reduce space requirements), how can I still achieve this? Is there a neat way to do it? The only thing I can come up with is this approach:
Hash the sentence. Store the hash value in a map (or rather an unordered_multimap: unordered to get hash-map behaviour, multi because duplicate keys may need to be inserted in case of 'false matches').
For every new sentence: check whether its hash value is already in the map. If not, add it. If yes, compare the full sentences (the new one and the one in the unordered map with the same hash) character by character, to make sure there is no 'false match'. If it is a 'false match', still add it.
Is this the right way? Or is there a nicer way you could do it? All suggestions are welcome!
And can I use some clever 'comparison object' (or something like that, I don't know much about that yet) to make this checking for already existing sentences fully automated on every unordered_map::find() call?
Your solution looks fine to me since you are storing O(unique lines) hashes not N, so that's a lower bound.
Since you scan the file line by line you might as well sort the file. Now duplicate lines will be contiguous and you need only check against the hash of the previous line. This approach uses O(1) space but you've got to sort the file first.
As #saadtaame's answer says, your space is actually O(unique lines) - depending on your use case, this may be acceptable or not.
While hashing certainly has its merits, it can conceivably have many problems with collisions - and if you can't have false positives, then it is a no-go unless you actually keep the contents of the lines around for checking.
The solution you describe is to maintain a hash-based set. That is obviously the most straightforward thing to do, and yes, it would require keeping all the unique lines in memory. That may or may not be a problem, though. It would also be the easiest solution to implement: what you are trying to do is exactly what any implementation of a (hash-based) set would do. You can just use std::unordered_set and add every line to it.
Since we are throwing around ideas, you could also use a trie as a substitute for the set. You would maybe save some space, but it still would be O(unique lines).
If there isn't some special structure in the file you can leverage, then definitely go for hashing the lines. This will be faster, by orders of magnitude, than actually comparing each line against every other line in the file.
If your actual implementation is still too slow, you can, for example, limit the hashing to the first portion of each line. This will produce more false positives, but assuming most lines already differ in the first few words, it will significantly speed up the file processing (especially if you are I/O-bound).
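As a rough illustration of the hash-plus-file-offset scheme discussed above (a sketch under simplifying assumptions, not a tuned implementation; the file name is a placeholder, and a second stream is used to re-read remembered offsets):

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    const char* path = "input.txt";              // placeholder path
    std::ifstream scan(path);                    // sequential pass over the file
    std::ifstream probe(path);                   // only used to re-read remembered offsets

    // hash of a line -> offset of its first occurrence (multimap: hashes may collide)
    std::unordered_multimap<std::size_t, std::streampos> seen;

    std::string line, other;
    for (std::streampos pos = scan.tellg(); std::getline(scan, line); pos = scan.tellg()) {
        const std::size_t h = std::hash<std::string>{}(line);
        bool duplicate = false;
        auto range = seen.equal_range(h);
        for (auto it = range.first; it != range.second && !duplicate; ++it) {
            probe.clear();
            probe.seekg(it->second);
            std::getline(probe, other);
            duplicate = (other == line);         // full comparison rules out false matches
        }
        if (!duplicate) {
            std::cout << line << '\n';           // unique so far: print it, remember its offset
            seen.emplace(h, pos);
        }
    }
}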

Best string search algorithm around

I have code in which I compare a large chunk of data, say the source of a web page, against some words in a file. What is the best algorithm to use?
There can be 2 scenarios:
I have a large number of words to compare against the source. A normal string search algorithm would have to take a word, compare it against the data, take the next word, compare it against the data, and so on until all of them are done.
I have only a couple of words in the file, so a normal string search would be fine, but I still need to reduce the time as much as possible.
What algorithm is best? I know about Boyer-Moore and also Rabin-Karp search algorithms.
Although Boyer-Moore search seems fast, I would also like names of other algorithms and their comparisons.
In both cases, I think you probably want to construct a Patricia trie (also called a radix tree). Most importantly, lookup time would be O(k), where k is the maximum length of a string in the trie.
Note that Boyer-Moore is for searching for a pattern (possibly several words) within a text.
If all you want is to identify some individual words, then it's much easier to:
put each word you are searching for in a dictionary structure (whatever it is)
look up each word of the text in the dictionary
This most notably means that you read the text as a stream and need not hold it all in memory at once (which works great with the typical example of a file cursor).
As for the structure of the dictionary, I would recommend a simple hash table. It works well memory-wise compared to tree structures.
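A small sketch of that dictionary approach (file names are placeholders): the searched-for words go into a hash set and the large text is streamed one whitespace-separated token at a time, so it never has to be held in memory as a whole:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    // The words we are looking for.
    std::unordered_set<std::string> wanted;
    std::ifstream wordFile("words.txt");
    for (std::string w; wordFile >> w; )
        wanted.insert(w);

    // Stream the large input (e.g. the page source) token by token.
    std::ifstream text("page.html");
    for (std::string token; text >> token; )
        if (wanted.count(token))
            std::cout << "found: " << token << '\n';
}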

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
    // Return whether iTest appears in a set of numbers.
}
The list of numbers is known at application load-up (but is not always the same between two instances of the application) and will not change or grow for the whole run of the program. The integers themselves may be large and span a large range, so a vector<bool> is not efficient. Performance is an issue, as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.
P.S. Ideally the solution shouldn't be a third-party library, because I can't use them here. Something simple enough to be understood and implemented by hand would be great, if possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom filter is a data structure specialized in answering the question 'Is this element part of the set?', and it does so with an incredibly tight memory requirement, quite fast too.
The slight catch is that its answer is... special. To 'Is this element part of the set?' it replies either:
No
Maybe (with a given probability, depending on the properties of the Bloom filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you is that whenever it answers No, you have the guarantee that the element isn't part of the set.
As such, a Bloom filter is ideal as a doorman in front of a binary tree or a hash map. Carefully tuned, it will let only very few false positives through. For example, gcc uses one.
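A minimal sketch of such a filter for ints, just to show the mechanics; the bit count and the two hash mixes below are arbitrary choices, and a real filter would tune the number of bits and of hash functions to the expected number of elements:

#include <bitset>
#include <cstddef>
#include <functional>

class BloomFilter
{
    static const std::size_t BITS = 1u << 20;    // ~1M bits (128 KB), chosen arbitrarily

    std::bitset<BITS> bits_;

    static std::size_t h1(int x) { return std::hash<unsigned>{}(static_cast<unsigned>(x)); }
    static std::size_t h2(int x) { return std::hash<unsigned long long>{}(2654435761ull * static_cast<unsigned>(x) + 12345); }

public:
    void add(int x)
    {
        bits_.set(h1(x) % BITS);
        bits_.set(h2(x) % BITS);
    }

    // false -> definitely not in the set.
    // true  -> maybe in the set; confirm with the exact container (e.g. std::map).
    bool maybeContains(int x) const
    {
        return bits_.test(h1(x) % BITS) && bits_.test(h2(x) % BITS);
    }
};

In the two-tier scheme suggested here, maybeContains() is queried first and the exact std::map lookup only runs for the (rare, if well tuned) 'maybe' answers.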
What comes to my mind is gperf. However, it is based on strings, not numbers. Still, part of the calculation could be tweaked to accept numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The dot-product hash function is demonstrated, since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time there: find something that is fast for your data type and offers an adjustable seed for hashing. A good combination of those is better than the alternative of growing the hash table.
At 54:30 the professor draws a picture of the standard way of doing perfect hashing. Minimal perfect hashing is beyond this lecture. (Good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot sees the same iTest value(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
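A heap-allocated sketch of the same idea, for what it's worth: one bit per possible unsigned value is roughly 512 MB, so this only makes sense when that much memory is acceptable (and note the question states the range is large, so treat this purely as an illustration):

#include <climits>
#include <cstddef>
#include <vector>

// One bit per possible unsigned int value; lives on the heap, so no danger of
// blowing the stack. Assumes a 64-bit build where size_t can hold UINT_MAX + 1.
std::vector<bool> present(static_cast<std::size_t>(UINT_MAX) + 1, false);

// Call once per number at application load-up.
void addToList(int value)
{
    present[static_cast<unsigned int>(value)] = true;
}

bool IsInList(int iTest)
{
    return present[static_cast<unsigned int>(iTest)];
}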
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy-to-implement solution for testing existence would be a sorted list or a balanced binary tree. Then you could decide existence in O(log N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
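Translated to C++ with the standard library rather than a hand-rolled loop, assuming the numbers sit in a std::vector<int> that is sorted once at application start:

#include <algorithm>
#include <vector>

// `numbers` must already be sorted (one-time std::sort at load-up).
bool IsInList(const std::vector<int>& numbers, int iTest)
{
    return std::binary_search(numbers.begin(), numbers.end(), iTest);
}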
It's not necessary or practical to aim for mapping N distinct, randomly dispersed integers to N contiguous buckets, i.e. a minimal perfect hash; the important thing is to settle on an acceptable ratio. To do this at run time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary a fast-to-calculate hash algorithm (e.g. by changing the prime numbers it uses) and see how easily you can meet increasingly difficult ratios. For the worst-acceptable ratio you don't time out, or you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then allow a second or ten (configurable) for each X% improvement, until you either can't succeed at that ratio or you reach the no-point-being-better ratio.
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialled or actively derived at run time. It is not acceptable to simply say 'the integer inputs are already a hash', because there are collisions once they are %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers into a packed array, so there's little memory wasted for large objects.
Original Question
After working with this for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in unique, i.e. perfect, hashing.
Let's say the values in the array ranged from L to H. This yields a range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In your case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position (binary search, or a map, or something like that) so that you can guarantee the container will work in all cases.
A trie, or perhaps a van Emde Boas tree, might be a better bet for creating a space-efficient set of integers with lookup time constant with respect to the number of objects in the data structure, assuming that even a std::bitset would be too large.

data structure for storing array of strings in a memory

I'm considering which data structure to use for storing a large array of strings in memory. The strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as possible. Saving memory is not important. I'm inclined towards the standard hash_set structure, which allows elements to be found in roughly constant time, but that time is not guaranteed to be short. Can anyone suggest a better standard solution?
Many thanks!
Try a Prefix Tree
A Trie is better than a binary search tree for searching elements. For a comparison against a hash table, see this question.
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.
If the strings are immutable, so that insertion/deletion is "infrequent", so to speak, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph, which might* be faster than a hash table and has a better worst-case guarantee.
* Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
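A tiny sketch of that atom/interning idea: each distinct string is stored once in a pool and callers keep the returned pointer, so later equality tests are plain pointer comparisons (the pool and function names are made up for illustration):

#include <string>
#include <unordered_set>

// Returns a stable pointer to the single stored copy of `s`. Equal strings
// always intern to the same pointer. Pointers and references to elements of an
// unordered_set stay valid across later insertions; only erasing invalidates them.
const std::string* intern(std::unordered_set<std::string>& pool, const std::string& s)
{
    return &*pool.insert(s).first;
}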
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (it will search a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated and slower than they appear (especially if you have long strings).
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
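A brief sketch of this build-sort-search scheme; note that sorting raw char* pointers needs an explicit strcmp-based comparator, otherwise std::sort orders them by address (names here are illustrative):

#include <algorithm>
#include <cstring>
#include <vector>

// Lexicographic comparison for C strings.
static bool lessCStr(const char* a, const char* b) { return std::strcmp(a, b) < 0; }

std::vector<const char*> table;   // filled once at startup

void buildIndex()
{
    std::sort(table.begin(), table.end(), lessCStr);     // one-time lexicographic sort
}

bool contains(const char* s)
{
    return std::binary_search(table.begin(), table.end(), s, lessCStr);
}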
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy described in Raymond Chen's blog would be efficient.