Best way to Store and Search Numbers in C++ - c++

I have a very large array storing some numbers. My task is to find if a particular number exists in array or not efficiently. Which algorithm and data structure I should go with?
Few assumptions:
Each number in array would be unique.
I am not concerned about where the data is found in array I just want to return true if data is found else false.
I would be using C++ as programming language.
Please suggest.
Thanks

Constant time lookup with unordered_set.
There are also options of bitsets etc. Depends exactly how large is "very large" and the sparseness of the values stored compared to how many of them there actually are.

seems unordered_set is suitable for your requirement.
PS: Pls remember all elements in this set are immutable

The known best way to check if an element (number) is a member of a set (array) is to use bloom filters. It works well if set is changing over time or if there are set operations among sets. Bloom filters are easy to implement and good implementations are availble
If set is static (i.e. never change) the good way is to use perfect hash function. It will take time to build, but will outperform usual hash function provided by std::unordered_set

Related

Preferred way to test against many discrete values?

I have the following scenario:
variable in {12, 4, 999, ... }:
Where there are about 100 discrete values in the list. I am writing a parser to convert this to C++, and the only ways that I can think of to do it are 100 case statements, or 100 if ==
Is one preferred to the other, or is there an all round better way to do this?
I should clarify, the values are constant integers. Thanks
If the maximum value of any one of your discrete values is small enough a std::vector<bool> of flags set true or false depending on whether that entry is in the list should be pretty optimal - assuming the values occur with approximately equal probabilility.
One way is to arrange the values in order and use binary search to check whether a value is contained in your collection.
You can either put your values in a vector in sorted order using std::lower_bound for the insertion point and then use std::binary_search to test for membership, or you can put your values in an std::set and get that feature for free (using std::set::find() for membership testing).
There are minor performance considerations that may make either option preferable; profile and decide for yourself.
A second approach is to put your values in a hash table such as std::unordered_set (or some kind of static equivalent if your values are known statically).
Assuming the values are constants, you can certainly use a switch statement. The compiler will do this pretty efficiently, using either a binary search type approach or a table [or a combination of table and binary search]. A long list of if-statements will not be as efficient, unless you sort the numbers and make a binary search type approach - a switch-statement is much easier to generate, as the compiler will sort out the best approach to decide what numbers are in the list and which ones aren't.
If the values are not constants, then a switch-statement is obviously not a solution. A bitmap may work - again, depending on the actual range - of the values are a large range, then that's not a good solution, since it will use a lot of memory [but it probably is one of the fastest methods, since it's just a case of dividing/modulo with a 2^n number, which can be done with simple >> and & operators, followed by one memory read].

What's the correct way to generate random strings without duplicates

I'm thinking about generating random strings, without making any duplication.
First thought was to use a binary tree create and locate for duplicate in tree, if any.
But this may not be very effective.
Second thought was using MD5 like hash method which create messages based only on time, but this may introduce another problem, different machines has different accuracy of time.
And in a modern processor, more than one string could be created in a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
Beware of MD5, there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc... Two solutions from the top of my head:
1 Generate UUIDs then turn them into String with a binary representation or base 64 algorithm.
2 Simply generate random Strings and put them in a searchable structure (HashMap) so that you can find very quickly (O(1)-O(log n)) if a generated String already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient, especially for insertions - as it will have to constantly re-balance itself (somewhat of an "expensive" operation).
I'd recommend using a HashSet type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time. Insert all your Strings into the Set. If you create a new String, check to see if it already exists in the Set.
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify in what programming language you're coding. For instance, in Java this will work nicely: UUID.randomUUID().toString() . UUID identifiers are unique in practice, as is stated in wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but with a high but not quite 100% certainty that a member is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature shouldn't hurt a bit.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/

non STL hash table type structure

Is there a way to write simple hashtable with the key as "strings" and value as the frequency, so that there are NO collisons? There will no be removal from the hashtable, and if the object already exists in the hashtable, then just update its frequency(add them together).
I was thinking there might be a algorithm that can compute a unique number from the string which will be used as the index.
Yes, i am avoiding the use of all STL construct including unordered_map.
You can use any perfect hash generator like gperf
See here for a list: http://en.wikipedia.org/wiki/Perfect_hash_function
PS. You'd still possibly want to use a map instead of flat array/vector in case the mapped domain gets too big/sparse
It really depends on what you mean by 'simple'.
The std::map is a fairly simple class. Still, it uses a red-black tree with all of the insertion, deletion, and balancing nicely hidden away, and it is templated to handle any orderable type as a key and any type as the value. Most map classes use a similar implementation, and avoid any sort of hashing functionality.
Hashing without collisions is not a trivial matter whatsoever. Perhaps the simplest method is Pearson Hashing.
It seems like you have 3 choices:
Implement your own perfect hashing class. This would be a pretty good sized class with a lot of functionality and some decently complex algorithms. I don't think this is simple.
Download and use a perfect hashing library that is already out there. Of course, you have to worry about deployability.
Use STL's map class. It's embedded, well-documented, easy to use, type-flexible, and completely cross-platform. This seems like the 'simplest' solution.
If I may ask, Why are you avoiding STL?
If the set of possible strings is known beforehand, you can use a perfect hash function generator to do this. But otherwise, what you ask is impossible.
Now, it IS possible to make the likelihood of collisions extremely low by using a good hash function and making sure your table is huge. You basically need a big enough table to make the likelihood of invoking the Birthday Paradox low enough to suit you. Then you just use n bits of output from SHA-1, and 2^n will be your table size.
I'm also wondering if maybe you could use a Bloom filter and have an actual counter instead of bits. Keep a list of all the words you've stuffed into the bloom filter and what entries they've incremented (which will be the same each time) and you have yourself a gigantic linear function that you might be able to solve to get all the individual counts back out again.

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
//Return if iTest appears in a set of numbers.
}
The number list is known at app load up (But are not always the same between two instances of the same application) and will not change (or added to) throughout the whole of the program. The integers themselves maybe large and have a large range so it is not efficient to have a vector<bool>. Performance is a issue as the function sits in a hot spot. I have heard about Perfect hashing but could not find out any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like if the solution isn't a third party library because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure that is specialized in the question: Is this element part of the set, but does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special: Is this element part of the set ?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned it will only let very few false positive pass. For example, gcc uses one.
What comes to my mind is gperf. However, it is based in strings and not in numbers. However, part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
#54:30 The Prof. draws picture of a standard way of doing perfect hash. The perfect minimal hash is beyond this lecture. (good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
The std::map you get very good performance in 99.9% scenarios. If your hot spot has the same iTest(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
It's not necessary or practical to aim for mapping N distinct randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run-time, you can start by configuring a worst-acceptible ratio (say 1 to 20) and a no-point-being-better-than-this-ratio (say 1 to 4), then randomly vary (e.g. changing prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For worst-acceptible you don't time out, or you fall back on something slower but reliable (container or displacement lists to resolve collisions). Then, allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or reach the no-pint-being-better ratio....
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptible to simple say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But, you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique - perfect hashing.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In you case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fall back position.... binary search, or map or something like that, so that
you can guarantee that the container will work in all cases.
A trie or perhaps a van Emde Boas tree might be a better bet for creating a space efficient set of integers with lookup time bring constant against the number of objects in the data structure, assuming that even std::bitset would be too large.

data structure for storing array of strings in a memory

I'm considering of data structure for storing a large array of strings in a memory. Strings will be inserted at the beginning of the programm and will not be added or deleted while programm is running. The crucial point is that search procedure should be as fast as it can be. Saving of memory is not important. I incline to standard structure hash_set from standard library, that allows to search elements in the structure with about constant time. But it's not guaranteed that this time will be short. Will anyone suggest a better standard desicion?
Many thanks!
Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. Compared against a hash table, you could see this question
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the
beginning of the programm and will not
be added or deleted while programm is running.
If the strings are immutable - so insertion/deletion is "infrequent", so to speak -, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
**Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).*
A hash_set with a suitable number of buckets would be ideal, alternatively a vector with the strings in dictionary order, searched used binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative contaner as you've mentioned, the allocation strategy mentioned in Raymond Chen's Blog would be efficient.