I have the following scenario:
variable in {12, 4, 999, ... }:
Where there are about 100 discrete values in the list. I am writing a parser to convert this to C++, and the only ways that I can think of to do it are 100 case statements, or 100 if ==
Is one preferred to the other, or is there an all round better way to do this?
I should clarify, the values are constant integers. Thanks
If the maximum value of any one of your discrete values is small enough a std::vector<bool> of flags set true or false depending on whether that entry is in the list should be pretty optimal - assuming the values occur with approximately equal probabilility.
One way is to arrange the values in order and use binary search to check whether a value is contained in your collection.
You can either put your values in a vector in sorted order using std::lower_bound for the insertion point and then use std::binary_search to test for membership, or you can put your values in an std::set and get that feature for free (using std::set::find() for membership testing).
There are minor performance considerations that may make either option preferable; profile and decide for yourself.
A second approach is to put your values in a hash table such as std::unordered_set (or some kind of static equivalent if your values are known statically).
Assuming the values are constants, you can certainly use a switch statement. The compiler will do this pretty efficiently, using either a binary search type approach or a table [or a combination of table and binary search]. A long list of if-statements will not be as efficient, unless you sort the numbers and make a binary search type approach - a switch-statement is much easier to generate, as the compiler will sort out the best approach to decide what numbers are in the list and which ones aren't.
If the values are not constants, then a switch-statement is obviously not a solution. A bitmap may work - again, depending on the actual range - of the values are a large range, then that's not a good solution, since it will use a lot of memory [but it probably is one of the fastest methods, since it's just a case of dividing/modulo with a 2^n number, which can be done with simple >> and & operators, followed by one memory read].
Related
I have a very large array storing some numbers. My task is to find if a particular number exists in array or not efficiently. Which algorithm and data structure I should go with?
Few assumptions:
Each number in array would be unique.
I am not concerned about where the data is found in array I just want to return true if data is found else false.
I would be using C++ as programming language.
Please suggest.
Thanks
Constant time lookup with unordered_set.
There are also options of bitsets etc. Depends exactly how large is "very large" and the sparseness of the values stored compared to how many of them there actually are.
seems unordered_set is suitable for your requirement.
PS: Pls remember all elements in this set are immutable
The known best way to check if an element (number) is a member of a set (array) is to use bloom filters. It works well if set is changing over time or if there are set operations among sets. Bloom filters are easy to implement and good implementations are availble
If set is static (i.e. never change) the good way is to use perfect hash function. It will take time to build, but will outperform usual hash function provided by std::unordered_set
I would like to assign a unique object to a set of floating point values. Doing so, I am exploring two different options:
The first option is to maintain a static hash map (std::unordered_map<double,Foo*>) in the class and to avoid that duplicates are created in the first place. This means that instead of calling the constructor, I will check if the value is already in the hash and if so, reuse this. I would also need to remove the value from the hash map in the destructor.
The second option is to allow duplicate values during creation, only to try to sort them all at once and detect duplicates after all values have been created. I guess I would need hash maps for that sorting as well. Or would an ordered map ('std::map) work just as well then?
Is there some reason to expect that the first option (which I like better) would be considerably slower in any situation? That is, would finding duplicate entries be much faster if I perform it all entries at once rather than one entry at a time?
I am aware of the pitfalls when cashing floating point numbers and will prevent not-a-numbers and infinities to be added to the map. Some duplicate entries for the same constant is also not a problem, should this occur for a few entries - it will only result in a very small speed penalty.
Depending on the source and the possible values of the floating point
numbers, a bigger problem might be defining a hash function which
respects equality. (0, Inf and NaN are the problem values—most
floating point formats have two representations for 0, +0.0 and
-0.0, which compare equal; I think the same thing holds for Inf. And
two NaN always compare unequal, even when they have exactly the same bit
pattern.)
Other than that, in all questions of performance, you have to measure.
You don't indicate how big the set is likely to become. Unless it is
enormous, if all values are inserted up front, the fastest solution is
often to use push_back on an std::vector, then std::sort and, if
desired, std::unique after the vector has been filled. In many
cases, using an std::vector and keeping it sorted is faster even when
insertions and removals are frequent. (When you get a new request, use
std::lower_bound to find the entry point; if the value at the location
found is not equal, insert a new entry at that point.) The improved
locality of std::vector largely offsets any additional costs due to
moving the objects during insertion and deletion, and often even the
fact that access is O(lg n) rather than O(1). (In one particular case,
I found that the break even point between a hash table and as sorted
std::vector was around 100,000 entries.)
Have you considered actually measuring it?
None of us can tell you how the code you're considering will actually perform. Write the code, compile it, run it and measure how fast it runs.
Spending time trying to predict which solution will be faster is (1) a waste of your time, and (2) likely to yield incorrect results.
But if you want an abstract answer, it is that it depends on your use case.
If you can collect all the values, and sort them once, that can be done in O(n lg n) time.
If you insert the elements one at a time into a data structure with the performance characteristics of std::map, then each insertion will take O(lg n) time, and so, performing n insertions will also take O(n lg n) time.
Inserting into a hash map (std::unordered_map) takes constant time, and so n insertions can be done in O(n). So in theory, for sufficiently large values of n, a hash map will be faster.
In practice, in your case, no one knows. Which is why you should measure it if you're actually concerned about performance.
I'm thinking about generating random strings, without making any duplication.
First thought was to use a binary tree create and locate for duplicate in tree, if any.
But this may not be very effective.
Second thought was using MD5 like hash method which create messages based only on time, but this may introduce another problem, different machines has different accuracy of time.
And in a modern processor, more than one string could be created in a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
Beware of MD5, there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc... Two solutions from the top of my head:
1 Generate UUIDs then turn them into String with a binary representation or base 64 algorithm.
2 Simply generate random Strings and put them in a searchable structure (HashMap) so that you can find very quickly (O(1)-O(log n)) if a generated String already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient, especially for insertions - as it will have to constantly re-balance itself (somewhat of an "expensive" operation).
I'd recommend using a HashSet type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time. Insert all your Strings into the Set. If you create a new String, check to see if it already exists in the Set.
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify in what programming language you're coding. For instance, in Java this will work nicely: UUID.randomUUID().toString() . UUID identifiers are unique in practice, as is stated in wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but with a high but not quite 100% certainty that a member is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature shouldn't hurt a bit.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/
In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
//Return if iTest appears in a set of numbers.
}
The number list is known at app load up (But are not always the same between two instances of the same application) and will not change (or added to) throughout the whole of the program. The integers themselves maybe large and have a large range so it is not efficient to have a vector<bool>. Performance is a issue as the function sits in a hot spot. I have heard about Perfect hashing but could not find out any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like if the solution isn't a third party library because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure that is specialized in the question: Is this element part of the set, but does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special: Is this element part of the set ?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned it will only let very few false positive pass. For example, gcc uses one.
What comes to my mind is gperf. However, it is based in strings and not in numbers. However, part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
#54:30 The Prof. draws picture of a standard way of doing perfect hash. The perfect minimal hash is beyond this lecture. (good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
The std::map you get very good performance in 99.9% scenarios. If your hot spot has the same iTest(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
It's not necessary or practical to aim for mapping N distinct randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run-time, you can start by configuring a worst-acceptible ratio (say 1 to 20) and a no-point-being-better-than-this-ratio (say 1 to 4), then randomly vary (e.g. changing prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For worst-acceptible you don't time out, or you fall back on something slower but reliable (container or displacement lists to resolve collisions). Then, allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or reach the no-pint-being-better ratio....
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptible to simple say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But, you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique - perfect hashing.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In you case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fall back position.... binary search, or map or something like that, so that
you can guarantee that the container will work in all cases.
A trie or perhaps a van Emde Boas tree might be a better bet for creating a space efficient set of integers with lookup time bring constant against the number of objects in the data structure, assuming that even std::bitset would be too large.
I have a sorted set (std::set to be precise) that contains elements with an assigned weight. I want to randomly choose N elements from this set, while the elements with higher weight should have a bigger probability of being chosen. Any element can be chosen multiple times.
I want to do this as efficiently as possible - I want to avoid any copying of the set (it might get very large) and run at O(N) time if it is possible. I'm using C++ and would like to stick to a STL + Boost only solution.
Does anybody know if there is a function in STL/Boost that performs this task? If not, how to implement one?
You need to calculate (and possibly cache, if you think of performance) the sum of all weights in your set. Then, generate N random numbers ranging up to this value. Finally, iterate your set, counting the sum of the weights you encountered so far. Inspect all the (remaining) random numbers. If the number falls between the previous and the next value of the sum, insert the value from the set and remove your random number. Stop when your list of random numbers is empty or you've reached the end of the set.
I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms: http://www.cse.unr.edu/~banerjee/selection.htm
As for "as efficiently as possible," that would depend on some characteristics of the data. In the application of the weighted roulette wheel, when searching for the index you could consider a binary search instead. However, it is not the case that each slot of the roulette wheel is equally likely, so it may make sense to examine them in order of their weights.
A lot depends on the amount of extra storage you're willing to expend to make the selection faster.
If you're not willing to use any extra storage, #Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing use some extra storage (and possibly a different data structure than std::set) you could create a tree (like a set uses) but at each node of the tree, you'd also store the (weighted) number of items to the left of that node. This will let you map from a generated number to the correct associated value with logarithmic (rather than linear) complexity.