Implementation of a locally ordered set or priority queue? - c++

I have a rather large set of objects that represent numbers, and I want to select such numbers according to a custom ordering. This ordering includes several criteria such as the type of their representation (some numbers are represented by an interval), their integrality and ultimately their value. These numbers are shared throughout the program (via shared pointers) and there is nothing I can do about this.
However, the elements' properties can change at any time, such that the order changes while I can't notify the container about this. For example, some operations require a refinement of a number that is represented by an interval, and during this refinement the exact value can be found. Thereby, the number changes from the interval representation to a rational number, possibly even an integer. This change, due to the shared instance, immediately propagates to the number in the container and breaks the ordering (and I don't even know which number changed). This totally breaks std::set.
So what I'd like to have is a container that tries to be sorted, but does not rely on this. Whenever an operation detects an incorrect ordering, this ordering should be corrected locally. For example insert would insert the element (using binary search) and always check if the ordering of the current element (w.r.t. the neighbors) is correct.
I'd be willing to accept that "give me the smallest element" would then be only "give me a small element" and that find or remove would have linear complexity: I only need front, insert and remove_front to be particularly efficient.
Is there any implementation that does something like this?
How would you implement this?

If you are looking for an algorithm in the standard library, you should take a look at:
std::make_heap
std::pop_heap
std::push_heap
In <algorithm>. They might fit your needs, and even if they don't, I'm quite sure you will find what you are looking for in some kind of heap structure. Which one is best will probably depend on how your code is structured, how often you expect a value to change, etc.
In short:
A heap is a data structure in which it is fast to find and extract the smallest (or largest) element. For most heaps it is also possible to build or restructure the heap in linear time or better. You could start out from this page on Wikipedia if you want to learn more about heaps.
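For illustration, here is a minimal sketch of that approach (the Number type, its value field and the comparison are placeholders for the real shared objects): elements live in a std::vector managed with the heap algorithms, and a repair() call rebuilds the heap in O(n) after external mutations may have broken the ordering.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical stand-in for the shared "number" objects; the key may be
// mutated externally at any time, which is what breaks std::set.
struct Number {
    double value;
};

using NumberPtr = std::shared_ptr<Number>;

// Min-heap on value: invert the comparison so the smallest element sits on top.
static bool heapCmp(const NumberPtr& a, const NumberPtr& b) {
    return a->value > b->value;
}

struct LazyHeap {
    std::vector<NumberPtr> data;

    void insert(NumberPtr n) {
        data.push_back(std::move(n));
        std::push_heap(data.begin(), data.end(), heapCmp);   // O(log n)
    }

    // "Give me a small element": exact only while the heap property still holds.
    const NumberPtr& front() const { return data.front(); }

    NumberPtr remove_front() {
        std::pop_heap(data.begin(), data.end(), heapCmp);    // top moves to the back
        NumberPtr n = std::move(data.back());
        data.pop_back();
        return n;
    }

    // Call after external mutations may have broken the ordering: O(n) rebuild.
    void repair() {
        std::make_heap(data.begin(), data.end(), heapCmp);
    }
};
```

How often repair() needs to run is the design decision: calling it before every remove_front gives correct results at O(n) per pop, while calling it only occasionally gives the "small but not necessarily smallest element" behaviour described in the question.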

Related

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?
I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power-of-2 sizes, and traditionally resizes at 75% occupancy. Since I know I am going to be filling up the hash table, I pre-emptively double its capacity to keep it sparse enough. This might be less efficient when adding a small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it, I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my previous red/black tree implementation. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would help make a cheaper hash.
Billy ONeal suggested that, given a small number of elements (1024), a simple linear search in a fixed array would be faster. I followed his advice and implemented one for a side-by-side comparison. On my target hardware (roughly first-generation iPhone) the hash table outperformed a linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependent. Cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice, try it and profile it first.
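For illustration, a rough sketch of such a table (not the code from the linked example; the integer-mix hash stands in for MurmurHash3, and the exact types may need adjusting for an environment without the STL or recent C++): a power-of-two capacity fixed at compile time, key 0 reserved as "empty", clear() done with a memset, and no resizing or deletion.

```cpp
#include <cstring>

// Fixed-capacity, linear-probe hash table sketch. Capacity must be a power of
// two and is fixed at compile time; key 0 is reserved to mean "empty slot".
// The table is assumed never to be filled completely (the capacity is chosen
// as roughly twice the maximum number of elements), so probing always
// terminates at an empty slot.
template <unsigned Capacity>
class FixedHashTable {
    struct Entry { unsigned key; void* value; };
    Entry slots_[Capacity];

    // Cheap integer mix; the original used MurmurHash3 on randomized keys.
    static unsigned hash(unsigned k) {
        k ^= k >> 16;
        k *= 0x45d9f3bu;
        k ^= k >> 16;
        return k & (Capacity - 1);
    }

public:
    FixedHashTable() { clear(); }

    // Replaces resizing and deletion: one memset wipes the whole table.
    void clear() { std::memset(slots_, 0, sizeof(slots_)); }

    void insert(unsigned key, void* value) {
        for (unsigned i = hash(key); ; i = (i + 1) & (Capacity - 1)) {
            if (slots_[i].key == 0 || slots_[i].key == key) {
                slots_[i].key = key;
                slots_[i].value = value;
                return;
            }
        }
    }

    bool lookup(unsigned key, void*& value) const {
        for (unsigned i = hash(key); ; i = (i + 1) & (Capacity - 1)) {
            if (slots_[i].key == key) { value = slots_[i].value; return true; }
            if (slots_[i].key == 0) return false;   // empty slot: key not present
        }
    }
};

// At most 1024 elements, so start with the capacity already doubled.
typedef FixedHashTable<2048> AddressMap;
```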

How can we benefit from vs2010 hash_map's less?

See this if you don't know that VS2010's hash_map actually requires a total ordering, and hence a user-defined less.
One of the answers said that binary search is possible, but I don't think so, because:
The hash function should be uniform, and it is better that the load factor is less than 1, which means that in most cases there is one element per hash slot, i.e. no need for binary search.
Obviously, it will slow down insertion because of having to locate the appropriate position.
How does the hash_map benefit from this design, and how do we utilize it?
thanks
The hash function should be uniform, and it is better that the load factor is less than 1, which means that in most cases there is one element per hash slot, i.e. no need for binary search.
There won't always be just one element per hash slot: some buckets will have to keep more than one key. Unless the input is only from a pre-determined restricted set of values (i.e. perfect hashing), the hash function will have to deal with more inputs than the outputs that it can produce. There will be collisions; this is unavoidable in an implementation as generic as this one. However, good hash functions should produce well-distributed hashes, and that makes the number of elements per hash slot stay low.
Obviously, it will slow down insertion because of having to locate the appropriate position.
Assuming a good hash function and non-degenerate input (input designed so that many elements result in the same hash), there will always be only a few keys per bucket. Inserting into such a binary search tree won't be that big of a cost, and that little cost may bring benefits elsewhere (searches may be faster than on an implementation with a linked list). And in case of degenerate input, the hash map will degenerate into a binary search tree, which is much better than a simple linked list.
Your question is largely irrelevant in practice, because C++ now supplies unordered_map etc. which use an Equal predicate rather than a less-than comparator.
However, consider a hash_map<string, ...>. Clearly, the value space of string is larger than that of size_t, so for any hash function there will be values that have the same hash and so are placed in the same bucket. In the pathological situation where all the items in the hash table are placed in the same bucket, exploiting ordering among keys will result in improved speed of access, insertion and removal.
Note that search on an ordered list (or binary tree) is O(log n) as opposed to O(n).
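To illustrate why ordering within a bucket helps, here is a toy sketch (not the VS2010 implementation) of a single bucket kept as a sorted vector, so that even a fully-collided table can be searched in O(log n) comparisons:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy bucket: all keys that hash to the same slot, kept sorted by operator<.
struct Bucket {
    std::vector<std::string> keys;

    void insert(const std::string& key) {
        // O(log k) to find the position, O(k) to shift; cheap while buckets stay small.
        std::vector<std::string>::iterator it =
            std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key)
            keys.insert(it, key);
    }

    bool contains(const std::string& key) const {
        // O(log k) even if every element of the table collided into this bucket.
        std::vector<std::string>::const_iterator it =
            std::lower_bound(keys.begin(), keys.end(), key);
        return it != keys.end() && *it == key;
    }
};
```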

Caching of floating point values in C++

I would like to assign a unique object to a set of floating point values. Doing so, I am exploring two different options:
The first option is to maintain a static hash map (std::unordered_map<double,Foo*>) in the class and to avoid creating duplicates in the first place. This means that instead of calling the constructor, I will check whether the value is already in the hash map and, if so, reuse it. I would also need to remove the value from the hash map in the destructor.
The second option is to allow duplicate values during creation, and only try to sort them all at once and detect duplicates after all values have been created. I guess I would need hash maps for that sorting as well. Or would an ordered map (std::map) work just as well then?
Is there some reason to expect that the first option (which I like better) would be considerably slower in any situation? That is, would finding duplicate entries be much faster if I perform it on all entries at once rather than one entry at a time?
I am aware of the pitfalls when caching floating point numbers and will prevent not-a-numbers and infinities from being added to the map. A few duplicate entries for the same constant are also not a problem, should they occur - it will only result in a very small speed penalty.
Depending on the source and the possible values of the floating point numbers, a bigger problem might be defining a hash function which respects equality. (0, Inf and NaN are the problem values: most floating point formats have two representations for 0, +0.0 and -0.0, which compare equal; I think the same thing holds for Inf. And two NaNs always compare unequal, even when they have exactly the same bit pattern.)
Other than that, in all questions of performance, you have to measure.
You don't indicate how big the set is likely to become. Unless it is enormous, if all values are inserted up front, the fastest solution is often to use push_back on an std::vector, then std::sort and, if desired, std::unique after the vector has been filled. In many cases, using an std::vector and keeping it sorted is faster even when insertions and removals are frequent. (When you get a new request, use std::lower_bound to find the entry point; if the value at the location found is not equal, insert a new entry at that point.) The improved locality of std::vector largely offsets any additional costs due to moving the objects during insertion and deletion, and often even the fact that access is O(lg n) rather than O(1). (In one particular case, I found that the break-even point between a hash table and a sorted std::vector was around 100,000 entries.)
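A minimal sketch of that sorted-std::vector approach, applied to the caching question above (Foo and its construction are placeholders for the real object type): std::lower_bound finds the slot, and a new entry is created only when the value is not already present.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Foo { double value; };   // placeholder for the real object type

// Vector of (value, object) pairs kept sorted by value.
std::vector<std::pair<double, Foo*>> cache;

Foo* getOrCreate(double value) {
    auto it = std::lower_bound(cache.begin(), cache.end(), value,
        [](const std::pair<double, Foo*>& entry, double v) { return entry.first < v; });
    if (it != cache.end() && it->first == value)
        return it->second;                          // already cached: reuse it
    Foo* foo = new Foo{value};                      // not present: create and insert in place
    cache.insert(it, std::make_pair(value, foo));
    return foo;
}
```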
Have you considered actually measuring it?
None of us can tell you how the code you're considering will actually perform. Write the code, compile it, run it and measure how fast it runs.
Spending time trying to predict which solution will be faster is (1) a waste of your time, and (2) likely to yield incorrect results.
But if you want an abstract answer, it is that it depends on your use case.
If you can collect all the values, and sort them once, that can be done in O(n lg n) time.
If you insert the elements one at a time into a data structure with the performance characteristics of std::map, then each insertion will take O(lg n) time, and so, performing n insertions will also take O(n lg n) time.
Inserting into a hash map (std::unordered_map) takes constant time on average, and so n insertions can be done in O(n). So in theory, for sufficiently large values of n, a hash map will be faster.
In practice, in your case, no one knows. Which is why you should measure it if you're actually concerned about performance.

Choosing N random numbers from a set

I have a sorted set (std::set to be precise) that contains elements with an assigned weight. I want to randomly choose N elements from this set, while the elements with higher weight should have a bigger probability of being chosen. Any element can be chosen multiple times.
I want to do this as efficiently as possible - I want to avoid any copying of the set (it might get very large) and run in O(N) time if possible. I'm using C++ and would like to stick to an STL + Boost only solution.
Does anybody know if there is a function in STL/Boost that performs this task? If not, how would I implement one?
You need to calculate (and possibly cache, if you care about performance) the sum of all weights in your set. Then, generate N random numbers ranging up to this value. Finally, iterate over your set, keeping a running sum of the weights encountered so far. Inspect all the (remaining) random numbers: if a number falls between the previous and the current value of the running sum, pick the corresponding element from the set and remove that random number. Stop when your list of random numbers is empty or you've reached the end of the set.
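A sketch of that single pass, assuming the elements carry a weight field (the Item type here is a placeholder for the real element type); the N draws are sorted first so the set is walked only once:

```cpp
#include <algorithm>
#include <random>
#include <set>
#include <vector>

struct Item {
    int    id;
    double weight;
    bool operator<(const Item& other) const { return id < other.id; }
};

// Pick n items with probability proportional to weight; items may repeat.
std::vector<Item> pickWeighted(const std::set<Item>& items, std::size_t n, std::mt19937& rng) {
    double total = 0.0;
    for (const Item& it : items) total += it.weight;

    // Draw n points in [0, total) and sort them so one pass over the set suffices.
    std::uniform_real_distribution<double> dist(0.0, total);
    std::vector<double> draws(n);
    for (double& d : draws) d = dist(rng);
    std::sort(draws.begin(), draws.end());

    std::vector<Item> picked;
    picked.reserve(n);
    double cumulative = 0.0;
    std::size_t next = 0;
    for (const Item& it : items) {
        cumulative += it.weight;
        while (next < n && draws[next] < cumulative) {   // every draw in this slice selects `it`
            picked.push_back(it);
            ++next;
        }
    }
    return picked;
}
```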
I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms: http://www.cse.unr.edu/~banerjee/selection.htm
As for "as efficiently as possible," that would depend on some characteristics of the data. In the application of the weighted roulette wheel, when searching for the index you could consider a binary search instead. However, it is not the case that each slot of the roulette wheel is equally likely, so it may make sense to examine them in order of their weights.
A lot depends on the amount of extra storage you're willing to expend to make the selection faster.
If you're not willing to use any extra storage, @Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing to use some extra storage (and possibly a different data structure than std::set) you could create a tree (like a set uses) but at each node of the tree, you'd also store the (weighted) number of items to the left of that node. This lets you map from a generated number to the correct associated value with logarithmic (rather than linear) complexity.

data structure for storing an array of strings in memory

I'm considering a data structure for storing a large array of strings in memory. The strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as it can be. Saving memory is not important. I am inclined towards the standard hash_set structure, which allows searching for elements in roughly constant time, but it's not guaranteed that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. For a comparison against a hash table, see this question.
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.
If the strings are immutable - so insertion/deletion is "infrequent", so to speak - another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph, which might* be faster than a hash table and has a better worst-case guarantee.
*Standard disclaimer applies: depending on the use case, implementation, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
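A minimal sketch of that atom technique (the function names here are illustrative): intern each string once and pass the pointer around, so later equality tests become pointer comparisons.

```cpp
#include <string>
#include <unordered_set>

// Interned strings live in this set for the lifetime of the program; the
// returned pointer is the "atom" that gets stored and compared everywhere else.
// (Pointers to elements of an unordered_set stay valid across rehashing.)
const std::string* intern(const std::string& s) {
    static std::unordered_set<std::string> pool;
    return &*pool.insert(s).first;
}

bool sameString(const std::string* a, const std::string* b) {
    return a == b;   // pointer equality replaces a character-by-character compare
}
```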
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
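A sketch of that setup using std::vector<std::string>, std::sort and std::binary_search (the answer suggests raw char*s, which would only change the comparison):

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct StringTable {
    std::vector<std::string> entries;

    // Build once at startup (push_back all strings first), then never modify.
    void build() {
        std::sort(entries.begin(), entries.end());
    }

    // O(log n) string comparisons per lookup, with good cache locality.
    bool contains(const std::string& s) const {
        return std::binary_search(entries.begin(), entries.end(), s);
    }
};
```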
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy mentioned in Raymond Chen's blog would be efficient.