Caching of floating point values in C++

Caching of floating point values in C++ - c++

I would like to assign a unique object to a set of floating point values. Doing so, I am exploring two different options:
The first option is to maintain a static hash map (std::unordered_map<double,Foo*>) in the class and to avoid that duplicates are created in the first place. This means that instead of calling the constructor, I will check if the value is already in the hash and if so, reuse this. I would also need to remove the value from the hash map in the destructor.
The second option is to allow duplicate values during creation, only to try to sort them all at once and detect duplicates after all values have been created. I guess I would need hash maps for that sorting as well. Or would an ordered map ('std::map) work just as well then?
Is there some reason to expect that the first option (which I like better) would be considerably slower in any situation? That is, would finding duplicate entries be much faster if I perform it all entries at once rather than one entry at a time?
I am aware of the pitfalls when cashing floating point numbers and will prevent not-a-numbers and infinities to be added to the map. Some duplicate entries for the same constant is also not a problem, should this occur for a few entries - it will only result in a very small speed penalty.

Depending on the source and the possible values of the floating point
numbers, a bigger problem might be defining a hash function which
respects equality. (0, Inf and NaN are the problem values—most
floating point formats have two representations for 0, +0.0 and
-0.0, which compare equal; I think the same thing holds for Inf. And
two NaN always compare unequal, even when they have exactly the same bit
pattern.)
Other than that, in all questions of performance, you have to measure.
You don't indicate how big the set is likely to become. Unless it is
enormous, if all values are inserted up front, the fastest solution is
often to use push_back on an std::vector, then std::sort and, if
desired, std::unique after the vector has been filled. In many
cases, using an std::vector and keeping it sorted is faster even when
insertions and removals are frequent. (When you get a new request, use
std::lower_bound to find the entry point; if the value at the location
found is not equal, insert a new entry at that point.) The improved
locality of std::vector largely offsets any additional costs due to
moving the objects during insertion and deletion, and often even the
fact that access is O(lg n) rather than O(1). (In one particular case,
I found that the break even point between a hash table and as sorted
std::vector was around 100,000 entries.)

Have you considered actually measuring it?
None of us can tell you how the code you're considering will actually perform. Write the code, compile it, run it and measure how fast it runs.
Spending time trying to predict which solution will be faster is (1) a waste of your time, and (2) likely to yield incorrect results.
But if you want an abstract answer, it is that it depends on your use case.
If you can collect all the values, and sort them once, that can be done in O(n lg n) time.
If you insert the elements one at a time into a data structure with the performance characteristics of std::map, then each insertion will take O(lg n) time, and so, performing n insertions will also take O(n lg n) time.
Inserting into a hash map (std::unordered_map) takes constant time, and so n insertions can be done in O(n). So in theory, for sufficiently large values of n, a hash map will be faster.
In practice, in your case, no one knows. Which is why you should measure it if you're actually concerned about performance.

Related

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?

I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power of 2 sizes, and traditionally resizes at 75% occupation. Since I know I am going to be filling up the hash table, I pre-emptively double it's capacity to keep it sparse enough. This might be less efficient when adding small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my red/black tree implementation before. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would help make a cheaper hash.
Billy ONeal suggested, given a small number of elements (1024) that a simple linear search in a fixed array would be faster. I followed his advice and implemented one for side by side comparison. On my target hardware (roughly first generation iPhone) the hash table outperformed a linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependant. Cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice and try and profile it first.

Implementation of a locally ordered set or priority queue?

I have a rather large set of objects that represent numbers and I want to select such numbers according to a custom ordering. This ordering includes several criteria such as the type of their representation (some numbers are represented by an interval), their integrality and ultimatively their value. These numbers are shared throughout the programs (shared pointers) and there is nothing I can do about this.
However, the elements properties can change at any time such that the order changes while I can't notify the container about this. For example, some operations require a refinement of a number that is represented by an interval and during this refinement, the exact value can be found. Thereby, the number changes from the interval representation to a rational number, possibly even an integer. This change, due to the shared instance, immediately propagates to the number in the container and breaks the ordering (and I don't even know which number changed). This totally breaks std::set.
So what I'd like to have is a container that tries to be sorted, but does not rely on this. Whenever an operation detects an incorrect ordering, this ordering should be corrected locally. For example insert would insert the element (using binary search) and always check if the ordering of the current element (w.r.t. the neighbors) is correct.
I'd be willing to accept that "give me the smallest element" would then be only "give me a small element" and that find or remove would have linear complexity: I only need front, insert and remove_front to be particularly efficient.
Is there any implementation that does something like this?
How would you implement this?

If you are looking for an algorithm in the standard library, you should take a look at:
std::make_heap
std::pop_heap
std::push_heap
In <algorithm>. They might fit your need, and even if they don't I'm quite sure you will find what you are looking for in some kind of heap structure. Which one will probably depend on how your code is structured, and how often you expect a value to change etc.
In short:
A heap is a data structure in which it is fast to find and extract the smallest (or largest) element. It is also for most heaps possible to create restructure the heap in linear time or better. You could start out from this page on Wikipedia if you want to learn more about heaps.

How to make a fast dictionary that contains another dictionary?

I have a map<size_t, set<size_t>>, which, for better performance, I'm actually representing as a lexicographically-sorted vector<pair<size_t, vector<size_t>>>.
What I need is a set<T> with fast insertion times (removal doesn't matter), where T is the data type above, so that I can check for duplicates (my program runs until there are no more unique T's being generated.).
So far, switching from set to unordered_set has turned out to be quite beneficial (it makes my program run > 25% faster), but even now, inserting T still seems to be one of the main bottlenecks.
The maximum number of integers in a given T is around ~1000, and each integer is also <= ~1000, so the numbers are quite small (but there are thousands of these T's being generated).
What I have already tried:
Using unsigned short. It actually decreases performance slightly.
Using Google's btree::btree_map.
It's actually much slower because I have to work around the iterator invalidation.
(I have to copy the keys, and I think that's why it turned out slow. It was at least twice as slow.)
Using a different hash function. I haven't found any measurable difference as long as I use something reasonable, so it seems like this can't be improved.
What I have not tried:
Storing "fingerprints"/hashes instead of the actual sets.
This sounds like the perfect solution, except that the fingerprinting function needs to be fast, and I need to be extremely confident that collisions won't happen, or they'll screw up my program.
(It's a deterministic program that needs exact results; collisions render it useless.)
Storing the data in some other compact, CPU-friendly way.
I'm not sure how beneficial this would be, because it might involve copying around data, and most of the performance I've gained so far is by (cleverly) avoiding copying data in many situations.
What else can I do to improve the speed, if anything?

I am under the impression that you have 3 different problems here:
you need the T itself to be relatively compact and easy to move around
you need to quickly check whether a T is a possible duplicate of an already existing one
you finally need to quickly insert the new T in whatever data structure you have to check for duplicates
Regarding T itself, it is not yet as compact as it could be. You could probably use a single std::vector<size_t> to represent it:
N pairs
N Indexes
N "Ids" of I elements each
all that can be linearized:
[N, I(0), ..., I(N-1),
R(0) = Size(Id(0)), Id(0, 0), ... , Id(0, R(0)-1),
R(1) = ... ]
and this way you have a single chunk of memory.
Note: depending on the access pattern you may have to tweak it, specifically if you need random access to any ID.
Regarding the possibility of duplicates, a hash-map seems indeed quite appropriate. You will need a good hash function, but with a single array of size_t (or unsigned short if you can, it is smaller), you can just pick MurmurHash or CityHash or SipHash. They all are blazing fast and do their damnest to produce good quality hash (not cryptographic ones, emphasis is on speed).
Now, the question is when is it slow when checking for duplicates.
If you spend too much time checking for non-existing duplicates because the hash-map is too big, you might want to invest in a Bloom Filter in front of it.
Otherwise, check your hash function to make sure that it is really fast and has a low collision rate and your hash-map implementation to make sure it only ever computes the hash once.
Regarding insertion speed. Normally a hash-map, specifically if well-balanced and pre-sized, should have one of the quickest insertion. Make sure you move data into it and do not copy it; if you cannot move, it might be worth using a shared_ptr to limit the cost of copying.

Don't be afraid of collisions, use cryptographic hash. But choose a fast one. 256 bit collision is MUCH less probable than hardware error. Sun used it to deduplicate blocks in ZFS. ZFS uses SHA256. Probably you can use less secure hash. If it takes $1000000 to find one collision hash isn't secure but one collision don't seem to drop your performance. Many collisions would cost many $1000000. You can use something like (unordered) multimap<SHA, T> to deal with collisions. By the way, ANY hash table suffer from collisions (or takes too many memory), so ordered map (rbtree in gcc) or btree_map has better guaranteed time. Also hash table can be DOSed via hash collisions. Probably a secret salt can solve this problem. It is due to table size is much less than number of possible hashes.

You can also:
1) use short ints
2) reinterpret your array as an array of something like uint64_t for fast comparison (+some trailing elements), or even reinterpret it as an array of 128-bit values (or 256-bit, depending on your CPU) and compare via SSE. This should push your performance to memory speed limit.
From my experience SSE works fast with aligned memory access only. uint64_t comparison probably needs alignment for speed too, so you have to allocate memory manually with proper alignment (allocate more and skip first bytes). tcmalloc is 16 byte-aligned, uint64_t-ready. It is strange that you have to copy keys in btree, you can avoid it using shared_ptr. With fast comparisons and slow hash btree or std::map may turn out to be faster than hash table. I guess any hash is slower than memory. You can also calculate hash via SSE and probably find a library that does it.
PS I strongly recommend you to use profiler if you don't yet. Please tell % of time your program spend to insert, compare in insert and calculate hash.

How can we benefit from vs2010 hash_map's less?

See this if you don't know vs2010 actually requires total ordering, and hence it require a user defined less.
one of answer said it is possible for binary search, but I don't think so, this because
The hash function should be uniform, and it is better that load factor less than 1, it means, in most case, one element per hash slot. i.e. no need binary search.
Obviously, it will slow down insertion because of locating the appropriate position.
How does hash-map benefit from this design? and how do we utilize this design?
thanks

The hash function should be uniform, and it is better that load factor less than 1, it means, in most case, one element per hash slot. i.e. no need binary search.
There won't be at most one element per hash slot. Some buckets will have to keep more than one key. Unless the input is only from a pre-determined restricted set of values (i.e. perfect hashing), the hash function will have to deal with more inputs than the outputs that it can produce. There will be collisions; this is unavoidable in an implementation as generic as this one. However, good hash functions should produce well-distributed hashes and that makes the number of elements per hash slot stay low.
Obviously, it will slow down insertion because of locating the appropriate position.
Assuming a good hash function and non-degenerate input (input designed so that many elements result in the same hash), there will always be only a few keys per bucket. Inserting into such a binary search tree won't be that big of a cost, and that little cost may bring benefits elsewhere (searches may be faster than on an implementation with a linked list). And in case of degenerate input, the hash map will degenerate into a binary search tree, which is much better than a simple linked list.

Your question is largely irrelevant in practice, because C++ now supplies unordered_map etc. which use an Equal predicate rather than a less-than comparator.
However, consider a hash_map<string, ...>. Clearly, the value space of string is larger than that of size_t, so for any hash function there will be values that have the same hash and so are placed in the same bucket. In the pathological situation where all the items in the hash table are placed in the same bucket, exploiting ordering among keys will result in improved speed of access, insertion and removal.
Note that search on an ordered list (or binary tree) is O(log n) as opposed to O(n).

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations. Inserting keys on from a very sparse domain (all 32bit integers, and approx. 100 are set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (like, 500k, but only 100 different ones).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?

Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.

I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.

For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stackoverflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this will be very simple so the coding and maintenance effort for an assembler implementation shouldn't be unduly hard to maintain.

You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.

A randomized data structure might be perfect for your job. Take a look at the skip list – though I don't know any decend C++ implementation of it. I intended to submit one to Boost but never got around to do it.

Maybe a set with a b-tree (instead of binary tree) as internal data structure. I found this article on codeproject which implements this.

Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?

How much memory do you have? 32-bits take "only" 4GB/8 bytes, which comes to 512MB, not much for a high-end server. That would make your insertions O(1). But that could make the iteration slow. Although skipping all words with only zeroes would optimize away most iterations. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.

Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure a naive unordered set of elements packed in a fixed array with a simple swap to front when an insert collisides is too slow? Its a simple experiment that might show you have memory locality issues rather than algorithmic issues.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js