Alternative to vectors for large data sets? C++

I am looking for a data structure that holds data in the order it is inserted (like a vector) and that needs to hold millions of unsigned longs. The key is that it needs to have a lookup that's better than O(log n), because it will get searched against a similar vector of the same size. Does something like this exist?
If I insert 10, 20, 30 and then iterate over the set, I need to guarantee the order of 10, 20, 30. My data is a string that I converted into an unsigned long to reduce memory use; the conversion is reversible.
EDIT:
Since people are asking, I am comparing two vectors against each other (both very large in size) to get the difference.
Small Example:
vector 1: 10 20 30 40 50 60
vector 2: 11 24 30 40 55 70 90
result: 30 40

I have never used it myself, and it might be out of date compared to recent C++ features (the last update is from 2011), but STXXL is meant to be a set of containers and algorithms built for very large amounts of data.
It might fit your need.
The core of STXXL is an implementation of the C++ standard template
library STL for external memory (out-of-core) computations, i. e.,
STXXL implements containers and algorithms that can process huge
volumes of data that only fit on disks. While the closeness to the STL
supports ease of use and compatibility with existing applications,
another design priority is high performance.
The key features of STXXL are:
Transparent support of parallel disks. The library provides implementations of basic parallel disk algorithms. STXXL is the only
external memory algorithm library supporting parallel disks.
The library is able to handle problems of very large size (tested to up to dozens of terabytes).
Improved utilization of computer resources. STXXL implementations of external memory algorithms and data structures benefit from
overlapping of I/O and computation.
Small constant factors in I/O volume. A unique library feature called "pipelining" can save more than half the number of I/Os, by
streaming data between algorithmic components, instead of temporarily
storing them on disk. A development branch supports asynchronous
execution of the algorithmic components, enabling high-level task
parallelism.
Shorter development times due to well known STL-compatible interfaces for external memory algorithms and data structures.
STL algorithms can be directly applied to STXXL containers; moreover, the I/O complexity of the algorithms remains optimal in most
of the cases.
For internal computation, parallel algorithms from the MCSTL or the
libstdc++ parallel mode are optionally utilized, making the algorithms
inherently benefit from multi-core parallelism.

A hash map is one way to get faster lookup than a sorted vector: O(1) on average instead of O(log n). You need C++11 support to use std::unordered_map.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
To preserve the order of the data, the only way would be to maintain a vector beside it that stores the values as well.
Before you jump to using it you should consider how you are going to use this data structure (access pattern). Also consider what the data you will be getting is likely to be.
Here is boost's version of the same thing http://www.boost.org/doc/libs/1_53_0/doc/html/unordered.html
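A minimal sketch of that idea, applied to the small example from the question. It assumes the values fit in a 64-bit unsigned integer and uses std::unordered_set rather than unordered_map, since only membership is needed here. The names common_in_order and example_from_question are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Returns the elements of `first` that also occur in `second`,
// in the order they appear in `first`.
std::vector<std::uint64_t> common_in_order(const std::vector<std::uint64_t>& first,
                                           const std::vector<std::uint64_t>& second) {
    // Build the lookup table once: O(m) expected time, extra memory for the hash set.
    std::unordered_set<std::uint64_t> lookup(second.begin(), second.end());

    std::vector<std::uint64_t> result;
    for (std::uint64_t value : first) {
        if (lookup.count(value) != 0) {  // O(1) expected per lookup
            result.push_back(value);
        }
    }
    return result;
}

// Usage with the numbers from the question's small example.
void example_from_question() {
    const std::vector<std::uint64_t> v1{10, 20, 30, 40, 50, 60};
    const std::vector<std::uint64_t> v2{11, 24, 30, 40, 55, 70, 90};
    const std::vector<std::uint64_t> expected{30, 40};
    assert(common_in_order(v1, v2) == expected);
}
```

The vectors keep the insertion order, and the hash set exists only for the duration of the comparison, so the extra memory is temporary.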

I think what you should use is unordered_map combined with maybe a doubly-linked list for the order.
So every time you add a new item to your database, you first add it to the front (or the end) of the linked list, and then add it to the hash map, where the key is the value itself (the unsigned long) and the "value" (from the key/value pair) is a pointer to the object in the linked list.
So now if you want a fast lookup you look in the hashmap, and if you want to iterate by order you use the linked list.
Of course, when you want to remove an object you have to remove it from both, but complexity-wise it's the same (O(1) amortized for everything).
This will of course increase your memory use by a factor of 2 or 3 compared to just using a hash map.
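The combination described above could look roughly like this. It is a sketch, not a tuned implementation: the class name InsertionOrderedSet is hypothetical, and it stores list iterators (instead of raw pointers) in the hash map, which works because std::list iterators stay valid until their element is erased.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

// Hypothetical helper class: a set that remembers insertion order.
class InsertionOrderedSet {
public:
    // Appends `value` if it is not present yet; returns true if it was inserted.
    bool insert(std::uint64_t value) {
        if (index_.count(value) != 0) return false;
        order_.push_back(value);
        index_.emplace(value, std::prev(order_.end()));
        assert(order_.size() == index_.size());  // both structures stay in sync
        return true;
    }

    // O(1) expected lookup through the hash map.
    bool contains(std::uint64_t value) const { return index_.count(value) != 0; }

    // Removes `value` from both structures; returns true if it was present.
    bool erase(std::uint64_t value) {
        auto it = index_.find(value);
        if (it == index_.end()) return false;
        order_.erase(it->second);  // O(1): the map gives us the list position
        index_.erase(it);
        assert(order_.size() == index_.size());
        return true;
    }

    // Iterating over the list visits the values in insertion order.
    const std::list<std::uint64_t>& items() const { return order_; }

private:
    std::list<std::uint64_t> order_;
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> index_;
};
```

If elements are never removed, a std::vector for the order plus an unordered_set for lookup is simpler and uses less memory than the list.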

Related

Deciding when to use a hash table

I was solving a competitive programming problem with the following requirements:
I had to maintain a list of unique 2D points (x,y); the number of unique points would be less than 500.
My idea was to store them in a hash table (a C++ unordered_set, to be specific), and each time a node turned up I would look it up in the table and insert it if it was not already there.
I also know for a fact that I wouldn't be doing more than 500 lookups.
I saw some solutions that simply search through an (unsorted) array and check whether the node is already there before inserting.
My question is: is there any reasonable way to guess when I should use a hash table over a manual search over keys, without having to benchmark them?

My question is: is there any reasonable way to guess when I should use a hash table over a manual search over keys, without having to benchmark them?
I am guessing you are familiar with basic algorithmics and time complexity and with the C++ standard containers, and that you know that, with luck, hash table access is O(1).
If the hash table code (or some balanced tree code, e.g. using std::map - assuming there is an easy order on keys) is more readable, I would prefer it for that readability reason alone.
Otherwise, you can make an educated guess from the approximate timing of various operations on a PC. BTW, the entire http://norvig.com/21-days.html page is worth reading.
Basically, memory accesses are much slower than everything else in the CPU. The CPU cache is extremely important. A typical memory access with a cache miss that requires fetching data from DRAM modules is several hundred times slower than an elementary arithmetic operation or machine instruction (e.g. adding two integers in registers).
In practice, it does not matter that much, as long as your data is tiny (e.g. less than a thousand elements), since in that case it is likely to sit in L2 cache.
Searching linearly in an array is really fast (since it is very cache friendly), up to several thousand (small) elements.
IIRC, Herb Sutter mentions in some video that even inserting an element inside a vector is practically (but unintuitively) faster, taking into account the time needed to move slices, than inserting it into some balanced tree (or perhaps some other container, e.g. a hash table), up to a container size of several thousand small elements. This is on a typical tablet, desktop or server microprocessor with a multi-megabyte cache. YMMV.
If you really care that much, you cannot avoid benchmarking.
Notice that 500 pairs of integers is probably fitting into the L1 cache!
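For reference, the linear-search variant from the question is only a few lines. This is a sketch under the assumption that a point is a pair of ints; Point, insert_unique and usage_example are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point {
    int x;
    int y;
};

inline bool operator==(const Point& a, const Point& b) {
    return a.x == b.x && a.y == b.y;
}

// Appends `p` only if it is not already stored; returns true if it was inserted.
// The scan is O(n), but a few hundred 8-byte elements sit contiguously in memory.
bool insert_unique(std::vector<Point>& points, const Point& p) {
    if (std::find(points.begin(), points.end(), p) != points.end()) {
        return false;
    }
    points.push_back(p);
    return true;
}

void usage_example() {
    std::vector<Point> points;
    const bool first = insert_unique(points, Point{1, 2});
    const bool second = insert_unique(points, Point{1, 2});  // duplicate is rejected
    assert(first && !second);
    assert(points.size() == 1);
}
```

With at most 500 points and 500 lookups, the worst case is about 250,000 comparisons in total. An unordered_set version would additionally need a hash function for Point, which the standard library does not provide.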

My rule of thumb is to assume the processor can deal with 10^9 operations per second.
In your case there are only 500 entries, so an algorithm up to O(N^2) should be safe. By using a contiguous data structure like vector you can benefit from fast cache hits. Also, a hash function can sometimes be costly in terms of constant factors. However, if you have a data size of 10^6, the safe complexity might be only O(N) in total. In that case you might need to consider a hash map with O(1) cost for a single lookup.

You can use Big O complexity to roughly estimate the performance. For a hash table, searching for an element is between O(1) and O(n) in the worst case. That means that in the best case your access time is independent of the number of elements in your map, but in the worst case it depends linearly on the size of your hash table.
A balanced binary search tree has a guaranteed search complexity of O(log n). That means that searching for an element always depends on the size of the tree, but in the worst case it is faster than a hash table.
You can look up some Big O Complexities at this handy website here: http://bigocheatsheet.com/

How about storing the array indices in a map

My program uses boost::unordered_map a lot, and the map has about 40 million entries. The program doesn't do insertion or deletion very often. It just randomly accesses entries using keys.
I'm wondering whether it will improve performance (in terms of the speed of accessing entries) if I store my entry values (about 1 KB each) in a flat array (maybe a std::vector) and use boost::unordered_map to store the mapping from keys to indices into this array.
Thanks,
Cui

Yes, that could seriously speed up things. In fact, that's what Boost flat_map is for :)
The docs say, under "Non-standard containers":
Using sorted vectors instead of tree-based associative containers is a well-known technique in C++ world. Matt Austern's classic article Why You Shouldn't Use set, and What You Should Use Instead (C++ Report 12:4, April 2000, PDF) was enlightening:
...
This gives you more than you asked for because you don't even need the extraneous index. This gives you more locality of reference and a lower memory footprint. Most importantly, it gives you lower complexity (-> fewer bugs) and a drop-in replacement for std::[unordered_]map in terms of interface.

Storing the values in contiguous memory, as std::vector does, will increase cache locality. This can make a pretty big difference in performance, but it depends on the access pattern.
If you're hunting for performance, remember the golden rule:
always measure!
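The layout from the question (values in a flat array, the hash map holding only indices) could be sketched as follows. This uses std::unordered_map instead of boost::unordered_map to stay self-contained, and IndexedStore is a hypothetical name. Whether it is actually faster for 1 KB values under random access is exactly the thing to measure.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical store: values live contiguously, the hash map only holds indices.
template <typename Key, typename Value>
class IndexedStore {
public:
    // Inserts a new entry or overwrites the value of an existing key.
    void put(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            values_[it->second] = std::move(value);
            return;
        }
        index_.emplace(key, values_.size());
        values_.push_back(std::move(value));
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    // The pointer is invalidated when a later put() makes the vector reallocate.
    const Value* find(const Key& key) const {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        assert(it->second < values_.size());
        return &values_[it->second];
    }

private:
    std::vector<Value> values_;
    std::unordered_map<Key, std::size_t> index_;
};
```

The hash map nodes become small (key plus index), so more of the index fits in cache, while each lookup still costs one access into the index and one into the value array.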

Why does CouchDB use an append-only B+ tree and not a HAMT

I'm reading up on data structures, especially immutable ones like the append-only B+ tree used in CouchDB and the hash array mapped trie (HAMT) used in Clojure and some other functional programming languages.
The main reason datastructures that work well in memory might not work well on disk appears to be time spent on disk seeks due to fragmentation, as with a normal binary tree.
However, HAMT is also very shallow, so doesn't require any more seeks than a B tree.
Another suggested reason is that deletions from an array mapped trie are more expensive than from a B tree. This is based on the assumption that we're talking about a dense vector, and doesn't apply when using either as a hash map.
What's more, it seems that a B tree does more rebalancing, so using it in an append-only manner produces more garbage.
So why do CouchDB and practically every other database and filesystem use B trees?
[edit] fractal trees? log-structured merge tree? mind = blown
[edit] Real-life B trees use a degree in the thousands, while a HAMT has a degree of 32. A HAMT of degree 1024 would be possible, but slower due to popcnt handling 32 or 64 bits at a time.

B-trees are used because they are a well-understood algorithm that achieves "ideal" sorted-order read-cost. Because keys are sorted, moving to the next or previous key is very cheap.
HAMTs and other hash-based storage keep keys in effectively random order. Keys are retrieved by their exact value, and there is no efficient way to find the next or previous key.
Regarding degree, it is normally selected indirectly, by selecting page size. HAMTs are most often used in memory, with pages sized for cache lines, while B-trees are most often used with secondary storage, where page sizes are related to IO and VM parameters.
Log Structured Merge (LSM) is a different approach to sorted order storage which achieves more optimal write-efficiency, by trading off some read efficiency. That hit to read efficiency can be a problem for read-modify-write workloads, but the fewer uncached reads there are, the more LSM provides better overall throughput vs B-tree - at the cost of higher worst case read latency.
LSM also offers the promise of a wider-performance envelope. Putting new data into its proper place is "deferred", offering the possibility to tune read-to-write efficiency by controlling the proportion of deferred cleanup work to live work. In theory, an ideal-LSM with zero-deferral is a B-tree and with 100%-deferral is a log.
However, LSM is more of a "family" of algorithms than a specific algorithm like a B-tree. Their usage is growing in popularity, but it is hindered by the lack of a de-facto optimal LSM design. LevelDB/RocksDB is one of the more practical LSM implementations, but it is far from optimal.
Another approach to achieving write-throughput efficiency is to write-optimize B-trees through write-deferral, while attempting to maintain their optimal read-throughput.
Fractal-trees, shuttle-trees, stratified-trees are this type of design, and represent a hybrid gray area between B-tree and LSM. Rather than deferring writes to an offline process, they amortize write-deferral in a fixed way. For example, such a design might represent a fixed 60%-write-deferral fraction. This means they can't achieve the 100% write-deferral performance of an LSM, but they also have a more predictable read-performance, making them more practical drop-in replacements for B-trees. (As in the commercial Tokutek MySQL and MongoDB fractal-tree backends)

B-trees are ordered by their key, while in a hash map similar keys have very different hash values and so are stored far from each other. Now think of a query that does a range scan, such as "give me yesterday's sales": with a hash map you have to scan the whole map to find them, whereas with a B-tree on the sales_dtm column you'll find them nicely clustered and you know exactly where to start and stop reading.
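The same contrast shows up with in-memory containers. In the sketch below std::map (an ordered tree, not a B-tree) stands in for the ordered index and std::unordered_map for the hash index; the keys are hypothetical integer timestamps and the function names are made up.

```cpp
#include <cassert>
#include <map>
#include <unordered_map>
#include <vector>

// Ordered container: jump to the first key >= from, stop at the first key >= to.
std::vector<int> range_ordered(const std::map<long, int>& sales, long from, long to) {
    assert(from <= to);
    std::vector<int> out;
    for (auto it = sales.lower_bound(from); it != sales.end() && it->first < to; ++it) {
        out.push_back(it->second);
    }
    return out;
}

// Hash container: every entry has to be inspected, and results come back unordered.
std::vector<int> range_hashed(const std::unordered_map<long, int>& sales, long from, long to) {
    assert(from <= to);
    std::vector<int> out;
    for (const auto& entry : sales) {
        if (entry.first >= from && entry.first < to) out.push_back(entry.second);
    }
    return out;
}
```

The ordered version touches only the matching entries plus an O(log n) descent to the start of the range; the hashed version always visits all n entries.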

std::map vs. self-written std::vector based dictionary

I'm building a content storage system for my game engine and I'm looking at possible alternatives for storing the data. Since this is a game, it's obvious that performance is important. Especially considering various entities in the engine will be requesting resources from the data structures of the content manager upon their creation. I'd like to be able to search resources by a name instead of an index number, so a dictionary of some sort would be appropriate.
What are the pros and cons to using an std::map and to creating my own dictionary class based on std::vector? Are there any speed differences (if so, where will performance take a hit? I.e. appending vs. accessing) and is there any point in taking the time to writing my own class?
For some background on what needs to happen:
Writing to the data structures occurs only once, when the engine loads, so no writing actually occurs during gameplay. When the engine exits, these data structures are to be cleaned up. Reading from them can occur at any time, whenever an entity is created or a map is swapped. There can be as few as one entity being created at a time, or as many as 20, each needing a variable number of resources. Resource size can also vary depending on the size of the file being read in at the start of the engine, images being the smallest and music being the largest depending on the format (.ogg or .midi).

Map: std::map has guaranteed logarithmic lookup complexity. It's usually implemented by experts and will be of high quality (e.g. exception safety). You can use custom allocators for custom memory requirements.
Your solution: It'll be written by you. A vector is for contiguous storage with random access by position, so how will you implement lookup by value? Can you do it with guaranteed logarithmic complexity or better? Do you have specific memory requirements? Are you sure you can implement the lookup algorithm correctly and efficiently?
3rd option: If your key type is string (or something that's expensive to compare), do also consider std::unordered_map, which has constant-time lookup by value in typical situations (but not quite guaranteed).

If you want the speed guarantee of std::map as well as the low memory usage of std::vector you could put your data in a std::vector, std::sort it and then use std::lower_bound to find the elements.
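A minimal sketch of that approach, assuming resources are looked up by a string name and identified by an int (both assumptions; Entry, finalize and find_resource are hypothetical names). It fits the load-once, read-many pattern described in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // name -> hypothetical resource id

// Sort once after loading; pairs compare by name first.
void finalize(std::vector<Entry>& entries) {
    std::sort(entries.begin(), entries.end());
}

// O(log n) lookup by name; returns nullptr if the name is not present.
const int* find_resource(const std::vector<Entry>& sorted_entries, const std::string& name) {
    // Debug-only O(n) check that finalize() was called before the first lookup.
    assert(std::is_sorted(sorted_entries.begin(), sorted_entries.end()));
    auto it = std::lower_bound(
        sorted_entries.begin(), sorted_entries.end(), name,
        [](const Entry& entry, const std::string& key) { return entry.first < key; });
    if (it == sorted_entries.end() || it->first != name) return nullptr;
    return &it->second;
}
```

Inserting after the sort would require re-sorting or a std::lower_bound plus insert, which is O(n) per insertion, so this suits data that is written once.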

std::map is written with performance in mind. While it does have some overhead, because it is generalized to all circumstances, it will probably end up more efficient than your own implementation anyway. It is typically implemented as a red-black tree, giving lookup, insertion and removal O(log n) complexity (copying and iterating are O(n), for obvious reasons).
How often will you be reading/writing to the map, and how long will each element be in it? Also, you have to consider how often will you need to resize etc. Each of these questions is crucial to choosing the correct data structure for your implementation.
Overall, one of the std containers will probably be what you want, unless you need functionality that none of them offers on its own, or you have an idea that could improve on their time complexities.
EDIT: Based on your update, I would agree with Kerrek SB that if you're using C++0x, then std::unordered_map would be a good data structure to use in this case. However, bear in mind that your performance can degrade to linear time complexity if you have colliding hashes (this cannot happen with std::map), as it will store the colliding pairs in the same bucket. While this is rare, the probability of it obviously increases with the number of elements. So if you're writing a huge game, it's possible that std::unordered_map could become less optimal than std::map. Just a consideration. :)

How to implement Radix sort on multi-GPU?

How do I implement radix sort on multiple GPUs? The same way as on a single GPU, i.e. by splitting the data, building histograms on separate GPUs, and then merging the data back (like a bunch of cards)?

That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.
A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
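The split, sort, merge strategy can be illustrated with a CPU-only analogue. This is not GPU code: each std::sort call merely stands in for a sort that would run on its own device, and chunked_sort is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts `data` by sorting `parts` independent chunks and merging them afterwards.
void chunked_sort(std::vector<std::uint32_t>& data, std::size_t parts) {
    assert(parts > 0);
    const std::size_t n = data.size();

    // Chunk i covers [bounds[i], bounds[i + 1]).
    std::vector<std::ptrdiff_t> bounds;
    for (std::size_t i = 0; i <= parts; ++i) {
        bounds.push_back(static_cast<std::ptrdiff_t>(n * i / parts));
    }

    // Phase 1: independent sorts; no data is exchanged between chunks.
    for (std::size_t i = 0; i < parts; ++i) {
        std::sort(data.begin() + bounds[i], data.begin() + bounds[i + 1]);
    }

    // Phase 2: merge the sorted chunks into one sorted sequence.
    for (std::size_t i = 1; i < parts; ++i) {
        std::inplace_merge(data.begin(), data.begin() + bounds[i], data.begin() + bounds[i + 1]);
    }
}
```

Only phase 2 needs data from more than one chunk, which mirrors the single inter-GPU transfer in the strategy above.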

Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.
Sometimes it's possible to compress the sorted list by storing elements in groups sharing the same common prefix, or you can unique elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
The general principle is that you want to make as few passes over the data as possible, and that your throughput will almost always be determined by the bandwidth constraints of your storage policy.
If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.
I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.
I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.
Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2GB of memory and 60GB/s write bandwidth (that's what my mid-range card is showing). A three pass radix sort (11-bit histograms) requires 6GB of write bandwidth (likely your rate limiting factor), or about 100ms to sort a 2GB list of 32-bit integers. Great, they're sorted, now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.
In any case, just compiled my first example programs today. There's still a lot to learn. My target application is permutation intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in future.
