Sparse vector in C++? [closed]

I have some code, using the class vector, which I want to reimplement with a vector class that implements sparse vectors (i.e. instead of recording all the elements in an array of the vector's length, including the 0's, it would only record the non-zero entries in a look-up table).
Is there any sparse vector class in C++ that makes use of the same interface that vector does? (that will make refactoring much easier.)

Brendan is right to observe that logically a vector provides a map from index to value. An std::vector accomplishes this mapping with a simple array. But there are other options.
std::unordered_map has amortized O(1) time operations and seems like a natural alternative.
std::map has O(log n) operations, but the constants are smaller and it has more years of optimization behind it. In practice it may be faster, depending on your application.
SparseHash has an STL-compatible hashtable implementation that claims to be better than a standard unordered_map.
C++ BTree again offers an STL-compatible map, but one that uses B-trees instead of binary trees. They claim significantly improved memory usage (50-80%) and time.
BitMagic offers an implementation of a sparse bit vector. Think a sparse std::bitset. If this fits your needs it offers really significant improvements over all the other approaches.
Finally the classical approach to a sparse vector is to use two vectors, one for the index and one for the values. So you have an std::vector<uint> ind; and a std::vector<value_type> val;.
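As an illustration, here is a minimal sketch of that two-vector representation (the class name, the sorted-index invariant, and the binary-search lookup are my own choices, not a standard design):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
class sparse_vector {
    std::vector<std::uint32_t> ind; // indices of non-zero elements, kept sorted
    std::vector<T> val;             // corresponding values, parallel to ind
    std::size_t logical_size;       // length including the implicit zeros

public:
    explicit sparse_vector(std::size_t n) : logical_size(n) {}

    std::size_t size() const { return logical_size; }

    // Read access: the stored value, or a default-constructed T for a "zero".
    T get(std::uint32_t i) const {
        auto it = std::lower_bound(ind.begin(), ind.end(), i);
        return (it != ind.end() && *it == i) ? val[it - ind.begin()] : T{};
    }

    void set(std::uint32_t i, const T& v) {
        auto it = std::lower_bound(ind.begin(), ind.end(), i);
        auto pos = it - ind.begin();
        if (it != ind.end() && *it == i) {
            val[pos] = v;                      // overwrite an existing entry
        } else {
            ind.insert(it, i);                 // O(n) insert keeps ind sorted
            val.insert(val.begin() + pos, v);
        }
    }
};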
None of these have exactly the same interface as a std::vector, but the differences are pretty small and you could easily write a small wrapper. For example, for the map classes you would want to keep track of the size and overload size() to return that number instead of the number of non-empty elements. Indeed, Boost's mapped_vector that Brendan links to does exactly this: it wraps a map-like class in a vector-like interface.
A drop-in replacement that works in all cases is impossible (because a std::vector is in nearly all cases assumed to degenerate into an array, e.g. &vector[0], and this is often relied upon). Also, most users who are interested in the sparse case are also interested in taking advantage of the sparsity, and hence need it exposed. For example, your sparse vector's iterator would have to iterate over all elements, including the empties, which is simply wasteful. The whole point of a sparse structure is to skip all that. If your algorithms can't handle that, then you leave a lot of potential gains on the table.

Boost has a sparse vector. I don't think one exists in the standard library.
std::unordered_map is probably a better choice in the long run, though, unless you're already using Boost. The main annoyance in refactoring will be that size() means something different for a map than for a sparse array. Range-based for loops should make that easier to deal with, though.
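For example, a minimal sketch of that size() mismatch (the names and numbers are hypothetical): the map's size() counts the stored entries rather than the logical length, so the logical length must be tracked separately, while a range-based for conveniently visits only the non-zero entries:

#include <cstddef>
#include <iostream>
#include <unordered_map>

int main() {
    std::size_t logical_size = 1000;            // length of the conceptual vector
    std::unordered_map<std::size_t, double> v;  // stores only non-zero entries
    v[3] = 1.5;
    v[997] = -2.0;

    std::cout << v.size() << '\n';              // prints 2, not 1000
    for (const auto& [index, value] : v)        // visits only the 2 stored entries
        std::cout << index << " -> " << value << '\n';
}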

Related

What are the time complexities for size?

I am studying the complexity of various operations of the different STL containers. Through a different question on this site I have found this chart.
website link
One operation I noticed was missing from this chart was the size operation.
I would suppose that if one knew the complexity of .begin and .end one could also calculate the complexity for size. But those are also missing.
I have found an answer similar to what I am looking for in this question, but that one is for Java, so it does not cover all the STL containers, and it only gives the big-O of size for a few of the given datatypes.
Does anyone know the complexity of the .size operation for the various containers, or could someone give me a pointer as to where I could find these complexities? Any help would be greatly appreciated.
Also, if my question is wrongly phrased and/or off-topic, do not hesitate to suggest an edit.
Since C++11, the complexity of the size member function is constant for all standard containers.
std::forward_list, which is an implementation of the singly linked list data structure, does not provide a size member function. Its size can be calculated in linear time using the iterators.
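For example, a minimal sketch using std::distance:

#include <forward_list>
#include <iostream>
#include <iterator>

int main() {
    std::forward_list<int> fl{1, 2, 3, 4};
    // forward_list has no size(); walk the whole list instead -- O(n).
    auto n = std::distance(fl.begin(), fl.end());
    std::cout << n << '\n'; // prints 4
}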
Aside from the standard C++ containers, any data structure can be augmented with a separately stored size variable to achieve such constant complexity, at the cost of a small constant overhead on insert and delete operations. An array is special in that it does not require any additional overhead, assuming an iterator to its end is stored.

List of arrays vs list [closed]

A list uses a lot of memory, since it adds a pointer to each node, and it is not contiguous; the memory is fragmented. A list of arrays, in my opinion, is a lot better. For example, if I am managing 100 objects, a list of 5 arrays of 20 is a lot better than a list of 100 nodes: only 5 pointers added vs. 100 pointers, we gain locality within each array, and we have less fragmentation.
I did some research on this, but I can't find any interesting article about it, so I thought I must be missing something.
What can be the benefit of using a list over a list of arrays?
EDIT: This is definitely not array vs. list... It is more like: why put only one element per list node if it's possible to put more?
I think this is a valid question, as memory layout might affect the performance of your program. You can try std::deque, as some suggested. However, the statement that a typical implementation uses chunks of memory is a general statement about implementations, not the standard. It is therefore not guaranteed to be so.
In C++, you can improve the locality of your data through custom allocators and memory pools. Every STL container takes an allocator as a parameter. The default allocator is probably a simple wrapper around new and delete, but you can supply your own allocator that uses a memory pool. You can find more about allocators here, and here is a link to the C++ default allocator. Here is a Dr. Dobb's article about this topic. A quote from the article:
If you're using node-based containers (such as maps, sets, or lists), allocators optimized for smaller objects are probably a better way to go.
Only profiling will tell you what works best for your case. Maybe the combination of std::deque and a custom allocator will be the best option.
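As a minimal sketch of the memory-pool idea, here is one possible approach using the C++17 std::pmr facilities (the buffer size is arbitrary): the list's nodes are carved out of a single buffer instead of being individually heap-allocated, which improves their locality.

#include <cstddef>
#include <list>
#include <memory_resource>

int main() {
    // One upfront buffer; node allocations are bump-allocated from it,
    // keeping the nodes close together in memory.
    std::byte buffer[4096];
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof buffer);

    std::pmr::list<int> lst(&pool); // a std::list with a pool-backed allocator
    for (int i = 0; i < 100; ++i)
        lst.push_back(i);
} // the pool releases everything at once when it goes out of scope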
Some operations have guaranteed constant-time complexity with std::list and not with std::vector or std::deque. For example, with std::list you can move a subsequence ("splice") of 5000 consecutive elements from the middle of a list with a million items to the beginning (of the same list only!) in constant time. No way to do that with either std::vector or std::deque, and even for your list-of-arrays, it will be a fair bit more complicated.
This is a good read: Complexity of std::list::splice and other list containers
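A minimal sketch of that constant-time splice (moving a range within the same list):

#include <iostream>
#include <iterator>
#include <list>

int main() {
    std::list<int> l{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

    // Move the elements 5, 6, 7, 8 to the front of the same list.
    // Within one list this is O(1): nodes are relinked, nothing is
    // copied or moved.
    auto first = std::next(l.begin(), 5);
    auto last  = std::next(l.begin(), 9);
    l.splice(l.begin(), l, first, last);

    for (int x : l) std::cout << x << ' '; // 5 6 7 8 0 1 2 3 4 9
}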

Why prefer std::vector over std::deque? [duplicate]

They both have access complexity of O(1) and random insertion/removal complexity of O(n). But vector costs more when expanding because of reallocation and copy, while deque does not have this issue.
It seems deque has better performance, so why do most people use vector instead of deque?
why do most people use vector instead of deque?
Because this is what they have been taught.
vector and deque serve slightly different purposes. They can both be used as a simple container of objects, if that's all you need. When learning to program in C++, that is all most people need -- a bucket to drop stuff into, get stuff out of, and walk over.
When StackOverflow is asked a question like "which container should I use by default," the answer is almost invariably vector. The question is generally asked from the context of learning to program in C++, and at the point where a programmer is asking such a question, they don't yet know what they don't know. And there's a lot they don't yet know. So, we (StackOverflow) need a container that fits almost every need for better or worse, can be used in almost any context, and doesn't require that the programmer has asked all the right questions before landing on something approximating the correct answer. Furthermore, the Standard specifically recommends using vector. vector isn't best for all uses, and in fact deque is better than vector for many common uses -- but it's not so much better for a learning programmer that we should vary from the Standard's advice to newbie C++ programmers, so StackOverflow lands on vector.
After having learned the basics of the syntax and, shall we say, the strategies behind programming in C++, programmers split into two branches: those who care to learn more and write better programs, and those who don't. Those who don't will stick with vector forever. I think many programmers fall into this camp.
The rarer programmers who try to move beyond this phase start asking other questions -- questions like you've asked here. They know there is lots they don't yet know, and they want to start discovering what those things are. They will quickly (or less quickly) discover that when choosing between vector and deque, some questions they didn't think to ask before are:
Do I need the memory to be contiguous?
Do I need to avoid lots of reallocations?
Do I need to keep valid iterators after insertions?
Do I need my collection to be compatible with some ancient C-like function?
Then they really start thinking about the code they are writing, discover yet more stuff they don't know, and the beat goes on...
From C++ standard section 23.1.1:
vector is the type of sequence that should be used by default... deque is the data structure of choice when most insertions and deletions take place at the beginning or at the end of the sequence.
However there are some arguments in the opposite direction.
In theory vector is at least as efficient as deque, as it provides a subset of its functionality. If your task only needs what vector's interface provides, prefer vector - it cannot be worse than a deque.
But vector costs more when expanding because of reallocation and copy
While it's true that vector sometimes has to reallocate its array as it grows, it will grow exponentially so that the amortised complexity is still O(1). Often, you can avoid reallocations by judicious use of reserve().
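A minimal sketch of that reserve() idiom (the element count is arbitrary):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    v.reserve(1000); // one allocation up front

    for (int i = 0; i < 1000; ++i)
        v.push_back(i); // no reallocations: size never exceeds capacity

    std::cout << v.capacity() << '\n'; // at least 1000
}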
It seems deque has better performance
There are many aspects to performance; the time taken by push_back is just one. In some applications, a container might be modified rarely, or populated on start-up and then never modified. In cases like that, iteration and access speed might be more important.
vector is the simplest possible container: a contiguous array. That means that iteration and random access can be achieved by simple pointer arithmetic, and accessing an element can be as fast as dereferencing a pointer.
deque has further requirements: it must not move the elements. This means that a typical implementation requires an extra level of indirection - it is generally implemented as something like an array of pointers to arrays. This means that element access requires dereferencing two pointers, which will be slower than one.
Of course, often speed is not a critical issue, and you choose containers according to their behavioural properties rather than their performance. You might choose vector if you need the elements to be contiguous, perhaps to work with an API based on pointers and arrays. You might choose deque or list if you want a guarantee that the elements won't move, so you can store pointers to them.
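For example, the standard guarantees that growing a deque at either end never invalidates references to existing elements (iterators are invalidated, but the elements themselves do not move); a minimal sketch:

#include <deque>
#include <iostream>

int main() {
    std::deque<int> d{10, 20, 30};
    int& middle = d[1]; // reference to an existing element

    for (int i = 0; i < 1000; ++i) {
        d.push_back(i);  // invalidates iterators...
        d.push_front(i); // ...but references to existing elements stay valid
    }

    std::cout << middle << '\n'; // still prints 20
}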
From cplusplus.com:
Therefore they provide a similar functionality as vectors, but with efficient insertion and deletion of elements also at the beginning of the sequence, and not only at its end. But, unlike vectors, deques are not guaranteed to store all its elements in contiguous storage locations, thus not allowing direct access by offsetting pointers to elements.
Personally, I prefer using deque (I always end up spoiling myself and having to use push_front for some reason or other), but vector does have its uses/differences, with the main one being:
vector has contiguous memory, while a deque allocates via pages/chunks.
Note, the pages/chunks are fairly useful: constant-time insert/erase at the front of the container. It is also typical that a large amount of memory broken up into a series of smaller blocks is more efficient to allocate than a single contiguous block.
You could also argue that, because deque is 'missing' size reservation methods (capacity/reserve), you have less to worry about.
I highly suggest you read Sutter's GotW on the topic.

Practical summary/reference of C++11 containers/adapters properties? [closed]

I'm looking for a comprehensive summary/reference of important properties of the various C++11 standard containers and container adapters (optionally also including boost/Qt), but indexed by those properties rather than the usual per container documentation (more on that below).
The properties I have in mind include, off the top of my head:
Insertion capabilities (front / back / arbitrary).
Removal capabilities (front / back / arbitrary).
Access capabilities (front / back / uni/bi-directional traversal / random access).
Complexity of the aforementioned operations, and iterator invalidation rules.
Uniqueness? Ordered? Associative? Contiguous storage? Reservation ahead of time?
I may have forgotten some, in which case don't hesitate to comment/edit.
The goal is to use that document as an aid to choose the right container/adapter for the right job, without having to wade through the various individual documentations over and over every time (I have a terrible memory).
Ideally it should be indexed both by property and by container type (e.g. table-like) to allow for decision-making as well as for quick reference of the basic constraints. But really the per-property indexes are the most important for me, since those are the most painful to search for in the documentation.
I'd be very surprised if nobody had already produced such a document, but my Search-fu is failing me on this one.
NOTE: I'm not asking you to summarize all this information (I'll do that myself if I really have to, in which case I'll post the result here) but only whether you happen to know an existing document that fits those requirements. Something like this is a good start, but as you can see it still lacks much of the information I'd like to have, since it's restricted to member functions.
Thanks for your attention.
I am not aware of a single document that provides everything you need, but most of it has been catalogued somewhere.
This reference site has a large table with all the member functions of all the containers
This SO question has a large table of the complexity guarantees.
This SO question gives you a decision tree to choose between containers.
The complexity requirements for container member functions are not too hard to memorize, since there are only 4 categories: (amortized) O(1), O(log N), O(N), and O(N log N) (the last only for the member function std::list::sort(), which really crosses into the algorithms domain of the Standard Library), so if you want you could make a 4-color-coded version of the cppreference container table.
Choosing the right container can be as simple as always using std::vector unless your profiler indicates a bottleneck. After you reach that point, you have to make hard tradeoffs between space / time complexity, data locality, ease of lookup vs ease of insertion / modification, vs extra invariants (sortedness, uniqueness, iterator invalidation rules).
The hardest part is that you have to balance your containers (space requirements) against the algorithms that you are using (time requirements). Containers can maintain invariants (e.g. std::map is sorted on its keys) that other containers can only mimic using algorithms (e.g. std::vector with std::sort, but without the same insertion complexity). So after you finish the container table, make sure to do something similar for the algorithms!
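As an illustration of that trade-off, here is a minimal sketch of a std::vector kept sorted and searched with the standard binary-search algorithms -- it mimics std::map's O(log N) lookup, but pays O(N) per insertion:

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 4, 9, 16, 25}; // invariant: kept sorted

    // O(log N) lookup, comparable to std::map::find.
    std::cout << std::boolalpha
              << std::binary_search(v.begin(), v.end(), 9) << '\n'; // true

    // Insertion preserving the invariant: O(log N) to find the spot,
    // but O(N) to shift elements -- the price paid vs. std::map.
    auto pos = std::lower_bound(v.begin(), v.end(), 10);
    v.insert(pos, 10);
}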
Finally, no container summary would be complete without mentioning Boost.MultiIndex: because sometimes you don't have to choose!

Super high performance C/C++ hash map (table, dictionary) [closed]

I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.
My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
I would recommend you try Google SparseHash (or the C++11 version, Google SparseHash-c11) and see if it suits your needs. They have a memory-efficient implementation as well as one optimized for speed.
I did a benchmark a long time ago, and it was the best hashtable implementation available in terms of speed (however, with drawbacks).
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
Check out the LGPL'd Judy arrays. I've never used them myself, but they were advertised to me on a few occasions.
You can also try benchmarking the STL containers (std::unordered_map, etc.). Depending on the platform/implementation and source code tuning (preallocate as much as you can; dynamic memory management is expensive), they could be performant enough.
Also, if performance of the final solution trumps the cost of the solution, you can try to order the system with sufficient RAM to put everything into plain arrays. Performance of access by index is unbeatable.
The add/delete operations are much (100x) more frequent than the get operation.
That hints that you might want to concentrate on improving algorithms first. If data are only written, not read, then why write them at all?
Just use boost::unordered_map (or tr1 etc.) by default. Then profile your code and see if that code is the bottleneck. Only then would I suggest precisely analyzing your requirements to find a faster substitute.
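In that spirit, a minimal sketch of the profile-first baseline, shown here with std::unordered_map (the capacity and load factor are arbitrary illustration values, not recommendations):

#include <unordered_map>

struct Value { double a, b; };

int main() {
    std::unordered_map<int, Value> m;

    // Tune before the churn starts: set the load factor first, then
    // pre-size the bucket array so constant add/delete traffic does
    // not trigger rehashes.
    m.max_load_factor(0.5f); // fewer collisions, more memory
    m.reserve(4096);         // buckets for ~4096 entries

    m[42] = {1.0, 2.0}; // add
    m.erase(42);        // delete
}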
If you have a multithreaded program, you can find some useful hash tables in the Intel Threading Building Blocks (TBB) library. For example, tbb::concurrent_unordered_map has the same API as std::unordered_map, but its main functions are thread-safe.
Also have a look at Facebook's folly library; it has a high-performance concurrent hash table and skip list.
khash is very efficient. There is the author's detailed benchmark: https://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/ and it also shows khash beating many other hash libraries.
From the Android sources (thus Apache 2 licensed):
https://github.com/CyanogenMod/android_system_core/tree/ics/libcutils
Look at hashmap.c and include/cutils/hashmap.h. If you don't need thread safety, you can remove the mutex code; a sample implementation is in libcutils/str_parms.c.
First check if existing solutions like libmemcache fit your needs.
If not ...
Hash maps seem to be the definite answer to your requirement. They provide O(1) lookup based on the key. Most STL libraries provide some sort of hash map these days, so use the one provided by your platform.
Once that part is done, you have to test the solution to see if the default hashing algorithm is good enough performance-wise for your needs.
If it is not, you should explore some of the good, fast hashing algorithms found on the net:
good old prime number multiply algo
http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/
http://code.google.com/p/google-sparsehash/
If this is not good enough, you could roll a hashing module yourself that fixes the problems you saw with the STL containers you tested, using one of the hashing algorithms above. Be sure to post the results somewhere.
Oh, and it's interesting that you have multiple maps... perhaps you can simplify by having your key as a 64-bit number, with the high bits used to distinguish which map it belongs to, and adding all key/value pairs to one giant hash. I have seen hashes with a hundred thousand or so symbols working perfectly well on the basic prime number hashing algorithm.
You can check how that solution performs compared to hundreds of maps... I think that could be better from a memory profiling point of view; please do post the results somewhere if you do get to do this exercise.
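A minimal sketch of that composite-key idea (the packing scheme assumes the map ID and the per-map key each fit in 32 bits):

#include <cstdint>
#include <unordered_map>

// Pack (map_id, key) into one 64-bit key: the high 32 bits select the map.
std::uint64_t make_key(std::uint32_t map_id, std::uint32_t key) {
    return (static_cast<std::uint64_t>(map_id) << 32) | key;
}

int main() {
    std::unordered_map<std::uint64_t, double> giant; // replaces hundreds of maps

    giant[make_key(7, 1234)] = 3.14; // "map 7, key 1234"
    giant.erase(make_key(7, 1234));
}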
I believe that, more than the hashing algorithm, it could be the constant add/delete of memory (can it be avoided?) and the CPU cache usage profile that are more crucial for the performance of your application.
Good luck.
Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.
I would suggest uthash. Just #include "uthash.h", then add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. A word about performance here.
http://incise.org/hash-table-benchmarks.html shows that gcc has a very, very good implementation. However, mind that it must respect a very bad standard decision:
If a rehash happens, all iterators are invalidated, but references and pointers to individual elements remain valid. If no actual rehash happens, no changes.
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
This basically means the standard says that the implementation must be based on linked lists. It prevents open addressing, which has better performance.
I think Google's sparse hash uses open addressing, though in these benchmarks only the dense version outperforms the competition.
However, the sparse version outperforms all the competition in memory usage. (Also, it doesn't have any plateau -- a pure straight line with respect to the number of elements.)