Containing single data in multiple containers - c++

I have a lot of data to acquire and process (nearly a million elements), and I don't want to copy or move it around throughout the program.
Let me describe the situation with an example. I have a vector with 100,000 elements, and I want to keep track of the time at which each element was inserted into the vector. So it seems like a good idea to keep both the time and the data in a map. However, I still want to use the vector.
Is there any way to make the mapped values of the map refer to the vector's elements without wasting resources unnecessarily?
The first thing that comes to mind is storing the addresses of the vector's elements. However, pointers take 4 bytes (I'm not sure), so if, for example, we store the address of a char, the pointer is 4 times bigger than the data itself.
Any ideas?

I'd say it's not solely a question of memory consumption, but also of consistency. It depends on how you're going to use the various views on your original input data. In general I'd advise using std::shared_ptr for the original data and std::weak_ptr for references in views (std::weak_ptr can only observe objects owned by a std::shared_ptr, not by a std::unique_ptr).
But you're right that this might have a certain memory overhead when the pointers are larger than the objects they point to.
For the latter case, some kind of Flyweight pattern implementation might be more appropriate.

Containing single data in multiple containers
Yes, you can use the Boost Multi-index Containers Library to index the same data in more than one way with no duplication of content. Unlike homebrew XXX_ptr solutions, a multi-index container also takes care of keeping everything coherent (removing an element from the container automatically removes it from all indexes).
Lighter-weight, more specialized solutions (possibly more efficient than either multi-index containers or homebrew XXX_ptr solutions) may also be possible, depending on your application's requirements, particularities and data insertion/lifecycle patterns:
Do you need the memory layout of your original vector to remain unchanged, or can you accommodate a vector of some derived type?
Will your vector's contents change? Can it grow?
Will the elements therein be (inserted & kept) in chronological order anyway?
Do you need to look up a vector element by insertion time in addition to vector position?
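As one possible answer to the questions above: if elements are only ever appended, in chronological order, a parallel vector of timestamps can stand in for the map. The sketch below is an illustration under that assumption; the name TimedChars and its members are hypothetical, not part of any library.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: the data stays in a plain std::vector<char>, and a
// second vector holds one timestamp per element at the same index.
// Assumes elements are only appended, in chronological order.
struct TimedChars {
    using Clock = std::chrono::steady_clock;

    std::vector<char> data;                // the original vector, layout unchanged
    std::vector<Clock::time_point> times;  // times[i] is when data[i] was inserted

    void push_back(char c, Clock::time_point t = Clock::now()) {
        data.push_back(c);
        times.push_back(t);
    }

    // Index of the first element inserted at or after t (data.size() if none).
    // Binary search is valid because times is sorted by construction.
    std::size_t first_at_or_after(Clock::time_point t) const {
        auto it = std::lower_bound(times.begin(), times.end(), t);
        return static_cast<std::size_t>(it - times.begin());
    }
};
```

No pointer is stored per element here; the shared index ties the two vectors together. The timestamp itself is still larger than a char, so this only removes the pointer overhead, not the cost of recording the time.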

Related

Pointers or Indexes?

I have a network-like data structure, composed by nodes linked together.
The nodes, whose number will change, will be stored in a std::vector<Node> in no particular order, where Node is an appropriate class.
I want to keep track of the links between nodes. Again, the number of these links will change, and I was thinking about using again a std::vector<Link>. The Link class has to contain the information about the two nodes it's connecting, as well as other link features.
Should Link contain
two pointers to the two nodes?
two integers, to be used as an indexes for the std::vector<Node>?
or should I adopt a different system (why?)
The first approach, although probably better, is problematic, as the pointers will have to be regenerated every time I add or remove nodes from the network. On the other hand, it would free me from, e.g., having to store the nodes in a random-access container.
This is difficult to answer in general. There are various performance and ease-of-use trade-offs.
Using pointers can provide a more convenient usage for some operations. For example
link.first->value
vs.
nodes[link.first].value
Using pointers may provide better or worse performance than indices. This depends on various factors. You would need to measure to determine which is better in your case.
Using indices can save space if you can guarantee that there are only a certain number of nodes. You can then use a smaller data type for the indices, whereas with pointers you always need to use the full pointer size no matter how many nodes you have. Using a smaller data type can have a performance benefit by allowing more links to fit within a single cache line.
Copying the network data structure will be easier with indices, since you don't have to recreate the link pointers.
Having pointers to elements of a std::vector can be error-prone, since the vector may move the elements to another place in memory after an insert.
Using indices will allow you to do bounds checking, which may make it easier to find some bugs.
Using indices makes serialization more straightforward.
All that being said, I often find indices to be the best choice overall. Many of the syntactical inconveniences of indices can be overcome by using convenience methods, and you can switch the indices to pointers during certain operations where pointers have better performance.
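To illustrate the index-based approach with convenience methods, here is a minimal sketch. The Network class and its member names are made up for this example, and the 32-bit index type assumes the network never holds more than about four billion nodes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
    int value;
};

// Links refer to nodes by 32-bit index, which is half the size of a pointer
// on a typical 64-bit platform.
struct Link {
    std::uint32_t first;
    std::uint32_t second;
};

// Hypothetical container type, named Network for illustration only.
class Network {
public:
    std::uint32_t add_node(int value) {
        nodes_.push_back(Node{value});
        return static_cast<std::uint32_t>(nodes_.size() - 1);
    }

    std::size_t add_link(std::uint32_t a, std::uint32_t b) {
        links_.push_back(Link{a, b});
        return links_.size() - 1;
    }

    // Convenience accessors hide the nodes[link.first] indirection.
    // at() is bounds-checked and throws std::out_of_range on a bad index.
    Node& first(const Link& l) { return nodes_.at(l.first); }
    Node& second(const Link& l) { return nodes_.at(l.second); }
    const Link& link(std::size_t i) const { return links_.at(i); }

private:
    std::vector<Node> nodes_;
    std::vector<Link> links_;
};
```

Because the links hold indices rather than addresses, the compiler-generated copy of Network is already a correct deep copy. Removing a node from the middle of the vector would still shift later indices, so a real implementation needs a policy for that.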
Specify the interface for the class you want to use or create. Write unit tests. Do the most simple thing to fulfill the unit tests.
So it depends on the interface of the class. For example, if a Link doesn't expose information about the nodes, then it doesn't really matter which approach you choose. On the other hand, if you go for pointers, consider std::shared_ptr.
I would add a (or a number of) link pointer to your Node class and then hand maintain the links. This will save you having to use an additional container.
If you are looking for something a bit more structured you can try using Boost Intrusive. This effectively does the same thing in a more generalized fashion.
You can avoid the Link class altogether if you use:
struct Node
{
std::vector<Node*> parents;
std::vector<Node*> children;
};
With this approach,
You avoid creating another class.
Your memory requirements are reduced.
You have to make fewer pointer traversals to traverse the network of Nodes.
Downside. You have to make sure that:
When creating or removing a link you have to update two objects.
When you delete a Node, you have to remove pointers to it from its parents and children.
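The bookkeeping described in the downsides can be wrapped in a few free functions. This is only a sketch; add_link, remove_link and detach are hypothetical names, and ownership of the nodes is left to the caller.

```cpp
#include <algorithm>
#include <vector>

struct Node {
    std::vector<Node*> parents;
    std::vector<Node*> children;
};

// Removes every occurrence of p from v.
inline void erase_ptr(std::vector<Node*>& v, Node* p) {
    v.erase(std::remove(v.begin(), v.end(), p), v.end());
}

// Each link is recorded in both endpoints.
inline void add_link(Node& parent, Node& child) {
    parent.children.push_back(&child);
    child.parents.push_back(&parent);
}

inline void remove_link(Node& parent, Node& child) {
    erase_ptr(parent.children, &child);
    erase_ptr(child.parents, &parent);
}

// Call before destroying a node so that no neighbour keeps a dangling pointer.
inline void detach(Node& n) {
    for (Node* p : n.parents)  erase_ptr(p->children, &n);
    for (Node* c : n.children) erase_ptr(c->parents, &n);
    n.parents.clear();
    n.children.clear();
}
```

Note that the Node objects must not move in memory while linked, so they cannot simply live in a std::vector<Node> that grows.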
You could make it a std::vector<Node *> instead of std::vector<Node> and allocate the nodes with new.
Then:
You can store the pointers to the nodes in the Link class without fear of them becoming invalidated
You can still randomly access them in the nodes vector.
Downside is that you will need to remember to delete them when they are removed from the node list.
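A variation on this idea, swapped in here rather than taken from the answer, is to store std::unique_ptr<Node> instead of raw pointers so that the delete happens automatically. The Graph type below is a hypothetical sketch of that variation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    int value;
};

struct Link {
    Node* first;
    Node* second;
};

// Hypothetical owner type. The vector stores owning pointers, so growing the
// vector moves the pointers but never the Node objects themselves.
struct Graph {
    std::vector<std::unique_ptr<Node>> nodes;
    std::vector<Link> links;

    Node* add_node(int value) {
        nodes.push_back(std::unique_ptr<Node>(new Node{value}));
        return nodes.back().get();
    }
};
```

Erasing an entry from nodes destroys that Node, so any Link still pointing at it has to be removed first.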
My personal experience with vectors in graph-like structures has led me to these guidelines.
Don't store data in vectors, where other classes hold a pointer/reference
You have a graph like data structure. If the code is not performance critical (this is something different to performance sensitive!) you should not consider cache compacting your data structures.
If you don't know how large your graph will be and you keep your Node data in a vector, all iterators and pointers into it are invalidated whenever the vector reallocates. This means you somehow have to regenerate your whole data structure, perhaps by creating a copy of all of it and using DFS or similar to adjust the pointers. The same kind of thing happens if you remove data from the middle of one of your vectors.
If you do know how large your data will be, that size is set in stone, and you'll have huge headaches once you reconsider.
Don't use shared pointers to keep track of what needs to be freed
If you have a graph-like data structure and you delete on performance-critical paths, it's unwise to call delete whenever your algorithm decides it doesn't need the data anymore. One possibility is to keep the data on the heap (if it is performance critical, consider a pool allocator) and either mark objects you no longer need during your performance-critical sections (if you really need to save space, you can consider pointer tagging) or use a simple mark-and-sweep algorithm afterwards to find items that are no longer needed (yes, graph algorithms are one of those cases where Herb Sutter says garbage collection is faster than smart pointers).
Be aware that deferred destruction of objects means that you lose all RAII-like features in your Node classes.

Would I see a performance gain using std::map instead of vector<pair<string, string> >?

I currently have some code where I am using a vector of pair<string,string>. This is used to store some data from XML parsing and as such, the process is quite slow in places. In terms of trying to speed up the entire process I was wondering if there would be any performance advantage in switching from vector<pair<string,string> > to std::map<string,string> ? I could code it up and run a profiler, but I thought I would see if I could get an answer that suggests some obvious performance gain first. I am not required to do any sorting, I simply add items to the vector, then at a later stage iterate over the contents and do some processing - I have no need for sorting or anything of that nature. I am guessing that perhaps I would not get any performance gain, but I have never actually used a std::map before so I don't know without asking or coding it all up.
No. If (as you say) you are simply iterating over the collection, you will see a small (probably not measurable) performance decrease by using a std::map.
Maps are for accessing a value by its key. If you never do this, map is a bad choice for a container.
If you are not modifying your vector<pair<string,string> > and are just iterating over it again and again, you will get a performance degradation by using a map. This is because a typical map is organized as a binary tree of nodes, each of which can be allocated in a different memory block (unless you write your own allocator). Plus, each node of the map holds pointers to neighboring nodes, so there is time and memory overhead, too. On the other hand, search by key is an O(log N) operation. A vector, by contrast, holds its data in one block, so the processor cache usually works better with it. Searching in a vector is an O(N) operation, which is not so good but may be acceptable. Searching in a sorted vector can be improved to O(log N) using functions such as lower_bound.
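A sorted-vector lookup with lower_bound might look like the following sketch. The helper names sort_by_key and find_value are invented for this example, and it assumes all insertions happen before the first lookup.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;

// Sort once by key after all insertions are done.
inline void sort_by_key(std::vector<Entry>& v) {
    std::sort(v.begin(), v.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
}

// Binary search by key; returns nullptr if the key is absent.
// Precondition: v is sorted by key (see sort_by_key).
inline const std::string* find_value(const std::vector<Entry>& v,
                                     const std::string& key) {
    auto it = std::lower_bound(
        v.begin(), v.end(), key,
        [](const Entry& e, const std::string& k) { return e.first < k; });
    if (it != v.end() && it->first == key) return &it->second;
    return nullptr;
}
```

The returned pointer refers into the vector, so it is only valid until the vector is next modified.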
It depends on the operations you perform on this data. If you do many searches, it is probably better to use a hashing container such as unordered_map, since search by key in these containers is an O(1) operation on average. For iterating, as mentioned, a vector is faster.
It may be worth replacing the string type in your pair, but this depends heavily on what you store there and how you access the container.
The answer depends on what you are doing with these data structures and how large they are. If you have thousands of elements in your std::vector<std::pair<std::string, std::string> > and you keep searching by the first member over and over, using a std::map<std::string, std::string> may improve performance (you might want to consider std::unordered_map<std::string, std::string> for this use case instead). If your vectors are relatively small and you aren't inserting elements into the middle too often, using vectors may very well be faster. If you just iterate over the elements, vectors are a lot faster than maps: iteration isn't really one of a map's strengths. Maps are good at looking things up, assuming the number of elements isn't really small, because otherwise a linear search over a vector is still faster.
The best way to determine where the time is spent is to profile the code: it is often not entirely clear up front where the time goes. Frequently, the suspected hot spots are actually unproblematic and other areas show unexpected performance problems. For example, you might be passing your objects by value rather than by reference at some obscure place.
If your usage pattern is such that you perform many insertions before performing any lookups, then you might benefit from implementing a "lazy" map where the elements are sorted on demand (i.e. when you acquire an iterator, perform a lookup, etc).
As the C++ standard specifies, std::vector stores its items in contiguous memory. It first allocates a memory block with some initial capacity; then, when you insert a new item, it checks whether it has room left, and if not, it allocates a new, larger buffer, copy- or move-constructs all items into the new buffer, destroys the old items, and frees the old buffer.
When you start inserting items into a vector and you have a lot of items, you suffer from many reallocations, copy constructions and destructor calls.
To solve this problem, if you know the number of input items (not exactly, but its usual magnitude), you can reserve memory for the vector up front and avoid the reallocations.
If you have no idea about the size, you can use a collection like std::list, which never relocates its elements.
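The effect of reserve can be observed by watching capacity() while appending. The function below is a small sketch written for this purpose; count_reallocations and its expected_count parameter are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Counts how many times the vector's capacity changes while appending n items.
// expected_count is a caller-supplied estimate; pass 0 to skip the reserve call.
inline std::size_t count_reallocations(std::size_t n, std::size_t expected_count) {
    std::vector<std::pair<std::string, std::string>> v;
    if (expected_count > 0) v.reserve(expected_count);

    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.emplace_back("key", "value");
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

When the estimate is at least n, the count is zero. Without a reserve call, the exact count depends on the growth factor of the standard library in use.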

Dynamic size of array in c++?

I am confused. I don't know which container I should use. Let me tell you what I need first. Basically I need a container that can store X objects (the number of objects is unknown; it could be anywhere from 1 to 50k).
I have read a lot. Over here, array vs list, it says an array needs to be resized if the number of objects is unknown (I am not sure how to resize an array in C++). It also states that with a linked list, searching for a certain item means looping (iterating) from first to last (or vice versa), while with an array you can ask for "the array object at index".
Then I looked at other solutions: map, vector, etc. Like this one: array vs vector. Some responders say never to use an array.
I am new to C++; I have only used array, vector, list and map before. Now, for my case, what kind of container would you recommend? Let me rephrase my requirements:
Need to be a container
The number of objects stored is unknown but is huge (1 - 40k maybe)
I need to loop through the containers to find specific object
std::vector is what you need.
You have to consider two things when selecting an STL container:
Data you want to store
Operations you want to perform on the stored data
There was a good diagram in a question here on SO which depicts this. I cannot find the link to it, but I saved it a long time ago (the image is not reproduced here).
You cannot resize an array in C++, not sure where you got that one from. The container you need is std::vector.
The general rule is: use std::vector until it doesn't work, then shift to something that does. There are all sorts of theoretical rules about which one is better, depending on the operations, but I've regularly found that std::vector outperforms the others, even when the most frequent operations are things where std::vector is supposedly worse. Locality seems more important than most of the theoretical considerations on a modern machine.
The one reason you might shift from std::vector is iterator validity. Inserting into a std::vector may invalidate iterators; inserting into a std::list never does.
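The std::list half of that statement can be checked directly. This is a minimal sketch; the function name is made up for the example.

```cpp
#include <iterator>
#include <list>

// Takes an iterator into a list, inserts at the front, at the back and right
// before the iterator, then checks that it still refers to the same element.
inline bool list_iterator_survives_inserts() {
    std::list<int> l = {1, 2, 3};
    auto it = std::next(l.begin());  // refers to the element 2
    l.push_front(0);
    l.push_back(4);
    l.insert(it, 99);                // inserted before 2; 'it' is unaffected
    return *it == 2;
}
```

With a std::vector, the same sequence of insertions could reallocate the buffer, after which dereferencing the old iterator would be undefined behavior. Storing an index instead of an iterator is one common workaround.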
Do you need to loop through the container, or you have a key or ID for your objects?
If you have a key or ID - you can use map to be able to quickly access the object by it, if the id is the simple index - then you can use vector.
Otherwise you can iterate through any container (they all have iterators) but list would be the best if you want to be memory efficient, and vector if you want to be performance oriented.
You can use vector. But if you need to find objects in the container, then consider using set, multiset or map.
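For the "loop through the container to find a specific object" requirement, a linear search over a vector can be written with std::find_if. The Object type and its fields below are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical object type for illustration.
struct Object {
    int id;
    std::string name;
};

// Linear search: O(n), but it runs over contiguous memory.
// Returns nullptr when no object has the requested id.
inline const Object* find_by_id(const std::vector<Object>& objects, int id) {
    auto it = std::find_if(objects.begin(), objects.end(),
                           [id](const Object& o) { return o.id == id; });
    return it != objects.end() ? &*it : nullptr;
}
```

If lookups by id dominate, a std::map<int, Object> or a vector kept sorted by id would replace the linear scan with a logarithmic one.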

Choosing a STL Container for a very large list

I have a very large list of items (~2 million) that I want to optimize for access speed. I iterate through the items using an iterator (++it).
Right now the code is implemented using std::map<std::wstring, STRUCT>.
I wonder whether it's worth replacing the std::map with a std::deque<std::pair<std::wstring, STRUCT>>. I think I would get the advantage of pointer arithmetic and fewer cache misses. Is it worth it?
I know that profiling is the answer but I need an opinion before implementing this ...
If you know the size in advance, then std::vector is clearly the way to go if your objects aren't too big.
std::vector<Object> list;
list.reserve(2000000);
And then fill it as usual.
This is the fastest and least memory-consuming approach. However, you need to be able to allocate enough contiguous memory. Unless your objects are around 1 KB each, that shouldn't be a problem.
With deque, you would lose (or would have to re-implement) the advantage of key-value pairs. If that's not essential for your data, I would consider using deque.
Generally, if you're only searching in this set (no insertions/deletions), you're probably better off using a sorted sequential container, like deque or vector. You can then use a simple binary search to find the needed elements. The advantage of a sequential container is that it uses less memory, has a very simple implementation, and provides better locality of reference. I'd write one version of the code using vector and another using deque, then compare their performance to decide which one to use in the final version.
However, if your structure needs to be updated (new elements inserted or old elements deleted frequently), map is a better choice. Or maybe you have to drop STL containers altogether and use an in-memory database (see SQLite), but that depends heavily on what problem you're solving.
The fastest container to iterate through is usually a vector, so if you want to optimize for iteration at the expense of everything else, use that.
Overall app performance of course will depend how many times you iterate, and how you construct your data in the first place. For a simple test, once your map has been populated you can construct a vector from it as follows:
vector<pair<K,V> > myvec(mymap.begin(), mymap.end());
Where K and V are the key and value types of the map. Then just use the vector iterators in place of the map iterators and compare performance.
Of course, if you want to modify the map in future, then normally it would not be appropriate to optimize for iteration at the expense of everything else.
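The comparison suggested above could be set up roughly as follows. This is a sketch: the contents of STRUCT are assumed, and flatten and sum_payloads are hypothetical helper names.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// STRUCT stands in for the asker's value type; its contents are assumed.
struct STRUCT {
    int payload;
};

using Map = std::map<std::wstring, STRUCT>;
using Flat = std::vector<std::pair<std::wstring, STRUCT>>;

// Copies the map into a vector. Map iteration is ordered by key, so the
// vector comes out sorted by key as well.
inline Flat flatten(const Map& m) {
    return Flat(m.begin(), m.end());
}

// The same loop body works for both containers, which keeps a timing
// comparison between them fair.
template <typename Container>
long long sum_payloads(const Container& c) {
    long long total = 0;
    for (const auto& kv : c) total += kv.second.payload;
    return total;
}
```

Timing sum_payloads on the map and on the flattened vector, with a profiler or std::chrono, would then show whether iteration is actually the bottleneck. Because the vector is sorted by key, binary search with lower_bound remains available for lookups.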

selection of data structure

I use C++. Say I want to store 40 usernames; I will simply use an array. However, if I want to store 40,000 usernames, is this still a good idea in terms of search speed? Which data structure should I use to improve this speed?
You need to specify what the insertion and removal requirements are. Do things need to be removed and inserted at random points in the sequence?
Also, why the requirement to search sequentially? Are you doing searches that aren't suitable for a hash table lookup?
At the moment I'd suggest a deque or a list. Often it's best to choose a container with the interface that makes for the simplest implementation for your algorithm and then only change the choice if the performance is inadequate and an alternative provides the necessary speedup.
A vector has two principal advantages: there is no per-object memory overhead (although vectors will over-allocate to prevent frequent copying), and objects are stored contiguously, so sequential access tends to be fast. These are also its disadvantages. Growing vectors require reallocation and copying, and insertion and removal anywhere other than the end of the vector also require copying. Contiguous storage can cause problems for vectors with large numbers of objects or large objects, as the contiguous storage requirement can be hard to satisfy even with only mild memory fragmentation.
A list doesn't require contiguous storage, but list nodes usually have a per-object overhead of two pointers (in most implementations). This can be significant in a list of very small objects (e.g. in a list of pointers, each node is 3x the size of the data item). Insertion and removal in the middle of a list is very cheap, though, and list nodes never need to be moved in memory once created.
A deque uses chunked storage, so it has a low per-object overhead similar to a vector, but doesn't require contiguous storage over the whole container so doesn't have the same problem with fragmented memory spaces. It is often a very good choice for collections and is often overlooked.
As a rule of thumb, prefer vector to list or, deity forbid, a C-style array.
After the vector is filled, make sure it is properly ordered using the sort algorithm. You can then search for a particular record using either find, binary_search or lower_bound. (You don't need to sort to use find.)
Seriously, unless you are in a resource-constrained environment (embedded platform, phone, or other), use a std::map, save yourself the effort of sorting or searching, and let the container take care of everything. It will most likely be a sorted tree structure, probably balanced (e.g. red-black), which means you will get good search performance. Unless the size of your data is close to the size of one or two pointers, the memory overhead of whatever data structure you pick is negligible. Your graphics card probably has more memory than you are going to use for the data you are thinking about.
As others said there is very little good reason to use vanilla array, if you don't want to use a map use std::vector or std::list depending on whether you need insert/delete data (=>list) or not (=>vector)
Also consider if you really need all that data in memory, how about putting it on disk via sqlite. Or even use sqlite for in memory access. It all depends on what you need to do with your data.
std::vector and std::list seem good for this task. You can use an array if you know the maximum number of records beforehand.
If you only need sequential search and storage, then list is the proper container.
Also, vector wouldn't be a bad choice.