Pointers or Indexes? - c++

I have a network-like data structure, composed of nodes linked together.
The nodes, whose number will change, will be stored in a std::vector<Node> in no particular order, where Node is an appropriate class.
I want to keep track of the links between nodes. Again, the number of these links will change, and I was thinking about using a std::vector<Link> again. The Link class has to contain the information about the two nodes it's connecting, as well as other link features.
Should Link contain
two pointers to the two nodes?
two integers, to be used as indexes into the std::vector<Node>?
or should I adopt a different system (why?)
The first approach, although probably better, is problematic: the pointers will have to be regenerated every time I add or remove nodes from the network. On the other hand, it would free me from having to store nodes in a random-access container.

This is difficult to answer in general. There are various performance and ease-of-use trade-offs.
Using pointers can make some operations more convenient. For example:
link.first->value
vs.
nodes[link.first].value
Using pointers may provide better or worse performance than indices. This depends on various factors. You would need to measure to determine which is better in your case.
Using indices can save space if you can guarantee that there are only a certain number of nodes. You can then use a smaller data type for the indices, whereas with pointers you always need to use the full pointer size no matter how many nodes you have. Using a smaller data type can have a performance benefit by allowing more links to fit within a single cache line.
Copying the network data structure will be easier with indices, since you don't have to recreate the link pointers.
Having pointers to elements of a std::vector can be error-prone, since the vector may move the elements to another place in memory after an insert.
Using indices will allow you to do bounds checking, which may make it easier to find some bugs.
Using indices makes serialization more straightforward.
All that being said, I often find indices to be the best choice overall. Many of the syntactical inconveniences of indices can be overcome by using convenience methods, and you can switch the indices to pointers during certain operations where pointers have better performance.
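For illustration, here is a minimal sketch of the index-based approach (the Network, Node, and Link names are placeholders, not from the question): a convenience accessor hides the extra indirection, and 32-bit indices keep Link small, as discussed above.
#include <cstdint>
#include <vector>

struct Node { double value; };

struct Link {
    std::uint32_t first;   // index into Network::nodes
    std::uint32_t second;  // smaller than a 64-bit pointer, so more links fit per cache line
};

struct Network {
    std::vector<Node> nodes;
    std::vector<Link> links;

    // Convenience accessors make call sites read like the pointer version:
    // net.first(link).value instead of net.nodes[link.first].value
    Node& first(const Link& l) { return nodes[l.first]; }
    Node& second(const Link& l) { return nodes[l.second]; }
};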

Specify the interface for the class you want to use or create. Write unit tests. Do the simplest thing that fulfills the unit tests.
So it depends on the interface of the class. For example, if a Link doesn't export information about the nodes, then it doesn't really matter which approach you choose. On the other hand, if you go for pointers, consider std::shared_ptr.

I would add one (or a number of) link pointers to your Node class and then maintain the links by hand. This saves you having to use an additional container.
If you are looking for something a bit more structured, you can try Boost.Intrusive, which effectively does the same thing in a more generalized fashion.
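For instance, a minimal sketch with Boost.Intrusive (assuming Boost is available; the Node and chain names are illustrative): the link pointers live inside Node itself, so no separate container of links is needed.
#include <boost/intrusive/list.hpp>

// The base hook embeds the prev/next pointers directly in Node.
struct Node : boost::intrusive::list_base_hook<> {
    int value = 0;
};

int main() {
    Node a, b;
    boost::intrusive::list<Node> chain;  // stores no copies, only links through the hooks
    chain.push_back(a);
    chain.push_back(b);
    chain.clear();  // elements must outlive the list, so unlink before a and b are destroyed
}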

You can avoid the Link class altogether if you use:
struct Node
{
std::vector<Node*> parents;
std::vector<Node*> children;
};
With this approach,
You avoid creating another class.
Your memory requirements are reduced.
You make fewer pointer hops when traversing the network of Nodes.
The downside is that you have to make sure that:
When creating or removing a link you have to update two objects.
When you delete a Node, you have to remove pointers to it from its parents and children.
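Here is a minimal sketch of that bookkeeping (the addLink/removeLink helper names are illustrative): every link mutation touches both endpoints, and the same erase logic is what you would run over a node's parents and children before deleting it.
#include <algorithm>
#include <vector>

struct Node {
    std::vector<Node*> parents;
    std::vector<Node*> children;
};

void addLink(Node& parent, Node& child) {
    parent.children.push_back(&child);
    child.parents.push_back(&parent);
}

void removeLink(Node& parent, Node& child) {
    // Erase-remove on both sides so the two objects stay consistent.
    parent.children.erase(
        std::remove(parent.children.begin(), parent.children.end(), &child),
        parent.children.end());
    child.parents.erase(
        std::remove(child.parents.begin(), child.parents.end(), &parent),
        child.parents.end());
}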

You could make it a std::vector<Node *> instead of std::vector<Node> and allocate the nodes with new.
Then:
You can store the pointers to the nodes in the Link class without fear of them becoming invalidated.
You can still randomly access them in the nodes vector.
The downside is that you will need to remember to delete them when they are removed from the node list.
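A minimal sketch of this scheme, assuming manual new/delete as described: node addresses stay stable across vector growth, but removal now has two steps.
#include <vector>

struct Node { int value = 0; };

int main() {
    std::vector<Node*> nodes;
    nodes.push_back(new Node);
    Node* stable = nodes.front();  // remains valid even if the vector reallocates
    (void)stable;

    // Removal must both delete the node and erase the pointer.
    delete nodes.back();
    nodes.pop_back();
}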

My personal experience with vectors in graph-like structures has led me to these invariants.
Don't store data in vectors when other classes hold a pointer/reference to it
You have a graph-like data structure. If the code is not performance-critical (which is something different from performance-sensitive!), you should not bother cache-compacting your data structures.
If you don't know how large your graph will be and your Node data lives in a vector, all iterators and pointers are invalidated once the vector reallocates. That means you have to somehow regenerate your whole data structure, perhaps by creating a copy of all of it and using DFS or similar to adjust the pointers. The same thing happens if you want to remove data from the middle of one of your vectors.
If you do know how large your data will be, you'll be set in stone to keep it that way, and you'll have huge headaches once you reconsider.
Don't use shared pointers to keep track of what needs to be freed
If you have a graph-like data structure and you delete on performance-critical paths, it's unwise to call delete whenever your algorithm decides it no longer needs the data. One possibility is to keep the data on the heap (if it is performance-critical, consider a pool allocator), mark objects you no longer need during your performance-critical sections (if you really, really need to save space, you can consider pointer tagging), and then use a simple mark-and-sweep algorithm afterwards to find the items no longer needed (yes, graph algorithms are one of those cases where Sutter says garbage collection is faster than smart pointers).
Be aware that deferred destruction of objects means that you lose all RAII-like features in your Node classes.

Related

Efficiently manage handle for an array binary heap in C++?

Is there a way to efficiently keep track of handles in an array binary heap?
Since there are no fast lookups built into traditional binary heaps, users need a 'handle_type' for delete or decrease_key on an arbitrary element in the heap.
You can't use a pointer or an index into the array, because heap operations will move the element's location around. I can't think of a simple solution using only stack-like structures the way a traditional heap implementation does. Otherwise, I'd have to use 'new/delete' and that feels inefficient.
I don't want to get too obsessed about premature optimization, but I want this to be part of my standard library of utilities, so I'd like to put in a little effort to learn what is considered best practice for this sort of thing.
Maybe just a naive implementation using 'new/delete' is the way to go here. But I'd like to get some advice from people smarter than me first, if that's ok.
The priority queue implementation in the C++ standard library seems to side-step this issue entirely by simply not supporting a 'decrease_key' operation. I've been leafing through CLRS, and they mention this issue but don't really discuss it:
we shall not pursue them here, other than noting that... these handles do need to be correctly maintained.
Is there a simple approach here I'm overlooking? What do "serious" libraries do when they need a general purpose array heap that needs a 'decrease_key' operation?
Is there a way to efficiently keep track of handles in an array binary heap?
It's hypothetically possible (but not very pretty). Internally, instead of storing an array of elements, the data structure would store an array of pointers (or smart pointers) to a struct containing a pair of an index and an element.
When an element is first inserted into position i in the array, the data structure would initialize the index of this struct to i.
As the data structure moves elements around the array, it should modify the index to reflect the new position.
The result of push can be a pointer to this struct (possibly wrapped in an opaque class). In order to access a specific element (e.g., for decrease_key), you would call some method of the heap with this return value. The heap would then:
Know the address of the array (it is its member, after all)
Know the index within the array, through the struct you just sent it.
It could thereby implement decrease_key, for example.
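A rough sketch of that scheme for a min-heap (all names here are illustrative, and only push and decrease_key are shown): the Slot struct carries the index, and the heap updates it on every move.
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Slot {
    std::size_t index;  // current position in the heap array
    int value;          // the element itself
};

class HandleHeap {
public:
    using handle_type = std::shared_ptr<Slot>;

    handle_type push(int value) {
        auto slot = std::make_shared<Slot>(Slot{data_.size(), value});
        data_.push_back(slot);
        sift_up(data_.size() - 1);
        return slot;  // the caller keeps this to address the element later
    }

    void decrease_key(const handle_type& h, int new_value) {
        h->value = new_value;
        sift_up(h->index);  // h->index always reflects the current position
    }

private:
    void sift_up(std::size_t i) {
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;
            if (data_[parent]->value <= data_[i]->value) break;
            std::swap(data_[parent], data_[i]);
            data_[parent]->index = parent;  // keep indices in sync on every move
            data_[i]->index = i;
            i = parent;
        }
    }

    std::vector<handle_type> data_;  // pointers to slots, not the elements themselves
};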
However, there are probably better (and less cumbersome) solutions. Note that the above solution wouldn't alter the asymptotic complexity of the array heap, but the constants would be worse. Conversely, if you look at this summary of heap running times, you can see that a binary heap doesn't really have good performance for decrease_key operations. If you need that, you're probably better off with a Fibonacci heap (or some other data structure). This leads to your final question:
What do "serious" libraries do when they need a general purpose array heap that needs a 'decrease_key' operation?
Libraries such as boost::heap usually indeed implement other data structures better suited to more advanced operations (e.g., decrease_key). These data structures are naturally node-based, and naturally support return values which are not invalidated as easily as positions in an array.

Containing single data in multiple containers

I have lots of data to acquire and process (nearly a million items) and I don't want to copy or move it around the whole program.
Let me describe the situation with an example. I have a Vector with 100,000 elements, and I want to keep track of the time when each element was inserted into the Vector. So it seems like a good idea to keep both time and data in a Map. However, I still want to use the Vector.
Is there any way to make the second element of the Map refer into the Vector without wasting resources unnecessarily?
The first thing that comes to my mind is storing the addresses of the data in the Vector. However, pointers take 4 bytes (not sure), and if we want to store the address of a char, for example, the pointer is 4 times bigger than the data itself.
Any ideas?
I'd say it's not solely a question of memory consumption, but of consistency as well. It depends on how you're going to use the various views on your original input data. In general I'd advise std::shared_ptr for the original data and std::weak_ptr for the references in the views (note that std::weak_ptr only works together with std::shared_ptr, not std::unique_ptr).
But you're right that this can have a certain memory overhead, because the pointer sizes exceed the size of the pointee objects.
For the latter case, having a kind of Flyweight pattern implementation might be more appropriate.
Containing single data in multiple containers
Yes, you can use the Boost Multi-index Containers Library to index the same data in more than one way with no duplication of content. Unlike homebrew XXX_ptr solutions, the multi-index containers also take care of keeping everything coherent (removing a data unit from the container automatically removes it from all indexes).
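For the vector-plus-insertion-time example above, a sketch along these lines (the Sample struct and its field names are made up) keeps a single copy of each element reachable both by insertion order and by timestamp:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/sequenced_index.hpp>
#include <string>

struct Sample {
    long long insert_time;  // e.g., a timestamp taken at insertion
    std::string data;
};

namespace bmi = boost::multi_index;

using SampleContainer = bmi::multi_index_container<
    Sample,
    bmi::indexed_by<
        bmi::sequenced<>,         // vector-like view in insertion order
        bmi::ordered_non_unique<  // map-like view keyed on insert_time
            bmi::member<Sample, long long, &Sample::insert_time>>>>;

int main() {
    SampleContainer c;
    c.push_back({1000, "first"});
    c.push_back({2000, "second"});
    // Look up by time through index 1; erasing here also removes
    // the element from the sequenced view, keeping both coherent.
    c.get<1>().erase(1000);
}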
Lighter-weight, more specialized solutions (possibly more efficient than either multi-index containers or homebrew XXX_ptr solutions) may also be possible depending on your application's requirements, particularities, and data insertion/lifecycle patterns:
Do you need the memory layout of your original vector to remain unchanged, or can you accommodate a vector of some derived type?
Will your vector contents change? can it grow?
Will the elements therein be (inserted & kept) in chronological order anyway?
Do you need to look up a vector element by insertion time in addition to vector position?

Are there versions of the C++ STL's associative data structures optimized for numerous partial copies?

I have a large tree that grows as my algorithm progresses. Each node contains a set, which I suppose is implemented as a balanced binary search tree. Each node's set shall remain fixed after that node's creation, before its use in creating that node's children.
I fear, however, that copying each set is prohibitively expensive. Instead, I would prefer that each newly created node's set reuse all appropriate portions of the parent node's set. In short, I'm happy copying O(log n) of the set, but not O(n).
Are there any variants of the STL's associative data structures that offer such a partial-copy optimization? Perhaps in Boost? Such a data structure would be trivial to implement in Haskell or OCaml, of course, but it would require more effort in C++.
I know it's not generally productive to suggest a different language, but Haskell's standard container libraries do exactly this. I remember seeing a video (was it Simon Peyton Jones?) talking about this exact problem, and how a Haskell solution ended up being much faster than a C++ solution for the given programmer effort. Of course, this was for a specific problem that had a lot of sets with a lot of shared elements.
There is a fair amount of research into this subject. If you are looking for keywords, I suggest searching for "functional data structures" instead of "immutable data structures", since most functional paradigms benefit from immutability in general. Structures such as finger tree were developed to solve exactly this problem.
I know of no C++ library that implements these data structures. There is nothing stopping you from reading the relevant papers (or the Haskell source code, which is about 1k lines for Data.Set including tests) and implementing them yourself, but I know that is not what you'd want to hear. You'd also need to do some kind of reference counting for the shared nodes, which for such deep structures can have a higher overhead than even simple garbage collectors.
It's practically impossible in C++, since the notion of an immutable container doesn't exist. You may know that you'll be making no changes, and that some sort of shared representation would be preferable, but the compiler and the library don't, and there's no way of communicating this to them.
Each node contains a set, which I suppose is implemented as a balanced binary search tree. Each node's set shall remain fixed after that node's creation, before its use in creating that node's children.
That's a pretty unique case. I would recommend using std::vector instead. (No really!) The code creating the node can still use a set, switching to a vector at the last second. However, the vector is smaller and takes only a tiny number of memory allocations (one if you use reserve), making the algorithm much faster.
typedef unsigned int treekeytype;
typedef std::vector<unsigned int> minortreetype;
typedef std::pair<treekeytype, minortreetype> majornode;
typedef std::map<treekeytype, minortreetype> majortype; // a map, not a set: we look nodes up by key
majortype majortree;
void func(majortype::iterator perform) {
    // Build the working set from the parent's (sorted) vector.
    std::set<unsigned int> results(perform->second.begin(), perform->second.end());
    majortree[perform->first + 1].assign(results.begin(), results.end()); // the only change is here
    majortype::iterator next = majortree.find(perform->first + 1);
    func(next);
}
You can even use std::lower_bound and std::upper_bound to still get O(log(n)) memory accesses, since the vector is sorted the same way the set was, so you won't lose any efficiency. It's pure gain as long as you don't need to insert/remove frequently.
I fear however that copying each set is prohibitively expensive.
If this fear is caused by each set containing mostly the same nodes as its parent's, with the data costly (to copy or in memory, whichever) and only a few nodes changed, make the subtrees contain std::shared_ptrs instead of the data itself. This means the data itself will not get copied, only the pointers.
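A minimal sketch of that idea (Data, DataLess, and SharedSet are made-up names): copying the child's set still copies O(n) pointers, but none of the expensive payloads.
#include <memory>
#include <set>
#include <string>

struct Data {
    std::string payload;  // stand-in for an expensive-to-copy value
};

// Order elements by the pointed-to value, not by pointer address.
struct DataLess {
    bool operator()(const std::shared_ptr<const Data>& a,
                    const std::shared_ptr<const Data>& b) const {
        return a->payload < b->payload;
    }
};

using SharedSet = std::set<std::shared_ptr<const Data>, DataLess>;

int main() {
    SharedSet parent;
    parent.insert(std::make_shared<const Data>(Data{"x"}));
    SharedSet child = parent;  // copies pointers only; the Data objects are shared
}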
I realize this isn't what you were aiming at with the question, but, as JamesKanze said, I know of no such container, other than possibly a bizarre and dangerous use of the STL's rope class. Note that I said and meant STL, not the standard C++ library. They're different.

Least Recently Used cache using C++

I am trying to implement an LRU cache in C++. I would like to know the best design for implementing one. I know LRU should provide find(), add an element, and remove an element, where remove should evict the least recently used element. What are the best ADTs to implement this?
For example: if I use a map with the element as value and a time counter as key, I can search in O(log n) time, inserting is O(n), and deleting is O(log n).
One major issue with LRU caches is that there are few "const" operations; most will change the underlying representation (if only because they bump the element accessed).
This is of course very inconvenient, because it means it's not a traditional STL container, and therefore any idea of exposing iterators is quite complicated: when the iterator is dereferenced, this is an access, which should modify the list we are iterating on... oh my.
And there are the performance considerations, both in terms of speed and memory consumption.
It is unfortunate, but you'll need some way to organize your data in a queue (LRU), with the possibility of removing elements from the middle, and this means your elements will have to be independent from one another. A std::list fits, of course, but it's more than you need. A singly-linked list is sufficient here, since you don't need to iterate the list backward (you just want a queue, after all).
However, one major drawback of those is their poor locality of reference; if you need more speed you'll need to provide your own custom (pool?) allocator for the nodes, so that they are kept as close together as possible. This will also alleviate heap fragmentation somewhat.
Next, you obviously need an index structure (for the cache bit). The most natural is to turn toward a hash map. std::tr1::unordered_map, std::unordered_map, and boost::unordered_map are normally good-quality implementations; some should be available to you. They also allocate extra nodes for hash collision handling; you might prefer other kinds of hash maps, so check out Wikipedia's article on the subject and read about the characteristics of the various implementation techniques.
Continuing, there is the (obvious) matter of threading support. If you don't need thread support, that's fine; if you do, however, it's a bit more complicated:
As I said, there are few const operations on such a structure, thus you don't really need to differentiate Read/Write accesses
Internal locking is fine, but you might find that it doesn't play nicely with your uses. The issue with internal locking is that it doesn't support the concept of a "transaction", since it relinquishes the lock between each call. If this is your case, transform your object into a mutex and provide a std::unique_ptr<Lock> lock() method (in debug, you can assert that the lock is taken at the entry point of each method)
There is (in locking strategies) the issue of reentrance, i.e., the ability to "relock" the mutex from within the same thread; check Boost.Thread for more information about the various locks and mutexes available
Finally, there is the issue of error reporting. Since it is expected that a cache may not be able to retrieve the data you put in, I would consider using an exception "poor taste". Consider either pointers (Value*) or Boost.Optional (boost::optional<Value&>). I would prefer Boost.Optional because its semantics are clear.
The best way to implement an LRU is to use the combination of a std::list and stdext::hash_map (if you want to use only the standard library, std::map).
Store the data in the list so that the least recently used item is at the back, and use the map to point to the list items.
For "get", use the map to find the list node, retrieve the data, move the node to the front (since it was just used), and update the map.
For "insert", remove the last element from the list, add the new data to the front, and update the map.
This is the fastest you can get: if you are using a hash_map, almost all the operations are done in O(1). If using std::map, everything takes O(log n).
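A minimal sketch of that design (the class and member names are illustrative; std::optional stands in for the pointer/Boost.Optional return styles discussed above). Every operation is O(1) on average; splice relinks a node without copying it.
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

class LruCache {
    using Item = std::pair<int, std::string>;  // key/value types are illustrative

public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // "get": look up via the map, move the node to the front (most recently used).
    std::optional<std::string> get(int key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        items_.splice(items_.begin(), items_, it->second);  // relink, no copy
        return it->second->second;
    }

    // "insert": evict from the back if full, add the new item at the front.
    void put(int key, std::string value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (!items_.empty() && items_.size() >= capacity_) {
            index_.erase(items_.back().first);  // least recently used
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

private:
    std::size_t capacity_;
    std::list<Item> items_;  // most recently used at the front
    std::unordered_map<int, std::list<Item>::iterator> index_;
};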
A very good implementation is available here
This article describes a couple of C++ LRU cache implementations (one using STL, one using boost::bimap).
When you say priority, I think "heap", which naturally leads to increase-key and delete-min.
I would not make the cache visible to the outside world at all if I could avoid it. I'd just have a collection (of whatever) and handle the caching invisibly, adding and removing items as needed, but the external interface would be exactly that of the underlying collection.
As far as the implementation goes, a heap is probably the most obvious. It has complexities roughly similar to a map, but instead of building a tree from linked nodes, it arranges items in an array and the "links" are implicit based on array indices. This increases the storage density of your cache and improves locality in the "real" (physical) processor cache.
I suggest a heap and maybe a Fibonacci Heap
I'd go with a normal heap in C++.
With std::make_heap (guaranteed by the standard to be O(n)), std::pop_heap, and std::push_heap from <algorithm>, implementing it would be absolute cake. You only have to worry about increase-key.
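For instance, a plain vector plus the <algorithm> heap functions (a sketch with arbitrary values):
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> h = {3, 1, 4, 1, 5};
    std::make_heap(h.begin(), h.end());  // O(n), max-heap by default
    h.push_back(9);
    std::push_heap(h.begin(), h.end());  // O(log n)
    std::pop_heap(h.begin(), h.end());   // moves the max to the back
    h.pop_back();                        // and removes it
}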

selection of data structure

I use C++. Say I want to store 40 usernames: I will simply use an array. However, if I want to store 40,000 usernames, is this still a good idea in terms of search speed? Which data structure should I use to improve this speed?
You need to specify what the insertion and removal requirements are. Do things need to be removed and inserted at random points in the sequence?
Also, why the requirement to search sequentially? Are you doing searches that aren't suitable for a hash table lookup?
At the moment I'd suggest a deque or a list. Often it's best to choose a container with the interface that makes for the simplest implementation for your algorithm and then only change the choice if the performance is inadequate and an alternative provides the necessary speedup.
A vector has two principal advantages: there is no per-object memory overhead (although vectors will over-allocate to prevent frequent copying), and objects are stored contiguously, so sequential access tends to be fast. These are also its disadvantages: growing vectors require reallocation and copying, and insertion and removal from anywhere other than the end of the vector also require copying. Contiguous storage can be a problem for vectors with large numbers of objects, or large objects, as the contiguous storage requirement can be hard to satisfy even with only mild memory fragmentation.
A list doesn't require contiguous storage, but list nodes usually have a per-object overhead of two pointers (in most implementations). This can be significant for lists of very small objects (e.g., in a list of pointers, each node is 3x the size of the data item). Insertion and removal from the middle of a list is very cheap, though, and list nodes never need to be moved in memory once created.
A deque uses chunked storage, so it has a low per-object overhead similar to a vector, but doesn't require contiguous storage over the whole container so doesn't have the same problem with fragmented memory spaces. It is often a very good choice for collections and is often overlooked.
As a rule of thumb, prefer vector to list or, deity forbid, C-style arrays.
After the vector is filled, make sure it is properly ordered using the sort algorithm. You can then search for a particular record using find, binary_search, or lower_bound. (You don't need to sort to use find.)
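A short sketch of that approach (the usernames are made up):
#include <algorithm>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> users = {"carol", "alice", "bob"};
    std::sort(users.begin(), users.end());

    // O(log n) membership test on the sorted vector.
    bool found = std::binary_search(users.begin(), users.end(), std::string("alice"));
    (void)found;

    // lower_bound gives the position of the first element >= the key.
    auto it = std::lower_bound(users.begin(), users.end(), std::string("bob"));
    (void)it;
}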
Seriously, unless you are in a resource-constrained environment (embedded platform, phone, or other), use a std::map and save yourself the effort of sorting or searching; let the container take care of everything. This will probably be a sorted tree structure, likely balanced (e.g. red-black), which means you will get good searching performance. Unless the size of your data is close to the size of one or two pointers, the memory overhead of whatever data structure you pick is negligible. Your graphics card probably has more memory than you are going to use up for the data you are thinking about.
As others have said, there is very little good reason to use a vanilla array. If you don't want to use a map, use std::vector or std::list, depending on whether you need to insert/delete data (list) or not (vector).
Also consider whether you really need all that data in memory; how about putting it on disk via SQLite? You could even use SQLite for in-memory access. It all depends on what you need to do with your data.
std::vector and std::list seem good for this task. You can use an array if you know the maximum number of records beforehand.
If you need only sequential search and storage, then list is the proper container.
Also, vector wouldn't be a bad choice.