Graph data structure memory management - c++

I would like to implement a custom graph data structure for my project and I had a question about proper memory management.
Essentially, the data structure will contain nodes that have two vectors: one for edges coming into the node and one for edges coming out of the node (no looped edges). The graph is connected. The graph will also contain one 'entry' node that will have no edges coming into it. An edge is simply a pointer to a node.
My question here is: What would be the best method of clearing up memory for this type of data structure? I understand how to do it if there was only one entry edge (at which point this structure degenerates to a n-ary tree), but I'm not sure what to do in the case where there are multiple nodes that have edges going into a single node. I can't just call delete from an arbitrary entry node because this will likely result in 'double free' bugs later on.
For example, suppose I had this subgraph:
C <-- B
^
|
A
If I were to call delete from node B, I would remove the memory allocated for C, but A would still have a pointer to it. So if I wanted to clear all the nodes A had connections to, I would get a double free error.

You will need to perform a search to figure out which node is still connected to the input edge, when you remove a component. If you end up with more than one connected group, you will need to figure out which one of these contains the entry node and remove all others.
No greedy (local) algorithm for this can exist, which can be shown by a simple thought experiment:
Let A, B be subgraphs connected only through the node n, which shall be removed. We are left with two unconnected subgraphs. There is no way of knowing (without a whole bunch of state per node) if we have just removed the only route to the entry node for A or B. And, it is necessary to figure that out, so that the appropriate choice of removing either A or B can be made.
Even if every node stored every single route to the entry node, it would mean you have to clean up all routes in all nodes whenever you remove a single node.
Solution Sketch
Let us talk about a graphical representation of what we need to do:
First, Color the node that is being deleted black. Then perform the following for every node we encounter:
For uncolored nodes:
If the node we came from is black, give this node a new color
If the node we came from is colored, give this node the same color
Travel through every outgoing edge
For colored nodes:
If the node we came from is black, just return
If the node we came from is the same color, just return
If the node we came from has a different color, merge the two colors (e.g. by remembering that green and blue are the same, or by painting every green node blue)
Travel through every outgoing edge
At the end we will know which connected components will exist after we delete the current node. All connected components (plus our original to be deleted node) which do not contain the entry node must be deleted (Note: This may delete every single node, if our to-be-deleted node was the entry node...)
Implementation
You will need a data structure like the following:
struct cleanup {
vector<set<node*>> colors;
node* to_be_deleted;
size_t entry_component;
};
The index into the vector of lists will be your "color". The "color black" will be represented by usage of to_be_deleted. Finally, the entry_component will contain the index of the color that has the entry node.
Now, the previous algorithm can be implemented. There are quite a few things to consider, and the implementation may end up being different, depending on what kind of support structures you already keep for other operations.

The answer depends on the complexity of the graph:
If the graph is a tree, each parent can own its children and delete them in its destructor.
If the graph is a directed acyclic graph, an easy and performant way to handle it is to do reference counting on the nodes.
If the graph can be cyclic, you are out of luck. You will need to keep track of each and every node in your graph, and then do garbage collection. Depending on your use case, you can either do the collection by
cleaning up everything when you are done with the complete graph, or by
repeatedly marking all connected nodes and cleaning up all the unreachable ones.
If there is any possibility to get away with option 1 or 2 (possibly tweaking the problem to ensure that the graph fulfills the constraint), you should do so; option 3 implies significant overheads in terms of code complexity and runtime.

There are a couple of ways. One way is to make your nodes know what other nodes have edges to it. So, if you delete C from B, C will need to remove the edge to it from A. So later when you remove/delete A, it won't try to delete C.
std::shared_ptr or some other type of reference counting may also work for you.

Here's a simple way to avoiding memory problems when implementing a graph: Don't use pointers to represent edges.
Instead, give each node a unique ID number (an incrementing integer counter will suffice). Keep a global unordered_map<int, shared_ptr<Node> > so that you can quickly look up any Node by its ID number. Then each Node can represent its edges as a set of integer Node IDs.
After you delete a Node (i.e. remove it from the global map of Nodes), it's possible that some other Nodes will now have "dangling edges", but that will be easy to detect and handle because when you go to look up the now-removed Node's ID in your global map, the lookup will fail. You can then gracefully respond by ignoring that edge, or by removing that edge its the source Node, or etc.
The advantages of doing it this way: The code remains very simple, and there is no need to worry about reference-cycles, memory leaks, or double-frees.
The disadvantages: It's a little bit less efficient to traverse the graph (since doing a map lookup takes more cycles than a simple pointer dereference) and (depending on what you are doing) the 'dangling edges' might require occasional cleanup sweeps (but those are easy enough to do... just iterate over the global map, and for each node, check each edge in its edge-set and remove the ones with IDs that aren't present in the global map)
Update: If you don't like doing a lot of unordered_map lookups, you could alternatively get very similar functionality by representing your edges using weak_ptr instead. A weak_ptr will automagically become NULL/invalid when the object it is pointing at goes away.

Related

Should B-Tree nodes contain a pointer to their parent (C++ implementation)?

I am trying to implement a B-tree and from what I understand this is how you split a node:
Attempt to insert a new value V at a leaf node N
If the leaf node has no space, create a new node and pick a middle value of N and anything right of it move to the new node and anything to the left of the middle value leave in the old node, but move it left to free up the right indices and insert V in the appropriate of the now two nodes
Insert the middle value we picked into the parent node of N and also add the newly created node to the list of children of the parent of N (thus making N and the new node siblings)
If the parent of N has no free space, perform the same operation and along with the values also split the children between the two split nodes (so this last part applies only to non-leaf nodes)
Continue trying to insert the previous split's middle point into the parent until you reach the root and potentially split the root itself, making a new root
This brings me to the question - how do I traverse upwards, am I supposed to keep a pointer of the parent?
Because I can only know if I have to split the leaf node when I have reached it for insertion. So once I have to split it, I have to somehow go back to its parent and if I have to split the parent as well, I have to keep going back up.
Otherwise I would have to re-traverse the tree again and again each time to find the next parent.
Here is an example of my node class:
template<typename KEY, typename VALUE, int DEGREE>
struct BNode
{
KEY Keys[DEGREE];
VALUE Values[DEGREE];
BNode<KEY, VALUE, DEGREE>* Children[DEGREE + 1];
BNode<KEY, VALUE, DEGREE>* Parent;
bool IsLeaf;
};
(Maybe I should not have an IsLeaf field and instead just check if it has any children, to save space)
Even if you don't use recursion or an explicit stack while going down the tree, you can still do it without parent pointers if you split nodes a bit sooner with a slightly modified algorithm, which has this key characteristic:
When encountering a node that is full, split it, even when it is not a leaf.
With this pre-emptive splitting algorithm, you only need to keep a reference to the immediate parent (not any other ancestor) to make the split possible, since now it is guaranteed that a split will not lead to another, cascading split more upwards in the tree. This algorithm requires that the maximum degree (number of children) of the B-tree is even (as otherwise one of the two split nodes would have too few keys to be considered valid).
See also Wikipedia which describes this alternative algorithm as follows:
An alternative algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way preemptively. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this algorithm, we must be able to send one element to the parent and split the remaining ๐‘ˆโˆ’2 elements into two legal nodes, without adding a new element. This requires ๐‘ˆ = 2๐ฟ rather than ๐‘ˆ = 2๐ฟโˆ’1, which accounts for why some textbooks impose this requirement in defining B-trees.
The same article defines ๐‘ˆ and ๐ฟ:
Every internal node contains a maximum of ๐‘ˆ children and a minimum of ๐ฟ children.
For a comparison with the standard insertion algorithm, see also Will a B-tree with preemptive splitting always have the same height for any input order of keys?
You don't need parent pointers if all your operations start at the root.
I usually code the insert recursively, such that calling node.insert(key) either returns null or a new key to insert at its parent's level. The insert starts with root.insert(key), which finds the appropriate child and calls child.insert(key).
When a leaf node is reached the insert is performed, and non-null is returned if the leaf splits. The parent would then insert the new internal key and return non-null if it splits, etc. If root.insert(key) returns non-null, then it's time to make a new root

Data structure to represent adjacency list for graph, when receiving ambiguous specs

I am designing an application which should be based on graphs.
I am not sure which is the best way to represent the graph adjacency list in memory. The requirements from the customer are quite vague, so I must make some several assumptions. The nodes of the graphs are some IDs, but I am not sure if the IDs are sequential or not. What does the graph theory say, when it comes to general specifications?
If they are sequential, the number of nodes (N) should also limit the max IDs, and basically it is ensured that the IDs will cover the interval 1,2โ€ฆN. See option A below.
If they are not sequential, the IDs could jump from 1 to e.g. 11, and may skip some natural numbers in the specification. See option B below.
Beside ID, there is also a c++ data structure, where I store multiple info ( payload, connected edges etc.)
There are two options left for my algorithm:
A. Represent the graph as a vector< Data > , and index of vector will mean to the nodeID.
B. Represent the graph as a map , where Node ID is the key, and Data is the storage value.
Map would allow me having random IDs, letโ€™s say that the input data is given randomly.
The literature ( e.g. DFS, BFS or other graph articles) is mostly considering option A, where node IDs fully cover an interval [1..N]. I would also go for this option, as it represents a commonly agreed notation.
Then, add this to the documentation/precondition section of my application.
What is the best option to proper cover customerโ€™s ambiguous specifications?
You could choose to represent a your graph as a combination of your two listed options: have a Node structure that contains two members - an integer label and a the other struct you need.
The graph will store a std::vector<Node*> nodes;. However, given the restriction that a node's label will not match its position in the above vector, you will need to store the correspondence between label and vector indexes in a std::map<int, int> corresp;
Given this structure, if you need to access the Node* with a label value of 11, you would do Node* node = nodes[corresp[label]];
Also, the label could be any other type, for instance a std::string. The only modification that needs to be done is to change the key type of the map to std::string.
Case 1: sequential IDs. Then you may store the nodes in an array in such a way that the indexes correspond to the IDs.
Case 2: sparse IDs.
Usually the representation of the nodes of a graph allows them to have a payload (attributes), such as the ID. If you don't need to access the nodes by ID, use an array and you are done.
If you do need to access the nodes by ID, use a dictionary (map) to establish the correspondence. You can also store the nodes directly in the dictionary, but node enumeration or sorting will be harder.
I usually recommend identifying things with (maybe smart) pointers if they are objects, since that's the mechanism that C/C++ provides to identify objects.
Fundamentally, your graph consists of a number of nodes and edges, so you would generally have something like:
class Node {
int id;
Data data;
std::vector<Node *> edges;
}
Then, in your Graph class, you will need some kind of map for every other way you need to access nodes. You will probably need to be able to find nodes by id, so the graph class will need some kind of index for that -- a vector<Node *> nodesById for dense ids or a map<int,Node*> nodesById for sparse ids. Which one to choose should not be an important decision that has a lot of consequences. Add a method Node *getNodeById(int id), and then you can change the representation whenever you want. Always remember that, in software development, when a decision doesn't have an obvious answer, or when the best answer is likely to change in the future, then making it easy to change your mind is much better than making the right choice.
As people add requirements to your graph, you may need to access nodes in different ways and may have to add more kinds of indexes to support those particular use cases.
Two jobs you will need to do with your graph are construction and destruction. Construction will probably require that nodesById index. Destruction will definately require some way to enumerate all the nodes, and whichever representation you choose for nodesById will suffice for that as well.
You could use a map of vectors. Something like this:
Map<int,vector<Node *>>;
The key in this map would be your node id. The corresponding vector has the first entry as your corresponding node of that particular Id and then all the edges from that Id node.
Suppose, your graph has a node with id 2, and this node has its edges with nodes with id 3,4 and 6.
So your entry corresponding to the key 2 in your map would be a vector, that has its first entry as node with id 2, then next entry as node of id 3, then with 4 and then at last with node 6.
Your each vector entry of Node could look similar to this:
struct Node {
int id,
InfoData obj;
}

Modelling a graph with classes 'edge' and 'node'

I have to model an unweighted graph in C++ and do BFS and DFS. But, I have to create two separate classes: node and edge which would contain the two extremities of type node. I do not understand how would I be able to do any search in the graph using the edge class. If it were me, I would only create the node class and have the neighbours of node X in a list/array inside the class. Is there something I am missing? How could I take advantege of the class edge? Is there any way to do the searches without using an adjacency matrix or list? Thank you!
You can still use adjacency lists. But they will contain not references to other nodes, but rather references to edge instances, which in turn contain both endpoints as node references, as you mention. Of course, that seems kind of redundant because if you have, say source and target nodes in each edge and then you access some edge of a given node using something like node.edges[i].source, then you instantly know that this source is the node itself, otherwise this edge wouldn't even be in this node's adjacency list. But source may still be useful if you pass a reference to edge alone somewhere, where you don't have the source node at hand.
That aside, for the simplest graph this kind of approach seems like an overkill, because edges only store source and destination edges, the former being mostly redundant. But you may need to extend your edge class later with something like weights, labels, auxiliary data like residual flow or something like that. So having such class is a nice idea after all.
Moreover, some algorithms work directly on edges. For example, you may need to search for an edge satisfying some criterion. Having a separate class gives you freedom to create lists of edges without ad-hoc approaches like pair<Node, Node> or something.
You miss the constant of space complexity in storing the edges.
When you store the edges in adjacency matrix/list, you have to store both (node1, node2) and (node2, node1) for each edge.
You double the space, although the big-O complexity stays the same.
However, there're times when such space constant has to be considered.
You can take advantage of the class edge when you would like to save as much space as possible, and you would like to prioritize space over time.
Linear search on all the edge instances is a way.
Linear search is slow but you prioritize space over time.
Maybe parallel search when you have a distributed system is another way, but you have to verify.
Your homework question may have an artificial constraint of the design of the edge class. Artificial constraint idea comes from https://meta.stackexchange.com/questions/10811/how-do-i-ask-and-answer-homework-questions

QuadTree for 2D collision detection

I'm currently working on a 2D shoot them up type of game, and I'm using a quad tree for my collision detections. I wrote a working quad tree that correctly pushes my actors into the nodes/leaves they belong to in the tree. However, I've got a few problems.
Firstly, how do I actually use my quadtree to select against which other objects an object should test collisions ? I'm unsure about how this is done.
Which brings up a second question. Say I have an object in node that is not a neighbour of another node, but that the object is large enough that it spans a few node, how can I check for an actual collision, since I'm guessing the tree might consider it's not close enough to collide with objects in a "far away" node? Should objects that don't completely fit in a node be kept in the parent node?
In my game, most of the objects are of different sizes and moving around.
I've read a good number of blogs/articles about quadtrees but most just explain how to build a tree which is not really what I'm looking for.
Any help/info is welcome.
You can establish a convention that every element is contained in the smallest quadtree node which contains it fully.
Then when you check the collisions for node A, you proceed like this:
current node = root node
check collisions of A with each element directly in current node
if A can be contained entirely in any of sub-nodes of the current node, set the current node to that sub-node and go to 2 again
finally, check collisions of A with all the elements in children nodes of the current node, recursively.
Note that the smaller the objects, the deeper they will be located in the quad tree, hence they will be compared less often.

How can one create cyclic (and immutable) data structures in Clojure without extra indirection?

I need to represent directed graphs in Clojure. I'd like to represent each node in the graph as an object (probably a record) that includes a field called :edges that is a collection of the nodes that are directly reachable from the current node. Hopefully it goes without saying, but I would like these graphs to be immutable.
I can construct directed acyclic graphs with this approach as long as I do a topological sort and build each graph "from the leaves up".
This approach doesn't work for cyclic graphs, however. The one workaround I can think of is to have a separate collection (probably a map or vector) of all of the edges for an entire graph. The :edges field in each node would then have the key (or index) into the graph's collection of edges. Adding this extra level of indirection works because I can create keys (or indexes) before the things they (will) refer to exist, but it feels like a kludge. Not only do I need to do an extra lookup whenever I want to visit a neighboring node, but I also have to pass around the global edges collection, which feels very clumsy.
I've heard that some Lisps have a way of creating cyclic lists without resorting to mutation functions. Is there a way to create immutable cyclic data structures in Clojure?
You can wrap each node in a ref to give it a stable handle to point at (and allow you to modify the reference which can start as nil). It is then possible to possible to build cyclic graphs that way. This does have "extra" indirection of course.
I don't think this is a very good idea though. Your second idea is a more common implementation. We built something like this to hold an RDF graph and it is possible to build it out of the core data structures and layer indices over the top of it without too much effort.
I've been playing with this the last few days.
I first tried making each node hold a set of refs to edges, and each edge hold a set of refs to the nodes. I set them equal to each other in a (dosync... (ref-set...)) type of operation. I didn't like this because changing one node requires a large amount of updates, and printing out the graph was a bit tricky. I had to override the print-method multimethod so the repl wouldn't stack overflow. Also any time I wanted to add an edge to an existing node, I had to extract the actual node from the graph first, then do all sorts of edge updates and that sort of thing to make sure everyone was holding on to the most recent version of the other thing. Also, because things were in a ref, determining whether something was connected to something else was a linear-time operation, which seemed inelegant. I didn't get very far before determining that actually performing any useful algorithms with this method would be difficult.
Then I tried another approach which is a variation of the matrix referred to elsewhere. The graph is a clojure map, where the keys are the nodes (not refs to nodes), and the values are another map in which the keys are the neighboring nodes and single value of each key is the edge to that node, represented either as a numerical value indicating the strength of the edge, or an edge structure which I defined elsewhere.
It looks like this, sort of, for 1->2, 1->3, 2->5, 5->2
(def graph {node-1 {node-2 edge12, node-3 edge13},
node-2 {node-5 edge25},
node-3 nil ;;no edge leaves from node 3
node-5 {node-2 edge52}) ;; nodes 2 and 5 have an undirected edge
To access the neighbors of node-1 you go (keys (graph node-1)) or call the function defined elsewhere (neighbors graph node-1), or you can say ((graph node-1) node-2) to get the edge from 1->2.
Several advantages:
Constant time lookup of a node in the graph and of a neighboring node, or return nil if it doesn't exist.
Simple and flexible edge definition. A directed edge exists implicitly when you add a neighbor to a node entry in the map, and its value (or a structure for more information) is provided explicitly, or nil.
You don't have to look up the existing node to do anything to it. It's immutable, so you can define it once before adding it to the graph and then you don't have to chase it around getting the latest version when things change. If a connection in the graph changes, you change the graph structure, not the nodes/edges themselves.
This combines the best features of a matrix representation (the graph topology is in the graph map itself not encoded in the nodes and edges, constant time lookup, and non-mutating nodes and edges), and the adjacency-list (each node "has" a list of its neighboring nodes, space efficient since you don't have any "blanks" like a canonical sparse matrix).
You can have multiples edges between nodes, and if you accidentally define an edge which already exists exactly, the map structure takes care of making sure you are not duplicating it.
Node and edge identity is kept by clojure. I don't have to come up with any sort of indexing scheme or common reference point. The keys and values of the maps are the things they represent, not a lookup elsewhere or ref. Your node structure can be all nils, and as long as it's unique, it can be represented in the graph.
The only big-ish disadvantage I see is that for any given operation (add, remove, any algorithm), you can't just pass it a starting node. You have to pass the whole graph map and a starting node, which is probably a fair price to pay for the simplicity of the whole thing. Another minor disadvantage (or maybe not) is that for an undirected edge you have to define the edge in each direction. This is actually okay because sometimes an edge has a different value for each direction and this scheme allows you to do that.
The only other thing I see here is that because an edge is implicit in the existence of a key-value pair in the map, you cannot define a hyperedge (ie one which connects more than 2 nodes). I don't think this is a big deal necessarily since most graph algorithms I've come across (all?) only deal with an edge that connects 2 nodes.
I ran into this challenge before and concluded that it isn't possible using truly immutable data structures in Clojure at present.
However you may find one or more of the following options acceptable:
Use deftype with ":unsynchronized-mutable" to create a mutable :edges field in each node that you change only once during construction. You can treat it as read-only from then on, with no extra indirection overhead. This approach will probably have the best performance but is a bit of a hack.
Use an atom to implement :edges. There is a bit of extra indirection, but I've personally found reading atoms to be extremely efficient.