Data structure to represent adjacency list for graph, when receiving ambiguous specs - c++

I am designing an application which should be based on graphs.
I am not sure which is the best way to represent the graph adjacency list in memory. The requirements from the customer are quite vague, so I must make several assumptions. The nodes of the graphs are some IDs, but I am not sure if the IDs are sequential or not. What does graph theory say when it comes to general specifications?
If they are sequential, the number of nodes (N) should also limit the max IDs, and basically it is ensured that the IDs will cover the interval 1,2…N. See option A below.
If they are not sequential, the IDs could jump from 1 to e.g. 11, and may skip some natural numbers in the specification. See option B below.
Besides the ID, there is also a C++ data structure in which I store additional information (payload, connected edges, etc.).
There are two options left for my algorithm:
A. Represent the graph as a vector<Data>, where the index into the vector is the node ID.
B. Represent the graph as a map, where the node ID is the key and Data is the stored value.
A map would allow arbitrary IDs, for example when the input data arrives in no particular order.
The literature ( e.g. DFS, BFS or other graph articles) is mostly considering option A, where node IDs fully cover an interval [1..N]. I would also go for this option, as it represents a commonly agreed notation.
Then, add this to the documentation/precondition section of my application.
What is the best option to properly cover the customer's ambiguous specifications?

You could choose to represent your graph as a combination of your two listed options: have a Node structure that contains two members, an integer label and the other struct you need.
The graph will store a std::vector<Node*> nodes;. However, given the restriction that a node's label will not match its position in the above vector, you will need to store the correspondence between label and vector indexes in a std::map<int, int> corresp;
Given this structure, if you need to access the Node* with a label value of 11, you would do Node* node = nodes[corresp[label]];
Also, the label could be any other type, for instance a std::string. The only modification that needs to be done is to change the key type of the map to std::string.
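To make the label-to-index correspondence concrete, here is a minimal sketch. It assumes a hypothetical Data payload and, unlike the raw Node* above, stores the nodes in std::unique_ptr so the graph owns them; the names Graph, addNode and find are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical payload type; stands in for "the other struct you need".
struct Data {
    std::string payload;
};

struct Node {
    int label;  // customer-facing ID, may be sparse
    Data data;
};

class Graph {
public:
    // Adds a node and returns its position in the vector.
    // Throws if the label is already in use.
    std::size_t addNode(int label, Data data) {
        if (corresp_.count(label) != 0) {
            throw std::invalid_argument("duplicate node label");
        }
        nodes_.push_back(std::make_unique<Node>(Node{label, std::move(data)}));
        corresp_[label] = nodes_.size() - 1;
        assert(nodes_.size() == corresp_.size());  // one map entry per node
        return nodes_.size() - 1;
    }

    // Returns nullptr for an unknown label. Uses find() rather than
    // operator[], which would silently insert a default entry.
    Node* find(int label) const {
        auto it = corresp_.find(label);
        if (it == corresp_.end()) return nullptr;
        return nodes_[it->second].get();
    }

    std::size_t size() const { return nodes_.size(); }

private:
    std::vector<std::unique_ptr<Node>> nodes_;  // dense storage
    std::map<int, std::size_t> corresp_;        // label -> index in nodes_
};
```

With this, labels 1 and 11 can be added back to back and end up at indexes 0 and 1, while find(5) reports a missing node instead of creating one.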

Case 1: sequential IDs. Then you may store the nodes in an array in such a way that the indexes correspond to the IDs.
Case 2: sparse IDs.
Usually the representation of the nodes of a graph allows them to have a payload (attributes), such as the ID. If you don't need to access the nodes by ID, use an array and you are done.
If you do need to access the nodes by ID, use a dictionary (map) to establish the correspondence. You can also store the nodes directly in the dictionary, but node enumeration or sorting will be harder.

I usually recommend identifying things with (maybe smart) pointers if they are objects, since that's the mechanism that C/C++ provides to identify objects.
Fundamentally, your graph consists of a number of nodes and edges, so you would generally have something like:
class Node {
public:
    int id;                     // node identifier
    Data data;                  // payload
    std::vector<Node *> edges;  // adjacent nodes
};
Then, in your Graph class, you will need some kind of map for every other way you need to access nodes. You will probably need to be able to find nodes by id, so the graph class will need some kind of index for that -- a vector<Node *> nodesById for dense ids or a map<int,Node*> nodesById for sparse ids. Which one to choose should not be an important decision that has a lot of consequences. Add a method Node *getNodeById(int id), and then you can change the representation whenever you want. Always remember that, in software development, when a decision doesn't have an obvious answer, or when the best answer is likely to change in the future, then making it easy to change your mind is much better than making the right choice.
As people add requirements to your graph, you may need to access nodes in different ways and may have to add more kinds of indexes to support those particular use cases.
Two jobs you will need to do with your graph are construction and destruction. Construction will probably require that nodesById index. Destruction will definitely require some way to enumerate all the nodes, and whichever representation you choose for nodesById will suffice for that as well.
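As a rough sketch of the idea above, the class below hides the index behind getNodeById. The owning vector, the getOrCreateNode and addEdge helpers, and the empty Data type are assumptions made for illustration; only Node and getNodeById come from the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

struct Data {};  // hypothetical payload

struct Node {
    int id;
    Data data;
    std::vector<Node*> edges;  // non-owning; the Graph owns every node
};

class Graph {
public:
    // Creates the node if the id is new, otherwise returns the existing one.
    Node* getOrCreateNode(int id) {
        if (Node* existing = getNodeById(id)) return existing;
        owned_.push_back(std::make_unique<Node>());
        Node* n = owned_.back().get();
        n->id = id;
        nodesById_[id] = n;
        assert(owned_.size() == nodesById_.size());  // index covers all nodes
        return n;
    }

    // The only place callers depend on how ids are indexed; swapping the map
    // for a vector (dense ids) would not change this signature.
    Node* getNodeById(int id) const {
        auto it = nodesById_.find(id);
        if (it == nodesById_.end()) return nullptr;
        return it->second;
    }

    void addEdge(int from, int to) {
        Node* source = getOrCreateNode(from);
        Node* target = getOrCreateNode(to);
        source->edges.push_back(target);
    }

    std::size_t nodeCount() const { return owned_.size(); }

private:
    std::vector<std::unique_ptr<Node>> owned_;  // enumeration and destruction
    std::map<int, Node*> nodesById_;            // sparse-id index
};
```

Destruction needs no extra code here: when the Graph goes away, owned_ releases every node exactly once, regardless of how the edges are wired.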

You could use a map of vectors. Something like this:
std::map<int, std::vector<Node *>> graph;
The key in this map is your node id. The first entry of the corresponding vector is the node with that id, followed by all the nodes it has edges to.
Suppose your graph has a node with id 2, and this node has edges to the nodes with ids 3, 4 and 6.
Then the entry for key 2 in your map is a vector whose first entry is the node with id 2, followed by the nodes with ids 3, 4 and 6.
Each Node entry in the vector could look similar to this:
struct Node {
    int id;
    InfoData obj;
};

Related

Graph data structure memory management

I would like to implement a custom graph data structure for my project and I had a question about proper memory management.
Essentially, the data structure will contain nodes that have two vectors: one for edges coming into the node and one for edges coming out of the node (no looped edges). The graph is connected. The graph will also contain one 'entry' node that will have no edges coming into it. An edge is simply a pointer to a node.
My question here is: What would be the best method of clearing up memory for this type of data structure? I understand how to do it if there was only one entry edge (at which point this structure degenerates to a n-ary tree), but I'm not sure what to do in the case where there are multiple nodes that have edges going into a single node. I can't just call delete from an arbitrary entry node because this will likely result in 'double free' bugs later on.
For example, suppose I had this subgraph:
C <-- B
^
|
A
If I were to call delete from node B, I would remove the memory allocated for C, but A would still have a pointer to it. So if I wanted to clear all the nodes A had connections to, I would get a double free error.
You will need to perform a search to figure out which nodes are still connected to the entry node when you remove a component. If you end up with more than one connected group, you will need to figure out which one of these contains the entry node and remove all others.
No greedy (local) algorithm for this can exist, which can be shown by a simple thought experiment:
Let A, B be subgraphs connected only through the node n, which shall be removed. We are left with two unconnected subgraphs. There is no way of knowing (without a whole bunch of state per node) if we have just removed the only route to the entry node for A or B. And, it is necessary to figure that out, so that the appropriate choice of removing either A or B can be made.
Even if every node stored every single route to the entry node, it would mean you have to clean up all routes in all nodes whenever you remove a single node.
Solution Sketch
Let us talk about a graphical representation of what we need to do:
First, Color the node that is being deleted black. Then perform the following for every node we encounter:
For uncolored nodes:
If the node we came from is black, give this node a new color
If the node we came from is colored, give this node the same color
Travel through every outgoing edge
For colored nodes:
If the node we came from is black, just return
If the node we came from is the same color, just return
If the node we came from has a different color, merge the two colors (e.g. by remembering that green and blue are the same, or by painting every green node blue)
Travel through every outgoing edge
At the end we will know which connected components will exist after we delete the current node. All connected components (plus our original to be deleted node) which do not contain the entry node must be deleted (Note: This may delete every single node, if our to-be-deleted node was the entry node...)
Implementation
You will need a data structure like the following:
struct cleanup {
vector<set<node*>> colors;
node* to_be_deleted;
size_t entry_component;
};
The index into the vector of sets will be your "color". The "color black" will be represented by usage of to_be_deleted. Finally, the entry_component will contain the index of the color that has the entry node.
Now, the previous algorithm can be implemented. There are quite a few things to consider, and the implementation may end up being different, depending on what kind of support structures you already keep for other operations.
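One possible shape for this, as a sketch only: the code below does not implement the color-merging traversal literally. It swaps in a plain breadth-first labelling that follows edges in both directions and skips the node being removed, which yields the same "colors" (connected components). The node layout with in/out vectors is taken from the question; the all_nodes list and both function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <unordered_map>
#include <vector>

struct node {
    std::vector<node*> in;   // edges coming into this node
    std::vector<node*> out;  // edges going out of this node
};

// Labels every node except `removed` with a component index ("color").
// Two nodes share a color when a chain of edges, followed in either
// direction and never passing through `removed`, links them.
std::unordered_map<node*, std::size_t>
color_components(const std::vector<node*>& all_nodes, node* removed) {
    assert(removed != nullptr);
    std::unordered_map<node*, std::size_t> color;
    std::size_t next_color = 0;
    for (node* start : all_nodes) {
        if (start == removed || color.count(start) != 0) continue;
        std::queue<node*> pending;
        // Gives `next` the current color and queues it, unless it is the
        // removed node or already colored.
        auto visit = [&](node* next) {
            if (next == removed || color.count(next) != 0) return;
            color[next] = next_color;
            pending.push(next);
        };
        visit(start);
        while (!pending.empty()) {
            node* cur = pending.front();
            pending.pop();
            for (node* next : cur->in) visit(next);
            for (node* next : cur->out) visit(next);
        }
        ++next_color;
    }
    return color;
}

// Returns the nodes that end up in a component without the entry node,
// i.e. the ones that have to be deleted together with `removed`.
std::vector<node*> orphaned_after_removal(const std::vector<node*>& all_nodes,
                                          node* removed, node* entry) {
    std::vector<node*> orphans;
    const auto color = color_components(all_nodes, removed);
    const auto entry_it = color.find(entry);  // absent if entry == removed
    for (node* n : all_nodes) {
        if (n == removed) continue;
        if (entry_it == color.end() || color.at(n) != entry_it->second) {
            orphans.push_back(n);
        }
    }
    return orphans;
}
```

The caller would still have to unlink and free the returned nodes; the sketch only answers the "which ones" part.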
The answer depends on the complexity of the graph:
If the graph is a tree, each parent can own its children and delete them in its destructor.
If the graph is a directed acyclic graph, an easy and performant way to handle it is to do reference counting on the nodes.
If the graph can be cyclic, you are out of luck. You will need to keep track of each and every node in your graph, and then do garbage collection. Depending on your use case, you can either do the collection by
cleaning up everything when you are done with the complete graph, or by
repeatedly marking all connected nodes and cleaning up all the unreachable ones.
If there is any possibility to get away with option 1 or 2 (possibly tweaking the problem to ensure that the graph fulfills the constraint), you should do so; option 3 implies significant overheads in terms of code complexity and runtime.
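For the cyclic case, the "mark all connected nodes, clean up the unreachable ones" variant could look roughly like this. It is a sketch under the assumption that one Graph object owns every node and that reachability is defined by outgoing edges from the entry node; the names Graph, addNode and collect are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <vector>

struct Node {
    std::vector<Node*> out;  // non-owning outgoing edges
};

class Graph {
public:
    Node* addNode() {
        nodes_.push_back(std::make_unique<Node>());
        return nodes_.back().get();
    }

    // Mark: collect everything reachable from `entry` along outgoing edges.
    // Sweep: destroy every node that was not marked.
    // Returns the number of destroyed nodes.
    std::size_t collect(Node* entry) {
        assert(entry != nullptr);
        std::unordered_set<Node*> marked;
        std::vector<Node*> stack{entry};
        while (!stack.empty()) {
            Node* cur = stack.back();
            stack.pop_back();
            if (!marked.insert(cur).second) continue;  // already visited
            for (Node* next : cur->out) stack.push_back(next);
        }
        const std::size_t before = nodes_.size();
        nodes_.erase(std::remove_if(nodes_.begin(), nodes_.end(),
                                    [&](const std::unique_ptr<Node>& n) {
                                        return marked.count(n.get()) == 0;
                                    }),
                     nodes_.end());
        return before - nodes_.size();
    }

    std::size_t size() const { return nodes_.size(); }

private:
    std::vector<std::unique_ptr<Node>> nodes_;  // single owner of all nodes
};
```

Because each node is owned exactly once by the vector, cycles cannot cause a double free, and the surviving nodes only point at other survivors (anything they point to is reachable by definition).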
There are a couple of ways. One way is to make each node know which other nodes have edges to it. So, if you delete C from B, C will need to remove the edge to it from A. Later, when you remove/delete A, it won't try to delete C.
std::shared_ptr or some other type of reference counting may also work for you.
Here's a simple way to avoid memory problems when implementing a graph: Don't use pointers to represent edges.
Instead, give each node a unique ID number (an incrementing integer counter will suffice). Keep a global unordered_map<int, shared_ptr<Node> > so that you can quickly look up any Node by its ID number. Then each Node can represent its edges as a set of integer Node IDs.
After you delete a Node (i.e. remove it from the global map of Nodes), it's possible that some other Nodes will now have "dangling edges", but that will be easy to detect and handle because when you go to look up the now-removed Node's ID in your global map, the lookup will fail. You can then gracefully respond by ignoring that edge, by removing that edge from its source Node, etc.
The advantages of doing it this way: The code remains very simple, and there is no need to worry about reference-cycles, memory leaks, or double-frees.
The disadvantages: It's a little bit less efficient to traverse the graph (since doing a map lookup takes more cycles than a simple pointer dereference) and (depending on what you are doing) the 'dangling edges' might require occasional cleanup sweeps (but those are easy enough to do... just iterate over the global map, and for each node, check each edge in its edge-set and remove the ones with IDs that aren't present in the global map)
Update: If you don't like doing a lot of unordered_map lookups, you could alternatively get very similar functionality by representing your edges using weak_ptr instead. A weak_ptr automatically expires when the object it is pointing at goes away; calling lock() on it then returns an empty shared_ptr.
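A small sketch of the ID-based scheme, including the cleanup sweep for dangling edges. The map is a class member here rather than a global, and the Graph class with its method names is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <unordered_set>

struct Node {
    std::unordered_set<int> edges;  // IDs of target nodes, not pointers
};

class Graph {
public:
    // Creates a node and returns its ID (an incrementing counter).
    int addNode() {
        const int id = nextId_++;
        nodes_[id] = std::make_shared<Node>();
        return id;
    }

    void addEdge(int from, int to) {
        auto it = nodes_.find(from);
        assert(it != nodes_.end() && "source node must exist");
        it->second->edges.insert(to);
    }

    // Removing a node may leave dangling edge IDs in other nodes.
    void removeNode(int id) { nodes_.erase(id); }

    // Returns an empty pointer when the ID no longer exists.
    std::shared_ptr<Node> lookup(int id) const {
        auto it = nodes_.find(id);
        if (it == nodes_.end()) return nullptr;
        return it->second;
    }

    // Cleanup sweep: drop every edge whose target ID is gone.
    // Returns the number of edges removed.
    std::size_t sweepDanglingEdges() {
        std::size_t removed = 0;
        for (auto& entry : nodes_) {
            auto& edges = entry.second->edges;
            for (auto it = edges.begin(); it != edges.end();) {
                if (nodes_.count(*it) == 0) {
                    it = edges.erase(it);
                    ++removed;
                } else {
                    ++it;
                }
            }
        }
        return removed;
    }

private:
    int nextId_ = 0;
    std::unordered_map<int, std::shared_ptr<Node>> nodes_;
};
```

Traversal code would call lookup for each edge ID and simply skip the edge when the result is empty.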

implementing bfs on negative numbered nodes

We can easily implement the breadth-first search algorithm in C++ if the nodes or vertices are numbered with non-negative integers.
But how do we deal with it when the nodes or vertices have negative numbers?
Suppose a node is numbered -200: an assignment such as visited[-200] = true on a bool array indexes out of bounds and can produce a run-time error.
What will be the approach in this case?
"if nodes are numbered negative" ~> in case you need each node to have a unique identifier that will be used to access these nodes within some containers, there's no reason why you would allow the value of such identifier to be negative.
Just like you pointed out: visited[-200] = true; makes no sense. In case you need each node to have a unique id / index of this kind, do it regardless of the value stored within it, i.e.:
struct Node {
unsigned int id;
int value;
...
};
When dealing with graph labels, a reasonable approach seems to be the use of property maps (see, e.g., the Boost Graph Library). The basic idea is to access node properties through a uniform interface, while the details of how the properties are actually accessed are left to the concrete property map being used.
Based on what you are saying you are accessing the labels using a node ID. There are two obvious approaches with somewhat different characteristics depending on the actual layout of the IDs:
If the node labels are relatively dense and the range of valid node IDs is known, using an array with a suitable adjustment using an offset is a reasonable approach. That is, you'd use an array like label[nodeID - minNodeID].
If the node labels are randomly distributed, you'd use an associative container with the node ID as the container's key.
If you can control the layout of your node representation, you should probably consider storing the labels in the node, especially if the node IDs are randomly distributed.
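For the randomly distributed case, a BFS that keeps its visited set in an associative container might look like the sketch below. The AdjacencyList alias and the bfs signature are assumptions; nothing in it depends on the sign of the IDs.

```cpp
#include <cassert>
#include <queue>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Adjacency list keyed by node ID; any int, including negatives, is a valid key.
using AdjacencyList = std::unordered_map<int, std::vector<int>>;

// Returns the node IDs in the order BFS visits them, starting from `start`.
std::vector<int> bfs(const AdjacencyList& graph, int start) {
    std::vector<int> order;
    std::unordered_set<int> visited{start};  // replaces bool visited[]
    std::queue<int> pending;
    pending.push(start);
    while (!pending.empty()) {
        const int cur = pending.front();
        pending.pop();
        order.push_back(cur);
        auto it = graph.find(cur);
        if (it == graph.end()) continue;  // node has no outgoing edges
        for (int next : it->second) {
            if (visited.insert(next).second) {  // true only on first visit
                pending.push(next);
            }
        }
    }
    assert(!order.empty());  // the start node is always visited
    return order;
}
```

For dense IDs with a known minimum, the same loop works with a std::vector<bool> indexed by nodeID - minNodeID in place of the set.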

How to convert a struct property to a pointer reference in C++?

I have a DAG in a JSON format, where each node is an entry: it has a name, and two arrays. One array is for other nodes with arrows coming into it, another array for nodes that this node is directed towards (outgoing arrows).
So, for example:
{
    "id": "A",
    "connected_from": ["B", "C"],
    "connects_to": ["D", "E"]
}
And I have a collection of these nodes, that all together form a DAG.
I'd like to map the nodes to a struct to hold these nodes, where the id is simply a string, and I'd like the arrays to be vectors of pointers of this struct:
struct node {
    string id;
    vector<node*> connected_from;
    vector<node*> connected_to;
};
How do I convert the node entries as 'id' in the arrays of the JSON to a pointer to the correct struct holding that node?
One obvious approach is to build a map of key-value pairs, where key = id, value = the pointer to the struct with that id, and do a lookup - but is there a better way?
No, given only the information that you've provided there isn't a better way: you need to build a map.
However, for single-letter IDs the map can possibly take the form of a simple array with e.g. 26 entries for the English alphabet.
There's going to be some container object holding all the nodes (otherwise you're going to leak them). You could always scan over the container to find the nodes. But this will be inefficient: O(N^2) overall, while the map lookups will be O(N log N).
Though if you store the objects in sorted order in the container (or use a sorted container) you can reduce both cases to O(N log N).
The constants will be different though, so for a small graph the scan may be faster.
I think your suggestion is fine... Map from ID to node. It's simple, intuitive and fast enough for practical purposes. Considering the data is being parsed from JSON, your storage and lookups are not going to significantly impact performance. If you're really concerned, then implement a Dictionary to replace your map.
In general terms, I always advocate the simplest, cleanest approach that gets the job done. Too many people obsess about memory or performance hits in algorithms, when the actual bottleneck in their code lies elsewhere.
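The map-based approach is usually done in two passes: first create every node, then translate the id strings into pointers. The sketch below assumes the JSON has already been parsed into a hypothetical RawNode struct (no particular JSON library is implied) and uses an owning map so nothing leaks.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical result of JSON parsing: plain strings, no pointers yet.
struct RawNode {
    std::string id;
    std::vector<std::string> connected_from;
    std::vector<std::string> connects_to;
};

struct node {
    std::string id;
    std::vector<node*> connected_from;
    std::vector<node*> connected_to;
};

// Owns the nodes; the pointers inside each node are non-owning.
using NodeMap = std::unordered_map<std::string, std::unique_ptr<node>>;

NodeMap buildGraph(const std::vector<RawNode>& raw) {
    NodeMap nodes;
    // Pass 1: create every node so that each id has an address.
    for (const RawNode& r : raw) {
        assert(nodes.count(r.id) == 0 && "duplicate id in input");
        auto n = std::make_unique<node>();
        n->id = r.id;
        nodes[r.id] = std::move(n);
    }
    // Pass 2: translate id strings into pointers. at() throws
    // std::out_of_range if an entry references an unknown id.
    for (const RawNode& r : raw) {
        node* n = nodes.at(r.id).get();
        for (const std::string& id : r.connected_from) {
            n->connected_from.push_back(nodes.at(id).get());
        }
        for (const std::string& id : r.connects_to) {
            n->connected_to.push_back(nodes.at(id).get());
        }
    }
    return nodes;
}
```

The two passes are what make forward references work: an entry may name a node that appears later in the input, because every node already exists before any pointer is resolved.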

How can one create cyclic (and immutable) data structures in Clojure without extra indirection?

I need to represent directed graphs in Clojure. I'd like to represent each node in the graph as an object (probably a record) that includes a field called :edges that is a collection of the nodes that are directly reachable from the current node. Hopefully it goes without saying, but I would like these graphs to be immutable.
I can construct directed acyclic graphs with this approach as long as I do a topological sort and build each graph "from the leaves up".
This approach doesn't work for cyclic graphs, however. The one workaround I can think of is to have a separate collection (probably a map or vector) of all of the edges for an entire graph. The :edges field in each node would then have the key (or index) into the graph's collection of edges. Adding this extra level of indirection works because I can create keys (or indexes) before the things they (will) refer to exist, but it feels like a kludge. Not only do I need to do an extra lookup whenever I want to visit a neighboring node, but I also have to pass around the global edges collection, which feels very clumsy.
I've heard that some Lisps have a way of creating cyclic lists without resorting to mutation functions. Is there a way to create immutable cyclic data structures in Clojure?
You can wrap each node in a ref to give it a stable handle to point at (and allow you to modify the reference, which can start as nil). It is then possible to build cyclic graphs that way. This does have "extra" indirection of course.
I don't think this is a very good idea though. Your second idea is a more common implementation. We built something like this to hold an RDF graph and it is possible to build it out of the core data structures and layer indices over the top of it without too much effort.
I've been playing with this the last few days.
I first tried making each node hold a set of refs to edges, and each edge hold a set of refs to the nodes. I set them equal to each other in a (dosync... (ref-set...)) type of operation. I didn't like this because changing one node requires a large amount of updates, and printing out the graph was a bit tricky. I had to override the print-method multimethod so the repl wouldn't stack overflow. Also any time I wanted to add an edge to an existing node, I had to extract the actual node from the graph first, then do all sorts of edge updates and that sort of thing to make sure everyone was holding on to the most recent version of the other thing. Also, because things were in a ref, determining whether something was connected to something else was a linear-time operation, which seemed inelegant. I didn't get very far before determining that actually performing any useful algorithms with this method would be difficult.
Then I tried another approach which is a variation of the matrix referred to elsewhere. The graph is a clojure map, where the keys are the nodes (not refs to nodes), and the values are another map in which the keys are the neighboring nodes and single value of each key is the edge to that node, represented either as a numerical value indicating the strength of the edge, or an edge structure which I defined elsewhere.
It looks like this, sort of, for 1->2, 1->3, 2->5, 5->2
(def graph {node-1 {node-2 edge12, node-3 edge13},
            node-2 {node-5 edge25},
            node-3 nil ;;no edge leaves from node 3
            node-5 {node-2 edge52}}) ;; nodes 2 and 5 have an undirected edge
To access the neighbors of node-1 you go (keys (graph node-1)) or call the function defined elsewhere (neighbors graph node-1), or you can say ((graph node-1) node-2) to get the edge from 1->2.
Several advantages:
Constant time lookup of a node in the graph and of a neighboring node, or return nil if it doesn't exist.
Simple and flexible edge definition. A directed edge exists implicitly when you add a neighbor to a node entry in the map, and its value (or a structure for more information) is provided explicitly, or nil.
You don't have to look up the existing node to do anything to it. It's immutable, so you can define it once before adding it to the graph and then you don't have to chase it around getting the latest version when things change. If a connection in the graph changes, you change the graph structure, not the nodes/edges themselves.
This combines the best features of a matrix representation (the graph topology is in the graph map itself not encoded in the nodes and edges, constant time lookup, and non-mutating nodes and edges), and the adjacency-list (each node "has" a list of its neighboring nodes, space efficient since you don't have any "blanks" like a canonical sparse matrix).
You can have multiple edges between nodes, and if you accidentally define an edge which already exists exactly, the map structure takes care of making sure you are not duplicating it.
Node and edge identity is kept by clojure. I don't have to come up with any sort of indexing scheme or common reference point. The keys and values of the maps are the things they represent, not a lookup elsewhere or ref. Your node structure can be all nils, and as long as it's unique, it can be represented in the graph.
The only big-ish disadvantage I see is that for any given operation (add, remove, any algorithm), you can't just pass it a starting node. You have to pass the whole graph map and a starting node, which is probably a fair price to pay for the simplicity of the whole thing. Another minor disadvantage (or maybe not) is that for an undirected edge you have to define the edge in each direction. This is actually okay because sometimes an edge has a different value for each direction and this scheme allows you to do that.
The only other thing I see here is that because an edge is implicit in the existence of a key-value pair in the map, you cannot define a hyperedge (ie one which connects more than 2 nodes). I don't think this is a big deal necessarily since most graph algorithms I've come across (all?) only deal with an edge that connects 2 nodes.
I ran into this challenge before and concluded that it isn't possible using truly immutable data structures in Clojure at present.
However you may find one or more of the following options acceptable:
Use deftype with ":unsynchronized-mutable" to create a mutable :edges field in each node that you change only once during construction. You can treat it as read-only from then on, with no extra indirection overhead. This approach will probably have the best performance but is a bit of a hack.
Use an atom to implement :edges. There is a bit of extra indirection, but I've personally found reading atoms to be extremely efficient.

effective C++ data structure to consider in this case

Greetings code-gurus!
I am writing an algorithm that connects, for instance node_A of Region_A with node_D of Region_D. (node_A and node_D are just integers). There could be 100k+ such nodes.
Assume that the line segment between A and D passes through a number of other regions, B, C, Z . There will be a maximum of 20 regions in between these two nodes.
Each region has its own properties that may vary according to the connection A-D. I want to access these at a later point of time.
I am looking for a good data structure (perhaps an STL container) that can hold this information for a particular connection.
For example, for connection A - D I want to store :
node_A,
node_D,
crosssectional area (computed elsewhere) ,
regionB,
regionB_thickness,
regionB other properties,
regionC, ....
The data can be double , int , string and could also be an array /vector etc.
First I considered creating structs or classes for regionB, regionC etc .
But, for each connection A-D, certain properties like thickness of the region through which this connection passes are different.
There will only be 3 or 4 different things I need to store pertaining to a region.
Which data structure should I consider here (any STL container like vector?) Could you please suggest one? (would appreciate a code snippet)
To access a connection between nodes A-D, I want to make use of int node_A (an index).
This probably means I need to use a hashmap or similar data structure.
Can anyone please suggest a good data structure in C++ that can efficiently
hold this sort of data for connection A -D described above? (would appreciate a code snippet)
thank you!
UPDATE
For some reasons, I cannot make use of packages like Boost, so I want to know if I can use anything from the STL.
You should try to group stuff together when you can. You can group the information on each region together with something like the following:
class Region_Info {
Region *ptr;
int thickness;
// Put other properties here.
};
Then, you can more easily create a data structure for your line segment, maybe something like the following:
class Line_Segment {
    int node_A;
    int node_D;
    int crosssectional_area;
    std::list<Region_Info> regions;  // regions crossed between node_A and node_D
};
If you are limited to only 20 regions, then a list should work fine. A vector is also fine if you would prefer.
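Since the question asks for lookup by node_A using only the standard library, one way to combine the grouping above with that requirement is sketched here. All names (RegionInfo, Connection, ConnectionTable) and field types are assumptions; the thickness is stored per connection because the question says it varies with the connection.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Per-connection view of one crossed region; field names are illustrative.
struct RegionInfo {
    int region_id;
    double thickness;
    std::string note;  // stand-in for "other properties"
};

struct Connection {
    int node_a;
    int node_d;
    double cross_sectional_area;
    std::vector<RegionInfo> regions;  // regions crossed, in order
};

class ConnectionTable {
public:
    void add(Connection c) {
        // The question states at most 20 regions between two nodes.
        assert(c.regions.size() <= 20);
        const int key = c.node_a;
        byNodeA_[key].push_back(std::move(c));
    }

    // All connections that start at node_a; empty if there are none.
    const std::vector<Connection>& from(int node_a) const {
        static const std::vector<Connection> none;
        auto it = byNodeA_.find(node_a);
        return it == byNodeA_.end() ? none : it->second;
    }

private:
    std::unordered_map<int, std::vector<Connection>> byNodeA_;
};
```

Each node_A maps to a vector because one node may take part in several connections; with 100k+ nodes, the hash map keeps the lookup by node_A close to constant time on average.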
Have you considered an adjacency array for each node, which stores the nodes it is connected to, along with other data?
First, define a node
class Node
{
    int id_;
    std::vector<AdjacencyInfo> adjacency_;
};
Where the class AdjacencyInfo can store the myriad data which you need. You can change the vector to a hash map with the node id as the key if lookup speed is an issue. For fancy access you may want to overload the [] operator if it is an essential requirement.
So as an example
class Graph
{
    std::map<int, Node> graph_;
};
boost has a graph library: boost.graph. Check it out if it is useful in your case.
Well, as everyone else has noticed, that's a graph. The question is, is it a sparse graph, or a dense one? There are generally two ways of representing graphs (more, but you'll probably only need to consider these two) :
adjacency matrix
adjacency list
An adjacency matrix is basically an NxN matrix whose rows and columns are indexed by the nodes and whose cells hold the connection data (edges), so you can index edges by vertices. Sorry if my English sucks, not my native language. Anyway, you should only consider an adjacency matrix if you have a dense graph and need to find node->edge->node connections really fast. However, iterating through neighbours or adding/removing vertices in an adjacency matrix is slow, the first requiring N iterations, and the second resizing the array/vector you use to store the matrix.
Your other option is to use an adjacency list. Basically, you have a class that represents a node, and one that represents an edge, that stores all the data for that edge, and two pointers that point to the nodes it's connected to. The node class has a collection of some sort (a list will do), and keeps track of all the edges it's connected to. Then you'll need a manager class, or simply a bunch of functions that operate on your nodes. Adding/connecting nodes is trivial in this case as is listing neighbours or connected edges. However, it's harder to iterate over all the edges. This structure is more flexible than the adjacency matrix and it's better for sparse graphs.
I'm not sure that I understood your question completely, but if I did, I think you'd be better off with an adjacency matrix, seems like you have a dense graph with lots of interconnected nodes and only need connection info.
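To compare the two representations side by side, here is a minimal sketch with the same interface for both. It assumes node indexes 0..N-1 and stores only the presence of an edge; the class names and methods are made up for the example, and the list variant stores neighbour indexes directly instead of separate edge objects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjacency matrix: O(1) edge test, O(N) neighbour scan, N*N cells of storage.
class MatrixGraph {
public:
    explicit MatrixGraph(std::size_t n) : n_(n), cells_(n * n, false) {}
    void addEdge(std::size_t u, std::size_t v) {
        assert(u < n_ && v < n_);
        cells_[u * n_ + v] = true;
    }
    bool hasEdge(std::size_t u, std::size_t v) const {
        assert(u < n_ && v < n_);
        return cells_[u * n_ + v];
    }
    std::vector<std::size_t> neighbours(std::size_t u) const {
        std::vector<std::size_t> out;
        for (std::size_t v = 0; v < n_; ++v) {
            if (cells_[u * n_ + v]) out.push_back(v);
        }
        return out;
    }

private:
    std::size_t n_;
    std::vector<bool> cells_;  // row-major: cell (u, v) is at u * n_ + v
};

// Adjacency list: storage proportional to the number of edges, neighbours
// are available directly, but an edge test scans one node's list.
class ListGraph {
public:
    explicit ListGraph(std::size_t n) : adj_(n) {}
    void addEdge(std::size_t u, std::size_t v) {
        assert(u < adj_.size() && v < adj_.size());
        adj_[u].push_back(v);
    }
    bool hasEdge(std::size_t u, std::size_t v) const {
        for (std::size_t w : adj_[u]) {
            if (w == v) return true;
        }
        return false;
    }
    const std::vector<std::size_t>& neighbours(std::size_t u) const {
        return adj_[u];
    }

private:
    std::vector<std::vector<std::size_t>> adj_;
};
```

With 100k+ nodes the matrix would need on the order of 10^10 cells, which is why the choice hinges on whether the graph is really dense.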
Wikipedia has a good article on graphs as a data structure, as well as good references and links, and finding examples shouldn't be hard. Hope this helps.