Suitable data structure for large graphs


I have a large directed graph and I am implementing Dijkstra's algorithm in C++. The adjacency matrix of my graph does not fit in main memory. Is there any data structure other than an adjacency list or an adjacency matrix, in the C++ STL or elsewhere, that I can employ for such a large graph?
I have seen the previous posts, but I am searching for a data structure that is suitable with respect to Dijkstra's algorithm.
By large I mean a graph containing more than 100 million nodes and edges.

It's common to represent adjacency lists as lists of integers, where each integer is the index of a node. How about getting some more space efficiency by instead treating the adjacency list as a bit string 00010111000... where a 1 in the nth position represents an edge between this node and node n? Then compress the bit string with some standard algorithm, and uncompress it as you need it. The bit strings will probably compress pretty well, so this buys space efficiency at the price of higher computational cost.
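One possible sketch of the idea, under my own assumptions: instead of materialising the bit string and running a general-purpose compressor, store the gaps between consecutive 1 positions as variable-length integers. That is a swapped-in technique (gap plus varint encoding), not something the answer prescribes, and the names encodeNeighbours and decodeNeighbours are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Encode a sorted, duplicate-free neighbour list as gaps between consecutive
// indices. Each gap is written 7 bits at a time, low bits first; the top bit
// of a byte says "more bytes follow".
std::vector<std::uint8_t> encodeNeighbours(const std::vector<std::uint32_t>& sorted) {
    std::vector<std::uint8_t> out;
    std::uint32_t prev = 0;
    for (std::uint32_t v : sorted) {
        std::uint32_t gap = v - prev;  // the first gap is the index itself
        prev = v;
        while (gap >= 0x80) {
            out.push_back(static_cast<std::uint8_t>((gap & 0x7F) | 0x80));
            gap >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(gap));
    }
    return out;
}

// Decode the byte stream back into the sorted neighbour list.
std::vector<std::uint32_t> decodeNeighbours(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> out;
    std::uint32_t prev = 0;
    std::uint32_t gap = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        gap |= static_cast<std::uint32_t>(b & 0x7F) << shift;
        if (b & 0x80) {
            shift += 7;           // continuation byte: keep accumulating
        } else {
            prev += gap;          // last byte of this gap: emit the neighbour
            out.push_back(prev);
            gap = 0;
            shift = 0;
        }
    }
    return out;
}
```

The bit string 00010111000... from the text corresponds to the neighbour list {3, 5, 6, 7}, which this encoding stores in four bytes. Dijkstra only ever scans one node's neighbours at a time, so decoding a single list on demand keeps the working set small.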

Related

Efficient Transitive Reduction of Adjacency List DAG

I have a large directed acyclic graph, and I'd like to compute the transitive reduction of that graph.
I'm currently computing the transitive reduction using a naive depth-first search, but that algorithm is too slow for my use case. However, the efficient algorithms I've been able to find work on an adjacency matrix representation, whereas my representation is roughly equivalent to an adjacency list. (It's actually represented as a set of C++ objects, each with pointers to their children and parents).
I obviously could transform my DAG into an adjacency matrix, do the reduction, and transform it back; but that seems a bit wasteful, and I'd like a simpler algorithm if possible.
My graph contains ~100,000 nodes.

Merits of implementing tree by array over pointer?

Recently I learnt to implement a tree with a struct such as
struct Node {
    Node *parent;
    vector<Node*> child;
    Node() : parent(nullptr) {}
};
I thought this was a pretty straightforward way to implement a tree, and it also makes it easier to keep more data for processing within the struct.
However, I noticed that in many people's code they prefer using an array instead of pointers.
I can understand this for a binary tree, as it is pretty easy to do with an array too, but why for other, more complex graphs?

From Skiena:2008:ADM:1410219 and Cormen:2001:IA:580470, comparing adjacency matrices and adjacency lists for a graph with n nodes and m edges yields that:
Adjacency matrices are faster for testing whether two nodes x and y have a connecting edge (x, y).
Adjacency lists are faster for finding the degree (the number of neighbours) of a given node.
Adjacency lists consume Θ(n + m) space, compared to Θ(n^2) for adjacency matrices.
Adjacency matrices use slightly less memory for dense graphs.
Edge insertion/deletion performs in O(1) when using adjacency matrices.
Traversing a graph implemented using adjacency lists performs in Θ(m + n), and with matrices in Θ(n^2).
If a graph has many vertices but few edges, adjacency matrices consume excessive memory.
So in general adjacency lists perform better.
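To make the trade-offs concrete, here is a minimal sketch of both representations for a directed graph, both array-based rather than pointer-based. The struct names MatrixGraph and ListGraph are made up for illustration and do not come from either book.

```cpp
#include <cstddef>
#include <vector>

// Adjacency matrix: n*n cells, O(1) edge test and insertion,
// but memory grows with n^2 regardless of the number of edges.
struct MatrixGraph {
    std::size_t n;
    std::vector<bool> cell;  // row-major, n*n entries
    explicit MatrixGraph(std::size_t nodes) : n(nodes), cell(nodes * nodes, false) {}
    void addEdge(std::size_t x, std::size_t y) { cell[x * n + y] = true; }
    bool hasEdge(std::size_t x, std::size_t y) const { return cell[x * n + y]; }
};

// Adjacency list: n + m entries, out-degree in O(1),
// but an edge test has to scan the list of x.
struct ListGraph {
    std::vector<std::vector<std::size_t>> adj;
    explicit ListGraph(std::size_t nodes) : adj(nodes) {}
    void addEdge(std::size_t x, std::size_t y) { adj[x].push_back(y); }
    std::size_t degree(std::size_t x) const { return adj[x].size(); }
    bool hasEdge(std::size_t x, std::size_t y) const {
        for (std::size_t v : adj[x]) {
            if (v == y) return true;
        }
        return false;
    }
};
```

Both keep their data in contiguous vectors indexed by node number, which is probably part of why array-based code is so common: there is no per-node allocation and no pointer to chase per edge.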

Does the time complexity of Dijkstra's algorithm for shortest paths depend on the data structure used?

One way to store the graph is to implement nodes as structures, like
struct node {
int vertex; node* next;
};
where vertex stores the vertex number and next contains link to the other node.
Another way I can think of is to implement it as vectors, like
vector< vector< pair<int,int> > > G;
Now, when applying Dijkstra's algorithm for shortest paths, we need to build a priority queue and the other required data structures in the first case, and likewise in the second case (the vector implementation).
Will there be any difference in complexity between these two ways of representing the graph? Which one is preferable?
EDIT:
In the first case, every node is associated with a linked list of the nodes that are directly reachable from it. In the second case,
G.size() is the number of vertices in our graph
G[i].size() is the number of vertices directly reachable from vertex with index i
G[i][j].first is the index of j-th vertex reachable from vertex i
G[i][j].second is the length of the edge heading from vertex i to vertex G[i][j].first

Both are adjacency list representations. If implemented correctly, that would be expected to result in the same time complexity. You'd get a different time complexity if you use an adjacency matrix representation.
In more detail - this comes down to the difference between an array (vector) and a linked-list. When all you're doing is iterating through the entire collection (i.e. the neighbours of a vertex), as you do in Dijkstra's algorithm, this takes linear time (O(n)) regardless of whether you're using an array or linked-list.
The resulting complexity for running Dijkstra's algorithm, as noted on Wikipedia, would be O(|E| log |V|) with a binary heap in either case.
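As a rough sketch of the vector variant, assuming the layout from the question (G[i][j].first is the target vertex, G[i][j].second is the edge length) and non-negative lengths. The function name dijkstra and the Graph alias are my own, and this is the common "lazy deletion" version with std::priority_queue rather than a decrease-key heap.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// G[i][j] = {index of the j-th vertex reachable from i, length of that edge}
using Graph = std::vector<std::vector<std::pair<int, int>>>;

// Returns the shortest distance from source to every vertex;
// unreachable vertices keep the maximum long long value.
std::vector<long long> dijkstra(const Graph& G, int source) {
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(G.size(), INF);
    using Item = std::pair<long long, int>;  // {distance, vertex}
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;  // min-heap

    dist[source] = 0;
    pq.push(Item(0, source));
    while (!pq.empty()) {
        const Item top = pq.top();
        pq.pop();
        const long long d = top.first;
        const int u = top.second;
        if (d > dist[u]) continue;  // stale entry, a shorter path was already found
        for (const std::pair<int, int>& edge : G[u]) {
            const int v = edge.first;
            const long long candidate = d + edge.second;
            if (candidate < dist[v]) {
                dist[v] = candidate;
                pq.push(Item(candidate, v));
            }
        }
    }
    return dist;
}
```

Swapping the inner loop for a walk over a linked list of node structs would not change the asymptotic cost, since each edge is still visited once; in practice the contiguous vector tends to be friendlier to the cache.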

Fastest way to run Prim's on a growing range of coordinates

I was hoping someone could give me a general method for computing the MST for a problem that works from input that is formatted as such:
<number of vertices>
<x> <y>
<x> <y>
...
I understand how to implement Prim's algorithm, but I am looking for a method that (using Prim's algorithm) requires the least amount of memory/time to execute. Should I store everything in an adjacency matrix? If the number of vertices grows to, say, 10,000, what is the optimal way to solve this problem (assuming Prim's is used)?

Do you really need to use Prim's?
A simple way is to use Kruskal's algorithm to recompute the spanning tree (using only previously selected edges) every time you add a node. Kruskal is O(E log E), and in every iteration you'll have exactly 2*V-1 edges to consider (V-1 from the previous tree + V from the newly added node), so you'll need O(V log V) for each insertion.

Prim's algorithm is faster if you have a dense graph (a graph that has a lot of edges). If you use an adjacency matrix, the complexity of Prim's algorithm would be O(|V|^2).
For sparser graphs this can be improved by using a binary heap data structure with the graph represented by an adjacency list. Using this method, the complexity would be O(|E|log|V|).
Using a Fibonacci heap data structure with an adjacency list would be even faster, with a complexity of O(|E| + |V|log|V|).
Note: E refers to the number of edges in the graph, while V refers to the number of vertices in the graph.
The STL already provides a binary heap data structure, std::priority_queue. A std::priority_queue calls the heap algorithms in the algorithm library. You could also use a std::vector (or any other container that has random access iterators) and call make_heap, push_heap, pop_heap, etc. These are all in the algorithm library. More info here: http://www.cplusplus.com/reference/algorithm/.
You could also implement your own heap data structure, but that may be too complicated and not worth the performance benefits.
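For the coordinate input in the question, every pair of points is a potential edge, so the graph is dense and the O(|V|^2) variant applies. A minimal sketch follows, assuming the edge weight is the Euclidean distance between two points (the question does not say so explicitly). It computes distances on the fly instead of storing an adjacency matrix, which is my own choice to keep memory at O(|V|); primMstWeight is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Array-based Prim's algorithm over a complete graph of 2D points.
// Returns the total weight of the minimum spanning tree.
double primMstWeight(const std::vector<std::pair<double, double>>& pts) {
    const std::size_t n = pts.size();
    if (n == 0) return 0.0;
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> best(n, INF);    // cheapest known edge into the tree
    std::vector<bool> inTree(n, false);
    best[0] = 0.0;                       // start from the first point
    double total = 0.0;

    for (std::size_t step = 0; step < n; ++step) {
        // Pick the cheapest vertex that is not yet in the tree: O(V).
        std::size_t u = n;
        for (std::size_t i = 0; i < n; ++i) {
            if (!inTree[i] && (u == n || best[i] < best[u])) u = i;
        }
        inTree[u] = true;
        total += best[u];
        // Relax the edges from u to every remaining vertex: O(V).
        for (std::size_t v = 0; v < n; ++v) {
            if (inTree[v]) continue;
            const double d = std::hypot(pts[u].first - pts[v].first,
                                        pts[u].second - pts[v].second);
            if (d < best[v]) best[v] = d;
        }
    }
    return total;
}
```

With 10,000 vertices this is on the order of 10^8 distance computations and only two small arrays, whereas a full adjacency matrix of doubles would need roughly 800 MB.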

Finding edge in weighted graph

I have a graph with four nodes, each node represents a position and they are laid out like a two dimensional grid. Every node has a connection (an edge) to all (according to the position) adjacent nodes. Every edge also has a weight.
Here are the nodes represented by A,B,C,D and the weight of the edges is indicated by the numbers:
A 100 B
120 220
C 150 D
I want to structure a container and an algorithm that switches the nodes sharing the edge with the highest weight. Then reset the weight of that edge. No node (position) can be switched more than once each time the algorithm is executed.
For example, processing the above, the highest weight is on edge BD, so we switch those. Since no node can be switched more than once, all edges involving either B or D are reset.
A D
120
C B
Then, the next highest weight is on the only edge left, switching those would give us the final layout: C,D,A,B.
I'm currently running a quite awful implementation of this. I store a long list of edges, holding four values for the nodes they are (potentially) connected to, a value for its weight and the position for the node itself. Every time anything is requested, I loop through the entire list.
I'm writing this in C++, could some parts of the STL help speed this up? Also, how to avoid the duplication of data? A node position is currently in five objects. The node itself that is there and the four nodes indicating a connection to it.
In short, I want help with:
Can this be structured in a way so that there is no data duplication?
Recognise the problem? If any of this has a name, tell me so I can google for more info on the subject.
Fast algorithms are always nice.

As for names, this is a vertex cover problem. Optimal vertex cover is NP-hard with decent approximation solutions, but your problem is simpler. You're looking at a pseudo-maximum under a tighter edge selection criterion. Specifically, once an edge is selected every connected edge is removed (representing the removal of vertices to be swapped).
For example, here's a standard greedy approach:
0) sort the edges; retain adjacency information
while edges remain:
1) select the highest edge
2) remove all adjacent edges from the list
endwhile
The list of edges selected gives you the vertices to swap.
Time complexity is O(sorting edges + linear pass over edges), which in general will boil down to O(sorting edges), which will likely be O(E*log(E)). On a grid, where the number of edges is proportional to the number of vertices, that is O(V*log(V)).
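The greedy loop above could be sketched roughly like this. Rather than physically removing adjacent edges from the list, it marks both endpoints of a selected edge as used and skips any later edge that touches a used vertex, which has the same effect. The Edge struct and the selectSwaps name are hypothetical, and vertices are assumed to be numbered 0..nodeCount-1.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Edge {
    int u;
    int v;
    int weight;
};

// Returns the edges whose endpoints should be swapped, heaviest first.
// No vertex appears in more than one returned edge.
std::vector<Edge> selectSwaps(std::vector<Edge> edges, std::size_t nodeCount) {
    // 0) sort the edges by descending weight
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.weight > b.weight; });

    std::vector<bool> used(nodeCount, false);
    std::vector<Edge> chosen;
    for (const Edge& e : edges) {
        // 2) an edge adjacent to an already selected edge counts as removed
        if (used[e.u] || used[e.v]) continue;
        // 1) otherwise it is the highest remaining edge: select it
        used[e.u] = true;
        used[e.v] = true;
        chosen.push_back(e);
    }
    return chosen;
}
```

With A, B, C, D numbered 0 to 3 and the weights from the question (AB 100, AC 120, BD 220, CD 150), this selects BD first and then AC, matching the walk-through in the question.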
The method of retaining adjacency information depends on the graph properties; see your friendly local algorithms text. Feel free to start with an adjacency matrix for simplicity.
As with the adjacency information, most other speed improvements will apply best to graphs of a certain shape but come with a tradeoff of time versus space complexity.
For example, your problem statement seems to imply that the vertices are laid out in a square pattern, from which we could derive many interesting properties. For example, that system is very easily parallelized. Also, the adjacency information would be highly regular but sparse at large graph sizes (most vertices wouldn't be connected to each other). This makes the adjacency matrix give a high overhead; you could instead store adjacency in an array of 4-tuples as it would retain fast access but almost entirely eliminate overhead.

If you have bigger graphs, look into the Boost Graph Library. It gives you good data structures for graphs and basic iterators for different types of graph traversal.
