Hashmap to implement adjacency lists - c++

I've implemented an adjacency list using the vector of vectors approach, where the nth element of the outer vector refers to the friend list of node n.
I was wondering whether the hash map data structure would be more useful. I still have hesitations, because I cannot quite pin down the difference between them: for example, if I want to perform an operation (search, delete) on the nth element's neighbours, how could a hash map be more efficient than the vector of vectors approach?

A vector<vector<ID>> is a good approach if the set of nodes is fixed. If, however, you suddenly decide to remove a node, you'll be annoyed: you cannot shrink the vector, because that would displace the elements stored after the node and you would lose your references. On the other hand, if you keep a list of free (reusable) IDs on the side, you can just "nullify" the slot and reuse it later. Very efficient.
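A minimal sketch of that free-list scheme (type and function names are mine, not from the answer):

#include <cstdint>
#include <vector>

using ID = std::uint32_t;

struct Graph {
    std::vector<std::vector<ID>> adj; // adj[n] is the friend list of node n
    std::vector<ID> free_ids;         // nullified slots, ready for reuse

    ID add_node() {
        if (!free_ids.empty()) {      // reuse a freed slot if one exists
            ID id = free_ids.back();
            free_ids.pop_back();
            return id;
        }
        adj.emplace_back();
        return static_cast<ID>(adj.size() - 1);
    }

    void remove_node(ID id) {         // edges pointing *to* id elsewhere
        adj[id].clear();              // must be cleaned up by the caller
        free_ids.push_back(id);
    }
};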
An unordered_map<ID, vector<ID>> allows you to delete nodes much more easily. You can go ahead and assign new IDs to newly created nodes, and you will not lose empty slots. It is not as compact, especially with collisions, but not so bad either. With older compilers there can be some slowdowns on rehashing, when a vector needs to be moved.
Finally, an unordered_multimap<ID, ID> is probably one of the easiest to manage. It also scatters memory to the wind, but hey :)
Personally, I would start prototyping with an unordered_multimap<ID, ID> and switch to another representation only if it proves too slow for my needs.
Note: you can cut the number of stored edges in half if the adjacency relationship is symmetric, by establishing that the relation (x, y) is stored for min(x, y) only.
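To illustrate the multimap representation recommended above, a short sketch (the symmetric trick from the note would simply key every insertion by std::min(x, y)):

#include <cstdint>
#include <unordered_map>

using ID = std::uint32_t;

int main() {
    std::unordered_multimap<ID, ID> edges;

    edges.emplace(1, 2); // add edge 1 -> 2
    edges.emplace(1, 3); // add edge 1 -> 3

    // Iterate over the neighbours of node 1.
    for (auto [it, end] = edges.equal_range(1); it != end; ++it) {
        ID neighbour = it->second;
        (void)neighbour; // process the neighbour here
    }

    // Delete the specific edge (1, 3).
    for (auto [it, end] = edges.equal_range(1); it != end; ++it)
        if (it->second == 3) { edges.erase(it); break; }
}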

Vector of vectors
Vector of vectors is a good solution when you don't need to delete edges.
You can add an edge in O(1), and you can iterate over a node's neighbours in O(number of neighbours).
You can delete an edge from vector[node] with erase, but it will be slow: erase needs an iterator, so you must first find the edge, and the whole operation is linear in the node's degree (which can be as large as the number of vertices). See the sketch below.
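In real code that deletion might look like this (a sketch; adj is the vector of vectors):

#include <algorithm>
#include <vector>

// Remove `edge` from `node`'s neighbour list: O(degree), because of the
// linear find plus the element shifting done by erase.
void remove_edge(std::vector<std::vector<int>>& adj, int node, int edge) {
    auto& nbrs = adj[node];
    auto it = std::find(nbrs.begin(), nbrs.end(), edge);
    if (it != nbrs.end())
        nbrs.erase(it);
}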
Hash map
I am not sure how you want to use a hash map. If inserting an edge means setting hash_map[edge] = 1, then notice that you are unable to iterate over a node's neighbours efficiently.
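For instance, with an edge-keyed set like the following (packing the endpoint pair into one 64-bit key is my own illustrative choice), the existence test is O(1), but listing a node's neighbours means scanning every edge:

#include <cstdint>
#include <unordered_set>

using ID = std::uint32_t;

// Pack a directed edge (u, v) into a single 64-bit key.
inline std::uint64_t edge_key(ID u, ID v) {
    return (std::uint64_t(u) << 32) | v;
}

std::unordered_set<std::uint64_t> edges;

// O(1) on average: "does edge (u, v) exist?"
bool has_edge(ID u, ID v) { return edges.count(edge_key(u, v)) != 0; }
// But "list all neighbours of u" now requires a full scan of `edges`.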

Related

Is this a bad way to implement a graph?

Almost every book I've looked at only shows examples of implementing graphs with the adjacency matrix method or the adjacency list method.
I'm trying to create a node-based editor, where the number of edges fanning out of each node is small, but there are a lot of vertices.
So I'm trying to implement the adjacency list method rather than the adjacency matrix method.
However, adjacency lists usually store each edge in a linked list.
Instead, I would like to use nodes in the form listed below.
class GraphNode
{
    int x, y;
    dataType data;
    std::vector<GraphNode*> in;
    std::vector<GraphNode*> out;
public:
    GraphNode(/* ... */); // constructor; note that "= 0" is only valid on virtual member functions
};
Like this, I want each node to act as a vertex and to have access to the other nodes it is connected to.
This is because, when I create a node-based editor program, I have to access and process the different nodes connected to each node.
I want to implement this without using a linked list.
And I'm going to use graph algorithms on this structure.
Is this a bad method?
Lastly, I apologize for my poor English.
Thank you for reading.
You're just missing the point of the difference between an adjacency list and an adjacency matrix. The main point is the complexity of operations, like finding edges or iterating over them. If you compare a std::list and a std::vector as the datatype implementing an adjacency list, both have a complexity of O(n) (n being the number of edges) for these operations, so they are equivalent.
Other considerations:
If you're modifying the graph, insertion and deletion may be relevant as well. In that case, you may prefer a linked list.
I said that the two are equivalent, but generally std::vector has a better locality of reference and less memory overhead, so it performs better. The general rule in C++ is to use std::vector for any sequential container, until profiling has shown that it is a bottleneck.
Short answer: It is probably a reasonable way for implementing a graph.
Long answer: What graph data structure to use always depends on what you want to use it for. An adjacency matrix is good for very dense graphs, where it will not waste space due to many 0 entries, and when we want to answer the question "Is there an edge between A and B?" fast. Iterating over all neighbours of a node can take pretty long, since it has to look at a whole row and not just the neighbours.
An adjacency list is good for sparse graphs and when we mostly want to look up all neighbours of a node (which is very often the case for graph traversal algorithms). In a directed graph where we want to treat incoming and outgoing edges separately, it is probably a good idea to have a separate adjacency list for incoming and outgoing edges (as your code does).
Regarding what container to use for the list, it depends on the use case. If you iterate over the graph much more often than you delete something from it, using a vector over a list is a very good idea (basically all graph programs I ever wrote were of this type). If you have a graph that changes very often, where you have to delete edges very often, where you don't want iterator invalidation, and so on, it may be better to have a list. But that is very seldom the case.
A good design would be to make it very easy to change between list and vector so you can easily profile both and then use what is better for your program.
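One simple way to keep that switch easy (a sketch; the alias name is mine) is to hide the container behind a template alias and stick to the interface the two containers share:

#include <list>
#include <vector>

// Swap this alias between std::vector and std::list, re-profile, and the
// rest of the graph code stays unchanged (as long as it avoids
// vector-only operations such as operator[]).
template <typename T>
using AdjacencyList = std::vector<T>; // or: std::list<T>

struct Node {
    AdjacencyList<Node*> in;
    AdjacencyList<Node*> out;
};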
Btw, if you often delete single edges, this is also pretty easily done fast with a vector, provided you do not care about the order of the edges in the adjacency list (so do not do this without thinking while iterating over the vector):
void delete_in_edge(size_t i) {
    std::swap(in[i], in.back()); // The element to be deleted is now at the last position,
                                 // the formerly last element is at position i
    in.pop_back();               // Delete the current last element
}
This has O(1) complexity (and the swap is probably pretty fast).

what are the containers in C++ STL to store a small number of integer values and find them in O(1)

Suppose I want to create a vector of vectors to store/find the edges between the nodes in a graph. Many nodes in the graph don't have any edges, and I don't want to store them; e.g. there are 2 million nodes, 1.5 million of which don't have any edges.
Moreover, each node that I do save could have from one to a couple hundred edges.
After I have saved all the edges, I want to remove the ones that do not exist in both directions. So, if edge (i,j) exists but edge (j,i) doesn't, I want to erase (i,j).
I used "vector of vectors" to communicate what I want to create, and I know it doesn't scale, as it would be completely dense.
So, using the vector of vectors format, I start going through the first-level vector (suppose its index is i), and for each item j in its second-level vector, I need to check whether i appears in the second-level vector at index j. I need this check to be fast, preferably constant time, something like a hash table; I think std::set might help.
At this point, if the other edge (j,i) does not exist, I need to remove the current edge (i,j).
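For illustration, here is a minimal sketch of that pass using unordered containers (one possible container choice, not a settled answer; names are mine):

#include <unordered_map>
#include <unordered_set>

// adj[i] holds the targets of edges going out of i; nodes with no
// edges simply have no entry.
void keep_mutual_edges(std::unordered_map<int, std::unordered_set<int>>& adj) {
    for (auto& [i, targets] : adj) {
        for (auto it = targets.begin(); it != targets.end(); ) {
            auto rev = adj.find(*it); // neighbours of j = *it
            bool mutual = rev != adj.end() && rev->second.count(i) != 0;
            if (mutual) ++it;
            else        it = targets.erase(it); // drop (i, j): no (j, i)
        }
    }
}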
What would be a good container for my scenario?

construct boost priority queue based on iterators

I have a list of binomial_heaps, and in each iteration of the algorithm I have to update the priority of an element in some of the binomial_heaps. For this I use the update function of the boost binomial_heap. However, one of the binomial_heaps I have to remove and rebuild completely, as all of its priorities change. Instead of using push every time (which, if I understand correctly, would have a complexity of n*log(n)), I would like to construct it from iterators of an underlying container (a kind of heapify or make_heap operation, which would be linear time). This seems possible with the standard priority_queue but not with the boost implementation; on the other hand, the standard one does not provide an update function. Is there a way I can have both, or another library that supports both? Or is my reasoning, that pushing all elements onto an empty priority queue is slower, not correct?
Some might say there is something seriously wrong with the fact that I need to rebuild an entire priority queue, which would make the use of a priority queue completely superfluous. The algorithm I want to implement is "Finding community structure in very large networks" by Aaron Clauset, in which the authors do exactly that (unless I have misinterpreted it).
(Sorry couldn't post the link to the paper as I don't have enough reputation to post more than 2 links)
The "fast modularity" algorithm by Clauset et al. (paper here, code here) uses a pair of linked data structures. On the one hand, you have a sparse matrix data structure (which is really just an adjacency list in which instead of storing the elements hanging off a particular array element as a linked list, we store them using a balanced binary tree data structure), and a max-heap. All the values in the sparse matrix (which are really the dQ_ij values for the potential merges in the algorithm) are also stored in the max-heap.
So, the max-heap is just an efficient way of finding the edge in the sparse matrix with the most positive value. Once you have the ij pair for that edge, you want to "insert" the elements of column (row) i into the elements of column (row) j, and then you want to delete column (row) i. So, you're not going to rebuild the entire max-heap after each pop from the max-heap. Instead, you want to delete some elements from it (the ones in the row/column that you delete from the sparse matrix) and update the values of others (the ones in the updated row/column for j).
This is where the linked data structure is helpful -- in the original implementation, each element in the sparse matrix stores a pointer to its corresponding entry in the max-heap so that if you update the value in the sparse matrix, you can then find the corresponding element in the max-heap and update its value. Once you do this, you need to re-heapify the updated heap element, by letting it move (recursively) up or down in the heap. Similarly, if you delete an element in the sparse matrix, you can find its entry in the heap and call a delete function on it.
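With boost::heap, that linkage is exactly what the mutable handles give you. A minimal sketch (in the real structure the handle would live inside the sparse-matrix entry):

#include <boost/heap/binomial_heap.hpp>

using Heap = boost::heap::binomial_heap<double>; // max-heap by default

int main() {
    Heap heap;
    // Each sparse-matrix entry keeps the handle of its heap element.
    Heap::handle_type h = heap.push(0.5); // insert a dQ value, keep the handle
    heap.update(h, 0.7);                  // dQ changed: re-heapify that element
    heap.erase(h);                        // its row/column was deleted: drop it
}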

Graph memory implementation

The two ways commonly used to represent a graph in memory are to use either an adjacency list or an adjacency matrix. An adjacency list is implemented using an array of pointers to linked lists. Is there any reason that that is faster than using a vector of vectors? I feel like it should make searching and traversals faster because backtracking would be a lot simpler.
The vector of linked adjacencies is a favorite textbook meme with many variations in practice. Certainly you can use vectors of vectors. What are the differences?
One is that links (double ones anyway) allow edges to be easily added and deleted in constant time. This obviously is important only when the edge set shrinks as well as grows. With vectors for edges, any individual operation may require O(k) where k is the incident edge count.
NB: If the order of edges in adjacency lists is unimportant for your application, you can easily get O(1) deletions with vectors. Just copy the last element to the position of the one to be deleted, then delete the last! Alas, there are many cases (e.g. where you're worried about embedding in the plane) when order of adjacencies is important.
Even if order must be maintained, you can arrange for copying costs to amortize to an average that is O(1) per operation over many operations. Still, in some applications this is not good enough; then you need "deleted" marks (a reserved vertex number suffices), with compaction performed only when the number of marked deletions is a fixed fraction of the vector. The code is tedious, and checking for deleted nodes in all operations adds overhead.
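A sketch of that tombstone scheme (the sentinel value and the 1/4 threshold are illustrative choices of mine):

#include <cstddef>
#include <vector>

constexpr int DELETED = -1; // reserved vertex number used as the mark

struct EdgeList {
    std::vector<int> edges;
    std::size_t n_deleted = 0;

    void soft_delete(std::size_t i) {
        edges[i] = DELETED;
        // Compact only when a fixed fraction of entries is dead.
        if (++n_deleted * 4 >= edges.size()) compact();
    }

    void compact() {
        std::erase(edges, DELETED); // C++20; or the erase-remove idiom
        n_deleted = 0;
    }
};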
Another difference is overhead space. Adjacency list nodes are quite small: Just a node number. Double links may require 4 times the space of the number itself (if the number is 32 bits and both pointers are 64). For a very large graph, a space overhead of 400% is not so good.
Finally, linked data structures that are frequently edited over a long period may easily lead to highly non-contiguous memory accesses. This decreases cache performance compared to linear access through vectors. So here the vector wins.
In most applications, the difference is not worth worrying about. Then again, huge graphs are the way of the modern world.
As others have said, it's a good idea to use a generalized List container for the adjacencies, one that may be quickly implemented either with linked nodes or vectors of nodes. E.g. in Java, you'd use List and implement/profile with both LinkedList and ArrayList to see which works best for your application. NB ArrayList compacts the array on every remove. There is no amortization as described above, although adds are amortized.
There are other variations: Suppose you have a very dense graph, where there's a frequent need to search all edges incident to a given node for one with a certain label. Then you want maps for the adjacencies, where the keys are edge labels. Of course the maps can be hashes or trees or skiplists or whatever you like.
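For example, per-node label maps might look like this (a sketch; the label type is an arbitrary choice):

#include <string>
#include <unordered_map>

struct Vertex;

// Adjacency keyed by edge label: finding the edge out of a node with a
// given label is a single hash lookup instead of a scan of all edges.
using Adjacency = std::unordered_map<std::string, Vertex*>;

struct Vertex {
    Adjacency out;
};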
The list goes on. How to implement for efficient vertex deletion? As you might expect, there are alternatives here, too, each with advantages and disadvantages.

Fast bucket implementation

In a graph class I need to handle nodes with integer values (mostly 1-1000). In every step I want to remove a node and all of its neighbors from the graph. Also, I always want to begin with the node of minimal value. I thought long about how to do this in the fastest possible manner and decided on the following:
The graph is stored using adjacency lists
There is a huge array std::vector<Node*> bucket[1000] to store the nodes by value
The index of the lowest non-empty bucket is always stored and kept track of
I can find the node of minimal value very fast by picking a random element at that index or, if the bucket is already empty, by increasing the index
Removing the selected node from its bucket can clearly be done in O(1); the problem is that to remove the neighbors, I first need to search bucket[value of neighbor] for every neighbor node, which is not really fast.
Is there a more efficient approach to this?
I thought of using something like std::list<Node*> bucket[1000] and assigning every node a pointer to its "list element", so that I can remove the node from the list in O(1). Is this possible with STL lists? Clearly it can be done with a normal doubly linked list that I could implement by hand.
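It is possible: std::list::erase takes an iterator and is O(1), and list iterators stay valid until their element is erased. A minimal sketch of that scheme (field and function names are mine):

#include <list>

struct Node {
    int value;                             // 1-1000
    std::list<Node*>::iterator bucket_pos; // where this node sits in its bucket
};

std::list<Node*> bucket[1000];

void insert(Node* n) {
    auto& b = bucket[n->value - 1];
    n->bucket_pos = b.insert(b.end(), n);      // remember our own position
}

void remove(Node* n) {
    bucket[n->value - 1].erase(n->bucket_pos); // O(1), no search needed
}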
I recently did something similar to this for a priority queue implementation using buckets.
What I did was use a hash table (unordered_map); that way, you don't need to store 1000 empty vectors and you still get O(1) random access (in the general case, not guaranteed). Now, if you only need to store/create this graph class one time, it probably doesn't matter. In my case I needed to create the priority queue tens/hundreds of times per second, and using the hash map made a huge difference (due to the fact that I only created unordered_sets when I actually had an element of that priority, so there was no need to initialize 1000 empty hash sets). Hash sets and maps are new in C++11, but have been available in std::tr1 for a while now, or you could use the Boost libraries.
The only difference that I can see between your use case and mine is that you also need to be able to remove neighboring nodes. I'm assuming every node contains a list of pointers to its neighbors. If so, deletion of the neighbors should take k * O(1), with k the number of neighbors (again, O(1) in general, not guaranteed; the worst case is O(n) in an unordered_map/set). You just go over every neighboring node and get its priority, which gives you the correct index into the hash map. Then you find the pointer in the hash set that the priority maps to; this search is in general O(1), and removing the element is again O(1) in general.
All in all, I think you have a pretty good idea of what to do, but I believe that using hash maps/sets will speed up your code by quite a lot (it depends on the exact usage, of course). For me, the speed improvement of an implementation with unordered_map<int, unordered_set> versus vector<set> was around 50x.
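A sketch of that layout (assuming each node stores its value and its neighbour pointers; names are mine):

#include <unordered_map>
#include <unordered_set>
#include <vector>

struct Node {
    int value;
    std::vector<Node*> neighbors;
};

// Only priorities that actually occur get a bucket.
std::unordered_map<int, std::unordered_set<Node*>> buckets;

void remove_node(Node* n) {
    buckets[n->value].erase(n);                  // average O(1)
}

void remove_with_neighbors(Node* n) {
    remove_node(n);
    for (Node* m : n->neighbors) remove_node(m); // k * average O(1)
}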
Here's what I would do. Node structure:
struct Node {
    std::vector<Node*>::const_iterator first_neighbor;
    std::vector<Node*>::const_iterator last_neighbor;
    int value;
    bool deleted;
};
Concatenate the adjacency lists and put them in a single std::vector<Node*> to lower the overhead of memory management. I'm using soft deletes so update speed is not important.
Sort pointers to the nodes by value into another std::vector<Node*> with a counting sort. Mark all nodes as not deleted.
Iterate through the nodes in sorted order. If the node under consideration has been deleted, go to the next one. Otherwise, mark it deleted and iterate through its neighbors and mark them deleted.
If your nodes are stored contiguously in memory, then you can omit last_neighbor at the cost of an extra sentinel node at the end of the structure, because last_neighbor of a node is first_neighbor of the succeeding node.
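A minimal sketch of that sweep, using the Node structure above (assuming sorted is the vector produced by the counting sort):

void sweep(const std::vector<Node*>& sorted) {
    for (Node* n : sorted) {
        if (n->deleted) continue;   // already removed as someone's neighbor
        n->deleted = true;          // take this node...
        for (auto it = n->first_neighbor; it != n->last_neighbor; ++it)
            (*it)->deleted = true;  // ...and soft-delete all of its neighbors
    }
}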