Creating a fast dictionary - c++

For a school task I have to create a graph and do some stuff with it. In the input, each vertex has an ID that is a number from 0 to 999'999'999. Since I cannot create an array that large, I can't really use this ID directly as an index into an adjacency matrix.
My first solution was to create a separate ID, independent of the original one, and store the mapping in some kind of dictionary/map thing, but since I get 10'000 vertex records the lookup is probably bound to get slow. The algorithm has to stay under O(n^2), and I already have a BFS and a toposort in there.
What would be the best solution in this case? As a side note: I can't use already established libraries (so no graph, map, vector, string classes, etc.), but I can code them myself if that is the best option.

What you want is a binary search tree to do lookups in O(log n) time, or a hash map to do lookups in ~O(1) time. Alternatively, you can go the array route, in which case the size of your array would be the maximum value your ID can have (in your case, 10^9).
As @amit told you, look into AVL/Red-Black trees and hash maps. There's no better way to do lookups below O(n) unless you can change the topology of the graph to turn it into a "search graph".
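Since established libraries are off-limits here, one workable approach is a small hand-rolled hash table that remaps the sparse 0-999'999'999 IDs to dense indices 0..N-1, which can then index an ordinary adjacency structure. The sketch below uses open addressing with linear probing; the capacity, hash constant, and names are illustrative assumptions, not part of the original answer.

#include <cstdio>

// Minimal open-addressing hash map: original vertex ID -> compact index.
// Capacity is a power of two comfortably above 10'000 entries (assumption).
const int CAP = 1 << 15;          // 32768 slots
int keys[CAP];                    // original IDs, -1 marks an empty slot
int vals[CAP];                    // compact indices
int nextIndex = 0;

void initMap() {
    for (int i = 0; i < CAP; ++i) keys[i] = -1;
}

// Returns the compact index for id, assigning a new one on first sight.
int getIndex(int id) {
    int h = (id * 2654435761u) & (CAP - 1);   // multiplicative hash
    while (keys[h] != -1 && keys[h] != id)
        h = (h + 1) & (CAP - 1);              // linear probing
    if (keys[h] == -1) {                      // not seen before
        keys[h] = id;
        vals[h] = nextIndex++;
    }
    return vals[h];
}

int main() {
    initMap();
    printf("%d\n", getIndex(999999999));  // 0
    printf("%d\n", getIndex(42));         // 1
    printf("%d\n", getIndex(999999999));  // 0 again
}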

Why do you need to create an array of size 1 billion? You can simply create an adjacency matrix or adjacency list sized by the number of nodes.
Whether the number of vertices is constant or not, I'd suggest you go for an adjacency list. For example, if you have 10 nodes, you create an array of size 10 and then, for each node, a list of its edges.
Consider a small example graph with only a handful of edges: do you really think you need 10^10 elements in the adjacency structure instead of 4?
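For the same no-libraries constraint, an adjacency list can be built from plain arrays. This sketch assumes the vertex IDs have already been remapped to 0..N-1 (for example with a hash table like the one above); the array sizes and names are illustrative.

#include <cstdio>

// Adjacency list stored in plain arrays ("linked lists inside arrays").
const int MAXV = 10000;            // max vertices (assumption)
const int MAXE = 200000;           // max directed edges (assumption)

int head[MAXV];                    // head[v] = first edge of v, -1 if none
int edgeTo[MAXE];                  // edgeTo[e] = target vertex of edge e
int nextEdge[MAXE];                // nextEdge[e] = next edge with the same source
int edgeCount = 0;

void initGraph(int n) {
    for (int v = 0; v < n; ++v) head[v] = -1;
}

void addEdge(int from, int to) {
    edgeTo[edgeCount] = to;
    nextEdge[edgeCount] = head[from];
    head[from] = edgeCount++;
}

int main() {
    initGraph(4);
    addEdge(0, 1);
    addEdge(0, 2);
    addEdge(2, 3);
    // iterate over the neighbours of vertex 0
    for (int e = head[0]; e != -1; e = nextEdge[e])
        printf("0 -> %d\n", edgeTo[e]);
}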

Related

What containers in the C++ STL can store a small number of integer values and find them in O(1)?

Suppose I want to create a vector of vectors to store/find the edges between the nodes in a graph. There are many nodes in the graph that don't have any edges, and I don't want to store them; e.g., there are 2 million nodes and 1.5 million of them don't have any edges.
Moreover, each node that I do store could have anywhere from one to a couple of hundred edges.
After I have saved all the edges, I want to remove the edges that do not exist in both directions. So if edge (i,j) exists but edge (j,i) doesn't, I want to erase (i,j).
I used "vector of vectors" to describe what I want to create, and I know a fully dense version of it doesn't scale.
So, using the vector-of-vectors format, I go through the outer vector (say the current index is i), and for each item j in its inner vector I need to check whether i appears in the inner vector of j. This check needs to be fast, preferably constant time, something like a hash table; I think std::set might help.
At this point, if the other edge (j,i) does not exist, I need to remove the current edge (i,j).
What would be a good container for my scenario?
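A minimal sketch of the removal pass described above, assuming std containers are allowed: each stored node keeps its neighbours in an unordered_set (so nodes without edges never take space), and an edge is dropped when its reverse is missing. The specific containers are an assumption, not something fixed by the question.

#include <unordered_map>
#include <unordered_set>
#include <cstdio>

int main() {
    // adj[i] = set of neighbours of i; nodes without edges never appear.
    std::unordered_map<int, std::unordered_set<int>> adj;

    auto addEdge = [&](int i, int j) { adj[i].insert(j); };
    addEdge(1, 2);
    addEdge(2, 1);   // symmetric pair, will be kept
    addEdge(1, 3);   // (3,1) is missing, so (1,3) will be removed

    // Remove every edge (i, j) whose reverse (j, i) does not exist.
    for (auto& [i, neighbours] : adj) {
        for (auto it = neighbours.begin(); it != neighbours.end(); ) {
            int j = *it;
            auto rev = adj.find(j);
            bool hasReverse = (rev != adj.end() && rev->second.count(i) > 0);
            if (hasReverse) ++it;
            else it = neighbours.erase(it);
        }
    }

    for (auto& [i, neighbours] : adj)
        for (int j : neighbours)
            printf("%d -> %d\n", i, j);
}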

Count of previously smaller elements encountered in an input stream of integers?

Given an input stream of numbers ranging from 1 to 10^5 (non-repeating), we need to be able to tell at each point how many numbers smaller than the current one have been previously encountered.
I tried using std::set in C++ to maintain the elements already encountered and then taking upper_bound on the set for the current number. But upper_bound gives me an iterator, and then I have to walk the set or use std::distance, which is again linear in time.
Can I maintain some other data structure or follow some other algorithm in order to achieve this task more efficiently?
EDIT: Found an older question related to Fenwick trees that is helpful here. By the way, I have now solved this problem using segment trees, taking hints from @doynax's comment.
How to use Binary Indexed tree to count the number of elements that is smaller than the value at index?
Regardless of the container you are using, it is a very good idea to keep the elements in sorted order, so that at any point you can get an element's index (or iterator) and thus know how many elements come before it.
You need to implement your own binary search tree. Each node should store counters with the sizes of its child subtrees.
Insertion into the binary tree takes O(log n). During the insertion, the counters of all ancestors of the new element are incremented, which is also O(log n).
The number of elements smaller than the new element can be derived from the stored counters in O(log n).
So the total running time is O(n log n).
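A minimal sketch of this idea, assuming non-repeating values as stated in the question: a plain (unbalanced) BST augmented with subtree sizes. A real solution would use an AVL/Red-Black tree, or the Fenwick tree mentioned in the question's edit, to guarantee the O(log n) bound per insertion.

#include <cstdio>

// BST node augmented with the size of its subtree.
struct Node {
    int value;
    int size = 1;              // number of nodes in this subtree
    Node* left = nullptr;
    Node* right = nullptr;
    Node(int v) : value(v) {}
};

// Inserts value and returns how many previously inserted values are smaller.
int insertAndCount(Node*& root, int value) {
    if (!root) { root = new Node(value); return 0; }
    ++root->size;
    if (value < root->value)
        return insertAndCount(root->left, value);
    // everything in the left subtree, plus the root itself, is smaller
    int leftSize = root->left ? root->left->size : 0;
    return leftSize + 1 + insertAndCount(root->right, value);
}

int main() {
    Node* root = nullptr;
    int stream[] = {5, 2, 8, 6, 3};
    for (int x : stream)
        printf("%d -> %d smaller seen before\n", x, insertAndCount(root, x));
    // prints 0, 0, 2, 2, 1
}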
Keep your table sorted at each step and use binary search. At each point, when you search for the number just given by the input stream, binary search finds either the next greater number or the next smaller one. Using that comparison you can find the current input's insertion index, and that index is the count of numbers less than the current one. Because each insertion into the sorted table still shifts elements in linear time, this approach takes O(n^2) time overall.
What if you used insertion sort to store each number into a linked list? Then you can count the number of elements less than the new one when finding where to put it in the list.
It depends on whether you want to use std or not. In certain situations, some parts of std are inefficient. (For example, std::vector can be considered inefficient in some cases due to the amount of dynamic allocation that occurs.) It's a case-by-case type of thing.
One possible solution here might be to use a skip list (a relative of linked lists), as it is easier and more efficient to insert an element into a skip list than into an array.
You would have to use the skip list approach to get binary-search-like insertion, since one cannot binary search a normal linked list. If you're tracking the total length with an accumulator, returning the number of larger elements is as simple as length - index.
One more thing to weigh: std::set::insert() is already O(log n) even without a hint, so whether a hand-rolled structure actually wins on efficiency is an open question.

Prim's algorithm for dynamic locations

Suppose you have an input file:
<total vertices>
<x-coordinate 1st location> <y-coordinate 1st location>
<x-coordinate 2nd location> <y-coordinate 2nd location>
<x-coordinate 3rd location> <y-coordinate 3rd location>
...
How can Prim's algorithm be used to find the MST for these locations? I understand this problem is typically solved using an adjacency matrix. Any references would be great if applicable.
If you already know Prim's, it is easy. Create an adjacency matrix adj[i][j] = distance between location i and location j.
I'm just going to describe some implementations of Prim's and hopefully that gets you somewhere.
First off, your question doesn't specify how edges are input to the program. You have a total number of vertices and the locations of those vertices. How do you know which ones are connected?
Assuming you have the edges and their weights (as @doomster said above, the weight may simply be the planar distance between the points, since they are coordinates), we can start thinking about our implementation. Wikipedia describes three different data structures that result in three different run times: http://en.wikipedia.org/wiki/Prim's_algorithm#Time_complexity
The simplest is the adjacency matrix. As you might guess from the name, the matrix describes nodes that are "adjacent". To be precise, there are |v| rows and columns (where |v| is the number of vertices). The value at adjacencyMatrix[i][j] varies depending on the usage. In our case it's the weight of the edge (i.e. the distance) between node i and j (this means that you need to index the vertices in some way. For instance, you might add the vertices to a list and use their position in the list).
Now using this adjacency matrix our algorithm is as follows:
Create a dictionary which contains all of the vertices and is keyed by "distance". Initially the distance of all of the nodes is infinity.
Create another dictionary to keep track of "parents". We use this to generate the MST. It's more natural to keep track of edges, but it's actually easier to implement by keeping track of "parents". Note that if you root a tree (i.e. designate some node as the root), then every node (other than the root) has precisely one parent. So by producing this dictionary of parents we'll have our MST!
Create a new list with a randomly chosen node v from the original list.
Remove v from the distance dictionary and add it to the parent dictionary with a null as its parent (i.e. it's the "root").
Go through the row in the adjacency matrix for that node. For any node w that is connected (for non-connected nodes you have to set their adjacency matrix value to some special value: 0, -1, INT_MAX, etc.), update its "distance" in the dictionary to adjacencyMatrix[v][w]. The idea is that it's not "infinitely far away" anymore... we know we can get there from v.
While the dictionary is not empty (i.e. while there are nodes we still need to connect to)
Look over the dictionary and find the vertex with the smallest distance x
Add it to our new list of vertices
For each of its neighbors, update their distance to min(adjacencyMatrix[x][neighbor], distance[neighbor]) and also update their parent to x. Basically, if there is a faster way to get to neighbor then the distance dictionary should be updated to reflect that; and if we then add neighbor to the new list we know which edge we actually added (because the parent dictionary says that its parent was x).
We're done. Output the MST however you want (everything you need is contained in the parents dictionary)
I admit there is a bit of a leap from the wikipedia page to the actual implementation as outlined above. I think the best way to approach this gap is to just brute force the code. By that I mean, if the pseudocode says "find the min [blah] such that [foo] is true" then write whatever code you need to perform that, and stick it in a separate method. It'll definitely be inefficient, but it'll be a valid implementation. The issue with graph algorithms is that there are 30 ways to implement them and they are all very different in performance; the wikipedia page can only describe the algorithm conceptually. The good thing is that once you implement it some way, you can find optimizations quickly ("oh, if I keep track of this state in this separate data structure, I can make this lookup way faster!"). By the way, the runtime of this is O(|V|^2). I'm too lazy to detail that analysis, but loosely it's because:
All initialization is O(|V|) at worst
We do the loop O(|V|) times and take O(|V|) time to look over the dictionary to find the minimum node. So basically the total time to find the minimum node multiple times is O(|V|^2).
The time it takes to update the distance dictionary is O(|E|) because we only process each edge once. Since |E| is O(|V|^2) this is also O(|V|^2)
Keeping track of the parents is O(|V|)
Outputting the tree is O(|V| + |E|) = O(|E|) at worst
Adding all of these (none of them should be multiplied together, except within the second point) we get O(|V|^2)
The implementation with a heap is O(|E| log |V|), and it's very similar to the above. The only difference is that updating a distance is O(log |V|) instead of O(1) (because it's a heap), BUT finding/removing the min element is O(log |V|) instead of O(|V|) (because it's a heap). The analysis is quite similar and you end up with something like O(|V| log |V| + |E| log |V|) = O(|E| log |V|), as desired.
Actually... I'm a bit confused why the adjacency matrix implementation cares about it being an adjacency matrix. It could just as well be implemented using an adjacency list. I think the key part is how you store the distances. I could be way off in my implementation outlined above, but I am pretty sure it implements Prim's algorithm and satisfies the time complexity constraints outlined by Wikipedia.
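A rough sketch of the O(|V|^2) version outlined above, reading the coordinate input from the question and treating the graph as complete, with the planar distance as the edge weight; the use of std containers and the variable names are illustrative assumptions.

#include <cstdio>
#include <cmath>
#include <vector>
#include <limits>

int main() {
    int n;
    if (scanf("%d", &n) != 1 || n <= 0) return 1;

    std::vector<double> xs(n), ys(n);
    for (int i = 0; i < n; ++i) scanf("%lf %lf", &xs[i], &ys[i]);

    auto dist = [&](int i, int j) {
        double dx = xs[i] - xs[j], dy = ys[i] - ys[j];
        return std::sqrt(dx * dx + dy * dy);    // complete graph: weight = planar distance
    };

    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> best(n, INF);   // cheapest known edge into the tree
    std::vector<int> parent(n, -1);     // which tree vertex that edge comes from
    std::vector<bool> inTree(n, false);
    best[0] = 0.0;                      // start from vertex 0 (the "root")

    double total = 0.0;
    for (int iter = 0; iter < n; ++iter) {
        // find the not-yet-included vertex with the smallest distance
        int v = -1;
        for (int u = 0; u < n; ++u)
            if (!inTree[u] && (v == -1 || best[u] < best[v])) v = u;

        inTree[v] = true;
        total += best[v];
        if (parent[v] != -1) printf("edge %d - %d\n", parent[v], v);

        // relax the distances of the remaining vertices through v
        for (int w = 0; w < n; ++w) {
            if (inTree[w]) continue;
            double d = dist(v, w);
            if (d < best[w]) { best[w] = d; parent[w] = v; }
        }
    }
    printf("MST total weight: %f\n", total);
}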

Hashmap to implement adjacency lists

I've implemented an adjacency list using the vector-of-vectors approach, where the nth element of the outer vector refers to the friend list of node n.
I was wondering if a hash map data structure would be more useful. I still have hesitations because I cannot quite identify the difference between them: if, for example, I want to search or delete among the nth element's neighbours, how could a hash map be more efficient than the vector-of-vectors approach?
A vector<vector<ID>> is a good approach if the set of nodes is fixed. If however you suddenly decide to remove a node, you'll be annoyed. You cannot shrink the vector because it would displace the elements stored after the node and you would lose the references. On the other hand, if you keep a list of free (reusable) IDs on the side, you can just "nullify" the slot and then reuse later. Very efficient.
An unordered_map<ID, vector<ID>> allows you to delete nodes much more easily. You can go ahead and assign new IDs to newly created nodes and you will not be losing empty slots. It is not as compact, especially on collisions, but not so bad either. There can be some slowdowns on rehashing, when a vector needs to be moved, with older compilers.
Finally, an unordered_multimap<ID, ID> is probably one of the easiest to manage. It also scatters memory to the wind, but hey :)
Personally, I would start prototyping with an unordered_multimap<ID, ID> and switch to another representation only if it proves too slow for my needs.
Note: if the adjacency relationship is symmetric, you can cut the number of stored entries in half by establishing that the relation (x, y) is stored under min(x, y) only.
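A minimal sketch of the unordered_multimap<ID, ID> representation: one entry per directed edge, equal_range to iterate a node's neighbours, and erase(key) to drop a node's outgoing edges in one call. The example edges are made up.

#include <unordered_map>
#include <cstdio>

using ID = int;

int main() {
    // One entry per directed edge: key = node, value = neighbour.
    std::unordered_multimap<ID, ID> adj;

    adj.insert({0, 1});
    adj.insert({0, 2});
    adj.insert({1, 2});

    // Iterate over the neighbours of node 0.
    auto range = adj.equal_range(0);
    for (auto it = range.first; it != range.second; ++it)
        printf("0 -> %d\n", it->second);

    // Deleting a node's outgoing edges is a single call.
    adj.erase(0);
    printf("edges left: %zu\n", adj.size());   // 1
}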
Vector of vectors
A vector of vectors is a good solution when you don't need to delete edges.
You can add an edge in O(1) and iterate over a node's neighbours in O(N).
You can delete an edge by erasing it from vector[node], but that will be slow: the complexity is linear in the number of that node's neighbours (up to O(number of vertices)).
Hash map
I am not sure how you want to use a hash map. If inserting an edge means setting hash_map[edge] = 1, then notice that you are unable to iterate over a node's neighbours.

Effective data structure for both deleteMin and search by key operations

I have 100 sets of A objects, each set corresponding to a query point Qi, 1 <= i <= 100.
class A {
    int id;
    int distance;
    float x;
    float y;
};
In each iteration of my algorithm, I select one query point Qi and extract from the corresponding set the object having the minimum distance value. Then, I have to find this specific object in all 100 sets, searching with its id, and remove all those objects.
If I use a heap for each set of objects, it is cheap to extract the object with the minimum distance. However, I will not be able to find the same object in the other heaps by searching with its id, because each heap is organized by the distance value. Further, updating the heap is expensive.
Another option I have considered is using a map<id, (distance, x, y)> for each set. This way searching (find operation) by id is cheap. However, extracting the element with the minimum value takes linear time (it has to examine every element in the map).
Is there any data structure that I could use that is efficient for both the operations I need?
extract_min(distance)
find(id)
Thanks in advance!
std::map or boost::multi_index
You could use a tree map.
One simple approach is to have two maps for each data set. The first one contains all the data items keyed (and thus sorted) by id. The second would be a multimap from distance to id, ordered by distance, so finding the id with the lowest distance is cheap. You could use a map instead of a multimap if you know that distances will always be unique.
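A minimal sketch of this two-map idea for a single set, assuming std containers are fine; the struct mirrors the question's class A, while the names and helper functions are illustrative.

#include <map>
#include <cstdio>

struct A { int id; int distance; float x; float y; };

struct Set {
    std::map<int, A> byId;                  // id -> object
    std::multimap<int, int> byDistance;     // distance -> id

    void insert(const A& a) {
        byId[a.id] = a;
        byDistance.insert({a.distance, a.id});
    }

    // The smallest distance is at the front of byDistance (begin is O(1), erase O(log n)).
    A extractMin() {
        auto it = byDistance.begin();
        A a = byId[it->second];
        byId.erase(it->second);
        byDistance.erase(it);
        return a;
    }

    // O(log n) removal by id (used when another set's minimum must be removed here too).
    void eraseById(int id) {
        auto it = byId.find(id);
        if (it == byId.end()) return;
        auto range = byDistance.equal_range(it->second.distance);
        for (auto d = range.first; d != range.second; ++d)
            if (d->second == id) { byDistance.erase(d); break; }
        byId.erase(it);
    }
};

int main() {
    Set s;
    s.insert({1, 70, 0.f, 0.f});
    s.insert({2, 30, 1.f, 1.f});
    s.insert({3, 50, 2.f, 2.f});

    A m = s.extractMin();
    printf("min id=%d distance=%d\n", m.id, m.distance);   // id=2 distance=30
    s.eraseById(3);
    printf("remaining: %zu\n", s.byId.size());             // 1
}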
In addition to including a map as suggested by many above, you could replace your minimum heap with a structure whose extract-min has constant runtime complexity; your current version's extract-min is O(log_2(n)).
Since the range of your distances is small, you could use a "Dial's array" approach, where the keys are used as in counting sort. Because an array slot may hold more than one item, and you don't care about the order of equal-valued items, you would use a doubly linked list as the array's element type. The Andrew Goldberg and Tarjan papers on faster Dijkstra's algorithms discuss this in more detail.
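A minimal sketch of such a bucket ("Dial's array") queue, assuming distances fall in a small known range [0, maxDistance); the names and the use of std::list are illustrative.

#include <list>
#include <vector>
#include <cstdio>

// Bucket queue keyed by distance. Extract-min scans forward from the last
// minimum, which is O(1) amortized when the extracted keys never decrease
// (as in Dijkstra/Dial's algorithm).
struct BucketQueue {
    std::vector<std::list<int>> buckets;   // buckets[d] = ids with distance d
    int cursor = 0;                        // smallest possibly non-empty bucket

    explicit BucketQueue(int maxDistance) : buckets(maxDistance) {}

    void insert(int id, int distance) {
        buckets[distance].push_back(id);
        if (distance < cursor) cursor = distance;   // tolerate out-of-order inserts
    }

    int extractMin() {                     // precondition: queue is not empty
        while (buckets[cursor].empty()) ++cursor;
        int id = buckets[cursor].front();
        buckets[cursor].pop_front();
        return id;
    }
};

int main() {
    BucketQueue q(100);                    // distances in [0, 100)
    q.insert(7, 30);
    q.insert(9, 10);
    q.insert(4, 10);
    printf("%d\n", q.extractMin());        // 9 (distance 10)
    printf("%d\n", q.extractMin());        // 4 (distance 10)
    printf("%d\n", q.extractMin());        // 7 (distance 30)
}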