I need an algorithm to find a maximum independent subgroup of hashmaps, where the hashmaps are stored in an array.
I tried iterating over the array, taking one index at a time and checking which hashmaps in the array are not independent of the hashmap at that index. It worked, but it fails in cases such as:
A and B independent
B and C independent
but A and C may still be dependent
Definition of maximum independent subgroup of hashmaps:
I have an array which contains hashmaps, and every hashmap contains keys. Two hashmaps are called independent if no key of the first hashmap is contained in the second. I have to find a largest subgroup of those hashmaps in which every pair is independent.
First of all, this problem is NP-hard (its decision version is NP-complete).
To prove this, take an arbitrary graph and number its edges.
Create a HashMap for every vertex and fill it with the indices of all edges incident to that vertex.
Then two HashMaps are independent exactly when they do not contain the same edge, that is, when the respective vertices are not adjacent.
Finding a maximum independent subset of these HashMaps hence gives you a maximum independent set in the graph, which is a well-known NP-hard problem.
You can solve this problem by constructing a graph, with a vertex for each HashMap and adding an edge for every two dependent HashMaps and then using some algorithm for independent sets.
Another way is to take the complement graph and find a maximum clique.
Since exact algorithms for this take exponential time in the worst case, you may consider using some approximation algorithm or heuristic.
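As a rough illustration of the graph construction and a heuristic on top of it, here is a minimal sketch. It assumes each HashMap is reduced to its set of integer keys (the KeySet alias and both function names are made up for this example), and it uses a simple greedy rule that always picks the map with the fewest remaining conflicts. The result is always pairwise independent, but it is not guaranteed to be a maximum set.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the key set of one HashMap.
using KeySet = std::unordered_set<int>;

// Two maps are dependent if they share at least one key.
bool dependent(const KeySet& a, const KeySet& b) {
    const KeySet& fewer = a.size() <= b.size() ? a : b;
    const KeySet& more = a.size() <= b.size() ? b : a;
    for (int key : fewer)
        if (more.count(key) != 0) return true;
    return false;
}

// Greedy heuristic: repeatedly pick the remaining map with the fewest
// remaining conflicts, then discard every map that conflicts with it.
// Returns indices of pairwise independent maps (not necessarily a maximum set).
std::vector<std::size_t> greedyIndependent(const std::vector<KeySet>& maps) {
    const std::size_t n = maps.size();
    // Conflict graph: one vertex per map, an edge for every dependent pair.
    std::vector<std::vector<bool>> conflict(n, std::vector<bool>(n, false));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (dependent(maps[i], maps[j])) {
                conflict[i][j] = true;
                conflict[j][i] = true;
            }

    std::vector<bool> alive(n, true);
    std::vector<std::size_t> chosen;
    for (;;) {
        std::size_t best = n, bestDegree = n;
        for (std::size_t i = 0; i < n; ++i) {
            if (!alive[i]) continue;
            std::size_t degree = 0;
            for (std::size_t j = 0; j < n; ++j)
                if (alive[j] && conflict[i][j]) ++degree;
            if (degree < bestDegree) { best = i; bestDegree = degree; }
        }
        if (best == n) break;  // nothing left to pick
        chosen.push_back(best);
        alive[best] = false;
        for (std::size_t j = 0; j < n; ++j)
            if (conflict[best][j]) alive[j] = false;
    }
    // Debug-only sanity check: the chosen maps are pairwise independent.
    for (std::size_t a : chosen)
        for (std::size_t b : chosen)
            assert(a == b || !conflict[a][b]);
    return chosen;
}
```

Building the conflict graph this way costs one key-set comparison per pair of maps. An exact solver could be plugged in after that step if the input is small enough.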
For a school task I have to create a graph and do some stuff with it. In input each vertex has an ID that is a number 0-999'999'999. As I can not create an array this long, I can't really use this ID as a key in adjacency matrix.
My first solution was to create a separate ID that is independent of the original one and store it in some kind of dictionary/map, but as I get 10'000 vertex records, the lookup is probably bound to get slow. The algorithm has to be under O(n^2) and I already have a BFS and a toposort in there.
What would be the best solution in this case? As a side note - I can't use already established libraries (so I can't use graph, map, vector, string classes etc.), but I can code them myself, if that is the best option.
What you want is a binary search tree to do lookups in O(log n) time, or a hash map to do lookups in ~O(1) time, OR you can go the array route, in which case the size of your array would be the max value your ID can have (in your case, 10^9).
As @amit told you, check AVL/Red-Black trees and hash maps. There's no better way to do lookups in a graph below O(n) unless you can change the topology of the graph to turn it into a "search graph".
Why do you need to create an array of size 1 billion? You can simply create an adjacency matrix or adjacency list sized by the number of nodes.
Whether the number of vertices is constant or not, I'd suggest you go for an adjacency list. For example, if you have 10 nodes, you need to create an array of size 10, then for each node create a list of edges, as you can see in the link above.
Consider this graph: do you really think you need to have 10^9 elements in the adjacency list instead of 4 elements?
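One way to combine the suggestions in this thread is to remap the sparse IDs to dense indices 0..n-1 and size the adjacency list by n. The sketch below is only an illustration: CompactGraph, indexOf and buildGraph are made-up names, edges are assumed to be directed pairs of original IDs, and the standard containers and std::sort are used for brevity (under the asker's restrictions they would be replaced by hand-written equivalents). Lookup is a binary search over the sorted IDs, so it costs O(log n).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct CompactGraph {
    std::vector<std::uint32_t> ids;             // sorted, unique original IDs
    std::vector<std::vector<std::size_t>> adj;  // adjacency list over dense indices

    // Dense index of an original ID, found by binary search in O(log n).
    std::size_t indexOf(std::uint32_t id) const {
        auto it = std::lower_bound(ids.begin(), ids.end(), id);
        assert(it != ids.end() && *it == id);  // the ID must be a known vertex
        return static_cast<std::size_t>(it - ids.begin());
    }
};

// Builds the graph from directed edges given as (from ID, to ID) pairs.
CompactGraph buildGraph(
    const std::vector<std::pair<std::uint32_t, std::uint32_t>>& edges) {
    CompactGraph g;
    for (const auto& e : edges) {
        g.ids.push_back(e.first);
        g.ids.push_back(e.second);
    }
    // Sort and deduplicate so that position in `ids` is the dense index.
    std::sort(g.ids.begin(), g.ids.end());
    g.ids.erase(std::unique(g.ids.begin(), g.ids.end()), g.ids.end());

    g.adj.resize(g.ids.size());
    for (const auto& e : edges)
        g.adj[g.indexOf(e.first)].push_back(g.indexOf(e.second));
    return g;
}
```

After the remapping, BFS and topological sort can work purely on indices 0..n-1, and the original ID of vertex i is simply ids[i].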
My program contains polygons which have the form of a vector containing points (2 dimensional double coordinates, stored in a self-made structure). I'm looking for a quick way of finding the smallest square containing my polygon (ie. knowing the maximal and minimal coordinates of all the points).
Is there a quicker way than just parsing all the points and storing the minimum and maximum values?
The algorithm you are describing is straightforward: iterate over all your points and find the minimum and maximum for each coordinate. This is an O(n) algorithm, n being the number of points you have.
You can't do better, since you will need to check at least all your points once, otherwise the last one could be outside the square you found.
Now, the complexity is at best O(n), so all you can do is minimize the constant factors, and in this case they are already pretty small: a single loop over your vector, tracking two maximums and two minimums.
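For reference, the single pass described above might look like the following minimal sketch. Point stands in for the asker's own structure, and Box and boundingBox are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point { double x, y; };                   // stand-in for the self-made structure
struct Box { double minX, minY, maxX, maxY; };   // axis-aligned bounding box

// One pass over the points, tracking both minimums and both maximums.
Box boundingBox(const std::vector<Point>& polygon) {
    assert(!polygon.empty());  // a bounding box needs at least one point
    Box b{polygon[0].x, polygon[0].y, polygon[0].x, polygon[0].y};
    for (const Point& p : polygon) {
        b.minX = std::min(b.minX, p.x);
        b.minY = std::min(b.minY, p.y);
        b.maxX = std::max(b.maxX, p.x);
        b.maxY = std::max(b.maxY, p.y);
    }
    return b;
}
```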
You can either iterate through all points and find the max and min values, or do some preprocessing, for example, store your points in a treap (http://en.wikipedia.org/wiki/Treap).
Without some preprocessing, there is no way to do better than just iterating over all points.
I'm not sure if there can be any faster way to find the min & max values in an array of values than linear time. The only 'optimization' I can think of is to find these values on one of the other occasions you're iterating the array (filling it/performing a function on all points), then perform checks on any data update.
Suppose you have an input file:
<total vertices>
<x-coordinate 1st location><y-coordinate 1st location>
<x-coordinate 2nd location><y-coordinate 2nd location>
<x-coordinate 3rd location><y-coordinate 3rd location>
How can Prim's algorithm be used to find the MST for these locations? I understand this problem is typically solved using an adjacency matrix. Any references would be great if applicable.
If you already know Prim's algorithm, it is easy. Create an adjacency matrix where adj[i][j] = distance between location i and location j.
I'm just going to describe some implementations of Prim's and hopefully that gets you somewhere.
First off, your question doesn't specify how edges are input to the program. You have a total number of vertices and the locations of those vertices. How do you know which ones are connected?
Assuming you have the edges and the weights of those edges (as @doomster said above, the weight may be the planar distance between the points, since they are coordinates), we can start thinking about our implementation. Wikipedia describes three different data structures that result in three different run times: http://en.wikipedia.org/wiki/Prim's_algorithm#Time_complexity
The simplest is the adjacency matrix. As you might guess from the name, the matrix describes nodes that are "adjacent". To be precise, there are |v| rows and columns (where |v| is the number of vertices). The value at adjacencyMatrix[i][j] varies depending on the usage. In our case it's the weight of the edge (i.e. the distance) between node i and j (this means that you need to index the vertices in some way. For instance, you might add the vertices to a list and use their position in the list).
Now using this adjacency matrix our algorithm is as follows:
Create a dictionary that maps each vertex to its "distance", meaning the weight of the cheapest known edge connecting it to the tree built so far. Initially the distance of all of the nodes is infinity.
Create another dictionary to keep track of "parents". We use this to generate the MST. It's more natural to keep track of edges, but it's actually easier to implement by keeping track of "parents". Note that if you root a tree (i.e. designate some node as the root), then every node (other than the root) has precisely one parent. So by producing this dictionary of parents we'll have our MST!
Create a new list with a randomly chosen node v from the original list.
Remove v from the distance dictionary and add it to the parent dictionary with a null as its parent (i.e. it's the "root").
Go through the row in the adjacency matrix for that node. For any node w that is connected to v (for non-connected nodes you have to set their adjacency matrix value to some special value: 0, -1, int max, etc.), update its "distance" in the dictionary to adjacencyMatrix[v][w] and record v as its parent. The idea is that it's not "infinitely far away" anymore... we know we can get there from v.
While the dictionary is not empty (i.e. while there are nodes we still need to connect to)
Look over the dictionary, find the vertex x with the smallest distance, and remove it from the dictionary
Add it to our new list of vertices
For each of its neighbors still in the dictionary, compare adjacencyMatrix[x][neighbor] with distance[neighbor]. If the edge from x is shorter, update the distance to adjacencyMatrix[x][neighbor] and set the neighbor's parent to x; otherwise leave both unchanged. Basically, if there is a faster way to get to neighbor then the distance dictionary should be updated to reflect that; and if we then add neighbor to the new list we know which edge we actually added (because the parent dictionary says that its parent was x).
We're done. Output the MST however you want (everything you need is contained in the parents dictionary)
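The steps above can be sketched roughly as follows for the coordinate input from the question. This is a minimal sketch under some assumptions: every pair of locations is taken to be connected, with the Euclidean distance as the edge weight (computed on the fly instead of being stored in a matrix), the "dictionaries" are plain vectors indexed by vertex, and vertex 0 is used as the root rather than a random one. Location, dist and primMST are names made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Location { double x, y; };

// Edge weight: planar (Euclidean) distance between two locations.
double dist(const Location& a, const Location& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Returns parent[i] for every vertex; the root (vertex 0) has parent -1.
// The MST edges are exactly the pairs (i, parent[i]) for i != root.
std::vector<int> primMST(const std::vector<Location>& pts) {
    assert(!pts.empty());
    const std::size_t n = pts.size();
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> distance(n, INF);  // cheapest known edge into the tree
    std::vector<int> parent(n, -1);
    std::vector<bool> inTree(n, false);
    distance[0] = 0.0;  // start from vertex 0

    for (std::size_t step = 0; step < n; ++step) {
        // Linear scan for the closest vertex not yet in the tree: O(|V|).
        std::size_t x = n;
        for (std::size_t i = 0; i < n; ++i)
            if (!inTree[i] && (x == n || distance[i] < distance[x])) x = i;
        inTree[x] = true;

        // Relax every vertex still outside the tree; the parent changes
        // only when the edge from x is strictly shorter.
        for (std::size_t w = 0; w < n; ++w) {
            if (inTree[w]) continue;
            const double d = dist(pts[x], pts[w]);
            if (d < distance[w]) {
                distance[w] = d;
                parent[w] = static_cast<int>(x);
            }
        }
    }
    return parent;
}
```

The outer loop runs |V| times and each iteration does two O(|V|) scans, which gives the O(|V|^2) bound.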
I admit there is a bit of a leap from the wikipedia page to the actual implementation as outlined above. I think the best way to approach this gap is to just brute force the code. By that I mean, if the pseudocode says "find the min [blah] such that [foo] is true" then write whatever code you need to perform that, and stick it in a separate method. It'll definitely be inefficient, but it'll be a valid implementation. The issue with graph algorithms is that there are 30 ways to implement them and they are all very different in performance; the wikipedia page can only describe the algorithm conceptually. The good thing is that once you implement it some way, you can find optimizations quickly ("oh, if I keep track of this state in this separate data structure, I can make this lookup way faster!"). By the way, the runtime of this is O(|V|^2). I'm too lazy to detail that analysis, but loosely it's because:
1. All initialization is O(|V|) at worst
2. We do the loop O(|V|) times and take O(|V|) time to look over the dictionary to find the minimum node. So basically the total time to find the minimum node multiple times is O(|V|^2).
3. The time it takes to update the distance dictionary is O(|E|) because we only process each edge once. Since |E| is O(|V|^2) this is also O(|V|^2)
4. Keeping track of the parents is O(|V|)
5. Outputting the tree is O(|V| + |E|) = O(|E|) at worst
Adding all of these (none of them should be multiplied except within (2)) we get O(|V|^2)
The implementation with a heap is O(|E|log|V|) and it's very similar to the above. The only difference is that updating the distance is O(log|V|) instead of O(1) (because it's a heap), BUT finding/removing the min element is O(log|V|) instead of O(|V|) (because it's a heap). The time complexity analysis is quite similar and you end up with something like O(|V|log|V| + |E|log|V|) = O(|E|log|V|) as desired.
Actually... I'm a bit confused why the adjacency matrix implementation cares about it being an adjacency matrix. It could just as well be implemented using an adjacency list. I think the key part is how you store the distances. I could be way off in my implementation outlined above, but I am pretty sure it implements Prim's algorithm and satisfies the time complexity constraints outlined by Wikipedia.
I need to store data grouping nodes of a graph partition, something like:
[node1, node2] [node3] [node4, node5, node6]
My first idea was to have just a simple vector or array of ints, where the position in the array denotes the node_id and its value is some kind of group_id
The problem is many partition algorithms rely on operating on pairs of nodes within a group. With this method, I think I would waste a lot of computation searching through the vector to find out which nodes belong to the same group.
I could also store it as an STL set of sets, which seems closer to the mathematical definition of a partition, but I am getting the impression nested sets are not advised or are unnecessary, and I would need to modify the inner sets, which I am not sure is possible.
Any suggestions?
Depending on what exactly you want to do with the sets, you could try a disjoint set data structure. In this structure, each element has a method find that returns the "representative" of the set it belongs to.
A C++ implementation is available in Boost.
There are two good data structures that come to mind.
The first data structure (and one that's been mentioned here before) is the disjoint-set forest, which gives extraordinarily efficient implementations of "merge these two sets" and "what set is x in?". However, it does not support the operation of splitting groups apart from one another.
The other structure I'd recommend is a link/cut tree. This structure lets you build up partitions of a graph that can be joined together into trees. Unlike the disjoint set forest, the tree describing the partition can be cut into smaller trees, allowing you to break partitions into smaller groups. This structure is a bit less efficient than the union/find structure, but it still supports all operations in amortized O(lg n).
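To make the disjoint-set suggestion concrete, here is a minimal hand-rolled sketch (it is not the Boost implementation, and the class and method names are chosen for this example). It assumes nodes are numbered 0..n-1 and uses union by size with path halving. As noted above, it supports merging and membership queries, but not splitting a group.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

class DisjointSets {
public:
    // Starts with n singleton groups: {0}, {1}, ..., {n-1}.
    explicit DisjointSets(std::size_t n) : parent_(n), size_(n, 1) {
        std::iota(parent_.begin(), parent_.end(), std::size_t{0});
    }

    // Returns the representative of the group containing x.
    std::size_t find(std::size_t x) {
        assert(x < parent_.size());
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];  // path halving
            x = parent_[x];
        }
        return x;
    }

    // Merges the groups of a and b; returns false if they were already merged.
    bool unite(std::size_t a, std::size_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return false;
        if (size_[a] < size_[b]) std::swap(a, b);  // attach smaller under larger
        parent_[b] = a;
        size_[a] += size_[b];
        return true;
    }

    bool sameGroup(std::size_t a, std::size_t b) { return find(a) == find(b); }

private:
    std::vector<std::size_t> parent_;
    std::vector<std::size_t> size_;
};
```

With six nodes numbered from 0, calling unite(0, 1), unite(3, 4) and unite(4, 5) produces the partition from the question: [node1, node2] [node3] [node4, node5, node6].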
I have 100 sets of A objects, each set corresponding to a query point Qi, 1 <= i <= 100.
class A {
int id;
int distance;
float x;
float y;
};
In each iteration of my algorithm, I select one query point Qi and extract from the corresponding set the object having the minimum distance value. Then, I have to find this specific object in all 100 sets, searching with its id, and remove all those objects.
If I use a heap for each set of objects, it is cheap to extract the object with MIN(distance). However, I will not be able to find the same object in other heaps searching with the id, because the heap is organized with the distance value. Further, updating the heap is expensive.
Another option I have considered is using a map<id, (distance, x, y)> for each set. This way searching (find operation) by id is cheap. However, extracting the element with the minimum value takes linear time (it has to examine every element in the map).
Is there any data structure that I could use that is efficient for both the operations I need?
Thanks in advance!
std::map or boost::multi_index
You could use a tree map.
One simple approach is to have two maps for each data set. The first one contains all the data items sorted by id. The second would be a multimap and map distance to id so that you could easily figure out what id the lowest distance corresponds to. This one would be ordered by distance to make finding the min cheap (since it would use distance as the key). You could use map instead of multimap if you know that distances will always be unique.
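A minimal sketch of this two-map idea, for one set of objects, could look like the following. ObjectSet and its method names are invented for this example, and A is written as a struct with public fields so that the sketch can read them. Both operations are logarithmic in the set size, plus a short scan over objects that share the same distance.

```cpp
#include <cassert>
#include <map>
#include <utility>

struct A {
    int id;
    int distance;
    float x;
    float y;
};

class ObjectSet {
public:
    // Inserts the object unless its id is already present.
    void insert(const A& a) {
        auto res = byId_.emplace(a.id, a);
        if (res.second) byDistance_.emplace(a.distance, a.id);
    }

    bool empty() const { return byId_.empty(); }

    // Removes and returns the object with the smallest distance.
    A extractMin() {
        assert(!byId_.empty());
        auto first = byDistance_.begin();  // multimap is ordered by distance
        auto it = byId_.find(first->second);
        A result = it->second;
        byDistance_.erase(first);
        byId_.erase(it);
        return result;
    }

    // Removes the object with the given id; returns false if it is absent.
    bool eraseById(int id) {
        auto it = byId_.find(id);
        if (it == byId_.end()) return false;
        // Several objects may share a distance, so match on the id as well.
        auto range = byDistance_.equal_range(it->second.distance);
        for (auto d = range.first; d != range.second; ++d) {
            if (d->second == id) {
                byDistance_.erase(d);
                break;
            }
        }
        byId_.erase(it);
        return true;
    }

private:
    std::map<int, A> byId_;               // id -> object
    std::multimap<int, int> byDistance_;  // distance -> id
};
```

In each iteration you would call extractMin() on the set of the selected query point and then eraseById() with the returned id on the other sets.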
In addition to including a map, as suggested by many above, you could replace your min-heap with a structure whose extract-min runs in constant time. Your current version has a runtime complexity of O(log_2(n)) for extract-min.
If the range of your distances is small, you could use a "Dial array" algorithm. The array is indexed by key, as in counting sort. Because more than one item may fall into the same array slot, and you don't care about the order of items with equal values, you would use a doubly linked list as the array's item data type. The papers by Andrew Goldberg and Tarjan on faster Dijkstra's algorithms discuss this in more detail.
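A rough sketch of such a bucket array, combined with an id lookup, is shown below. It is an illustration under assumptions rather than the structure from the cited papers: distances are assumed to be integers in the range 0..maxDistance, ids are assumed to be unique within one queue, and BucketQueue with its method names is made up for this example. Note that extractMin may have to skip empty buckets, so it is constant time only when the cursor rarely needs to move; in the worst case it costs as many steps as there are buckets.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

class BucketQueue {
public:
    explicit BucketQueue(int maxDistance)
        : buckets_(static_cast<std::size_t>(maxDistance) + 1), minBucket_(0) {}

    // Precondition: 0 <= distance <= maxDistance and id not already present.
    void insert(int id, int distance) {
        assert(distance >= 0 &&
               static_cast<std::size_t>(distance) < buckets_.size());
        assert(where_.count(id) == 0);
        const std::size_t d = static_cast<std::size_t>(distance);
        buckets_[d].push_front(id);
        where_.emplace(id, Slot{d, buckets_[d].begin()});
        if (d < minBucket_) minBucket_ = d;  // keep cursor at or below the minimum
    }

    bool empty() const { return where_.empty(); }

    // Removes and returns the id of an item with the smallest distance.
    int extractMin() {
        assert(!where_.empty());
        while (buckets_[minBucket_].empty()) ++minBucket_;  // skip empty buckets
        const int id = buckets_[minBucket_].front();
        buckets_[minBucket_].pop_front();
        where_.erase(id);
        return id;
    }

    // Removes the item with the given id in O(1) expected time.
    bool eraseById(int id) {
        auto it = where_.find(id);
        if (it == where_.end()) return false;
        buckets_[it->second.bucket].erase(it->second.pos);  // list erase is O(1)
        where_.erase(it);
        return true;
    }

private:
    struct Slot {
        std::size_t bucket;             // which bucket holds the item
        std::list<int>::iterator pos;   // its position inside that bucket
    };
    std::vector<std::list<int>> buckets_;   // one doubly linked list per distance
    std::unordered_map<int, Slot> where_;   // id -> location of the item
    std::size_t minBucket_;                 // cursor: no item sits below this bucket
};
```

The doubly linked list is what makes removal by id cheap: the stored iterator points straight at the item, so no search inside the bucket is needed.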