Kruskal's algorithm explanation - c++

I was reading Wikipedia and found Kruskal's pseudocode given as the following:
KRUSKAL(G):
    foreach v ∈ G.V:
        MAKE_SET(v)
    G.E = sort(G.E)
    i = 0
    while (i != |V|-1):
        pick the next (u, v) edge from sorted list of edges G.E
        if (FIND_SET(u) != FIND_SET(v)):
            UNION(u, v)
            i = i + 1
I'm not quite sure what FIND_SET() does, and Wikipedia has the following description:
if that edge connects two different trees, then add it to the forest, combining two trees into a single tree.
So I guess it checks if two different trees are connected, but what does this really mean?

Initially, each vertex is in a set all by itself: There is a singleton set {v} for every vertex v. In the pseudo-code, these sets are the result of make_set(v).
For a given vertex v, the function find_set(v) gives you the set containing v.
The algorithm merges sets iteratively, so if {u}, {v} are singleton sets initially and there is an edge (u, v), then the algorithm replaces the two sets by their union {u, v}. Now both find_set(u) and find_set(v) will return that set.
The algorithm terminates after you've added |V| - 1 non-trivial edges, which is precisely the number of edges in a tree.
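As a concrete illustration, here is a minimal C++ sketch of these three operations (parent pointers with path compression; the struct and its names are my own, not part of the pseudocode):

#include <numeric>
#include <vector>

// Minimal disjoint-set (union-find) sketch: find_set returns a canonical
// representative of the set containing v, so two vertices are in the same
// set exactly when their representatives compare equal.
struct DisjointSets {
    std::vector<int> parent;

    explicit DisjointSets(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0); // MAKE_SET for every vertex
    }

    int find_set(int v) {
        if (parent[v] != v)
            parent[v] = find_set(parent[v]); // path compression
        return parent[v];
    }

    void union_sets(int u, int v) {
        parent[find_set(u)] = find_set(v); // merge the two sets
    }
};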

The find_set() is a common operation of a kind of data structure known as Union-Find. The idea of this data structure is to maintain disjoint sets (of vertices, in your example).
In this algorithm each set represents the vertices that are connected so far.
So when you call find_set() passing a vertex, you receive the element that represents that set of connected vertices.

FIND_SET(x) finds the set associated with vertex x, so that the comparison:
FIND_SET(u) != FIND_SET(v)
Ensures that u and v are not connected to the same thing. A useful way of thinking about it is that it finds the "values" of u and v where the values are in themselves sets.
The part about merging the forests has nothing to do with FIND_SET, but rather the next line:
UNION(u,v)
Which merges the two sets.

find_set(u) != find_set(v)
signifies the basic property of a spanning tree: it contains no cycles. If the two results are equal, adding the edge would create a cycle in the graph.
We basically build a forest (of minimum edge weights) with Kruskal's algorithm, and at each step check whether adding an edge would create a cycle.
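Putting it together, a sketch of Kruskal's main loop using the DisjointSets structure from the sketch above (the Edge struct and weight type are my own choices):

#include <algorithm>
#include <vector>

struct Edge { int u, v, w; };

// Kruskal's main loop: scan edges by increasing weight and keep an edge only
// if its endpoints are still in different sets, i.e. it closes no cycle.
std::vector<Edge> kruskal(int vertexCount, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    DisjointSets sets(vertexCount); // from the sketch above
    std::vector<Edge> tree;
    for (const Edge& e : edges) {
        if (sets.find_set(e.u) != sets.find_set(e.v)) {
            sets.union_sets(e.u, e.v);
            tree.push_back(e);
            if ((int)tree.size() == vertexCount - 1) break; // |V|-1 edges: done
        }
    }
    return tree;
}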
Hope it helps :)

Representation of a simple undirected graph

I need your expertise:
I am about to implement a graph class in C++ and am thinking about the right representation. The graphs are simple and undirected. The number of vertices is for now up to 1000, but maybe higher in the future. The number of edges is up to 200k, maybe higher. Each vertex has a color (int) and an id (int). Edges carry no information beyond the two vertices they connect.
I store the graph and just need to query whether x and y are connected or not; this I need very often.
After initialising I never remove or add vertices or edges (N = number of vertices and M = number of edges are given from the start)!
The one representation which is already available to me:
An adjacency list rolled out into just one long array, along with an array holding the starting index for each vertex. Storage is O(2M), and checking whether there is an edge between x and y takes O(M/N) on average (the average degree).
A representation I thought of:
The idea is, instead of rolling out the adjacency list into one array, to do the same with the adjacency matrix. So storage O(N^2)? Yes, but I want to store an edge in one bit instead of one byte (actually 2 bits, because of the symmetry).
Example: Say N=8; then create a vector<uint8_t> of length 8 (64 bits). Initialize each entry to 0. If there is an edge between vertex 3 and vertex 5, then add pow(2,5) to the entry of my vector belonging to vertex 3, and symmetrically. So there is a 1 in the entry of vertex 3 at the position of vertex 5 exactly when there is an edge between 3 and 5. After inserting my graph into this data structure I think one should be able to query adjacency in constant time with a single bitwise operation: are 3 and 5 connected? Yes if (v[3] & (1 << 5)) != 0. When there are more than 8 vertices, every vertex gets more than one entry in the vector, and I need one modulo and one division operation to access the correct spot.
What do you think of the second solution - is it maybe already known and in use?
Am I wrong in thinking the access is O(1)?
Is it too much effort for no real performance improvement?
The reason for rolling both representations out into one big list is cache efficiency, I was told.
I am happy to get some feedback on this idea. I might be way off - pls be kind in that case :D
A 1000x1000 matrix with 200,000 edges will be quite sparse. Since the graph is undirected, each edge is written twice into the matrix:
VertexA -> VertexB and VertexB -> VertexA
You will end up filling 40% of the matrix; the rest will be empty.
Edges
The best approach I can think of here is to use a 2D vector of booleans:
std::vector<std::vector<bool>> matrix(1000, std::vector<bool>(1000, false));
The lookup will take O(1) time, and std::vector<bool> saves space by using a single bit for each boolean value. You will end up using 1 Mbit, or ~125 kB, of memory.
The storage is not necessarily an array of bool values, but the library implementation may optimize storage so that each value is stored in a single bit.
This will allow you to check for an edge like this:
if( matrix[3][5] )
{
    // vertices 3 and 5 are connected
}
else
{
    // vertices 3 and 5 are not connected
}
Vertices
If the id values of the vertices form a continuous range of ints (e.g. 0,1,2,3,...,999) then you could store the color information in a std::vector<int> which has O(1) access time:
std::vector<int> colors(1000);
This would use up memory equal to:
1000 * sizeof(int) = 4000 B ~ 4 kB (3.9 kB)
On the other hand, if the id values don't form a continuous range of ints it might be a better idea to use a std::unordered_map<int, int> which will on average give you O(1) lookup time.
std::unordered_map<int, int> map;
So e.g. to store and look up the color of vertex 4:
map[4] = 5;          // assign color 5 to vertex 4
std::cout << map[4]; // prints 5
The amount of memory used by std::unordered_map<int, int> will be at least:
1000 * 2 * sizeof(int) = 8000 B ~ 8 kB (7.81 kB), plus some hashing overhead.
All together, for edges:

Type                            Memory   Access time
std::vector<std::vector<bool>>  125 kB   O(1)

and for vertices:

Type                            Memory   Access time
std::vector<int>                3.9 kB   O(1)
std::unordered_map<int, int>    7.8 kB   O(1) on avg.
If you go for a bit matrix then the memory usage is O(V^2): ~1M bits, or ~125 kB, of which slightly less than half are duplicates.
If instead you make an array of the edges, O(E), plus another array indexed by vertex pointing to its first edge, you use 200K * sizeof(int) or 800 KB, which is much more. Half of it is also duplicates (A-B and B-A are the same edge), which here actually could be saved; likewise, if you know (or can template your way out of it) that a vertex index fits in a uint16_t, half can be saved again.
To save the first half you just check which of the two vertices has the lower number and scan its edges.
To find out when to stop looking you use the index of the next vertex.
So with your numbers it is fine or even good to use a bit matrix.
The first problem comes when V^2/8 > E*4, though the binary search in the edge array would still be much slower than checking a bit. That occurs if we keep E = V * 200 (the 1000 vertices vs 200K edges ratio):
V*V/8 > V*200*4
V/8 > 200*4
V > 200*4*8 = 6400
At that point the bit matrix takes 6400^2/8 = 5,120,000 B ~ 5 MB, which easily fits into an L3 cache nowadays. If the connectivity (here: the average number of connections per vertex) is higher than 200, so much the better.
Checking the edge array also costs lg2(connectivity) * K (branch-mispredict penalty), which gets rather steep; checking the bit matrix is O(1).
You would need to measure, among other things, when the bit matrix significantly overflows L3 while the edge list still fits, and when either spills over into virtual memory.
In other words: with high connectivity the bit matrix should beat the edge list; with much lower connectivity, or a much higher number of vertices, the edge list might be faster.
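For reference, a minimal sketch of the flat bit-matrix idea with the bit arithmetic spelled out (the class name and layout are my own, assuming vertex ids 0..N-1):

#include <cstdint>
#include <vector>

// Flat adjacency bit matrix: bit (u * n + v) is set iff u and v are adjacent.
// Storage is n*n bits; the edge test is one shift and one AND.
class BitMatrix {
    std::size_t n;
    std::vector<std::uint64_t> bits;

public:
    explicit BitMatrix(std::size_t vertices)
        : n(vertices), bits((vertices * vertices + 63) / 64, 0) {}

    void add_edge(std::size_t u, std::size_t v) {
        set(u * n + v); // stored symmetrically: the graph is undirected
        set(v * n + u);
    }

    bool connected(std::size_t u, std::size_t v) const {
        std::size_t i = u * n + v;
        return (bits[i / 64] >> (i % 64)) & 1u;
    }

private:
    void set(std::size_t i) { bits[i / 64] |= std::uint64_t{1} << (i % 64); }
};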

Snake game - random number generator for food tiles

I am trying to make a 16x16 LED Snake game using Arduino (C++).
I need to assign a random grid index for the next food tile.
What I have is a list of indices that are occupied by the snake (snakeSquares).
So, my thought is that I need to generate a list of potential foodSquares. Then I can pick a random index from that list, and use the value there for my next food square.
I have some ideas for this but they seem kind of clunky, so I was looking for some feedback. I am using the Arduino LinkedList.h library for my lists in lieu of stdio.h (and random() in place of rand()):
Generate a list (foodSquares) containing the integers [0, 255] so that the indices correspond to the values in the list (I don't know of a quick way to do this, will probably need to use a for loop).
When generating list of snakeSquares, set foodSquares[i] = -1. Afterwards, loop through foodSquares and remove all elements that equal -1.
Then, generate a random number randNum from [0, foodSquares.size()-1] and make the next food square index equal to foodSquares[randNum].
So I guess my question is, will this approach work, and is there a better way to do it?
Thanks.
Potential approach that won't require more lists:
Calculate a random integer representing a number of steps.
Take the head or the tail as a starting tile.
For each step, move to a random free adjacent tile.
I couldn't completely understand your question, and some of those points waste processor time (i.e. points 1 and 2). But the first point could be solved quite easily in O(n) as follows:
for (uint16_t i = 0; i < 256; i++) { // uint8_t would wrap around and never reach 256
    // assuming there is a list of food_squares
    food_squares[i] = i;
}
As to your second point: why set every food_square to -1 at all? Anyway, a way you could implement this, as VTT has said and as I will describe further:
Take a random number in [0..255].
Is it one of the snake_squares? If so, go back to step one; otherwise, go to step three.
This is the same as your third point: use this random number to set the position of the food (food_square[random_number] = some_value).
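A minimal sketch of that rejection loop in Arduino-style C++ (isSnakeSquare is a hypothetical helper standing in for a lookup in your snakeSquares list):

// Rejection sampling: draw grid indices until one is not occupied by the snake.
// isSnakeSquare() is a hypothetical helper that scans the snakeSquares list.
int pickFoodSquare() {
    int candidate;
    do {
        candidate = random(0, 256); // Arduino random(): uniform in [0, 255]
    } while (isSnakeSquare(candidate));
    return candidate;
}

This stays fast while the snake covers well under the 256 tiles; only in the endgame, when most tiles are occupied, does the list-based approach win.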

How to find if there's a path between every pair of vertices of a given graph?

To find if there's a path between every pair of vertices in a directed graph I'm checking if all vertices can be visited from a specific vertex using DFS. The problem is that I have to do V DFS's where V is the number of vertices. (V can be up to 10^5). Is there a more efficient way to do this? Some pseudo-code or implementations would be appreciated.
Consider this graph: (1 -> 3), (2 -> 3), (3 -> 1)
There's no path from 1 to 2, but there is a path from 2 to 1 (2 -> 3 -> 1). So for this graph there is a path for every pair of vertices: a pair counts even if there is only a path (u -> v) and none (v -> u).
Have a look at Tarjan's algorithm for strongly connected components. If only one strongly connected component exists, this means there exists a path between every pair of vertices.
To figure this out, do a topological sort of the graph, and then traverse it in reverse pseudo-topological order. If you don't need to 'restart' the traversal, there exists a path between every pair of vertices.
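If all you need to verify is that a single strongly connected component exists, a Kosaraju-style check with two DFS passes is enough; this is my own illustration, not Tarjan's algorithm itself:

#include <vector>

// True iff every vertex is reachable from vertex 0 in the adjacency list.
static bool allReachable(const std::vector<std::vector<int>>& adj) {
    std::vector<bool> seen(adj.size(), false);
    std::vector<int> stack{0};
    seen[0] = true;
    std::size_t count = 1;
    while (!stack.empty()) {
        int u = stack.back();
        stack.pop_back();
        for (int v : adj[u])
            if (!seen[v]) { seen[v] = true; ++count; stack.push_back(v); }
    }
    return count == adj.size();
}

// Strongly connected iff vertex 0 reaches every vertex in G and every vertex
// reaches vertex 0, i.e. vertex 0 reaches every vertex in the reversed graph.
bool stronglyConnected(const std::vector<std::vector<int>>& adj) {
    if (adj.empty()) return true;
    if (!allReachable(adj)) return false;
    std::vector<std::vector<int>> rev(adj.size());
    for (std::size_t u = 0; u < adj.size(); ++u)
        for (int v : adj[u]) rev[v].push_back(u);
    return allReachable(rev);
}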
To find whether there is a single path that visits all vertices of a directed graph (a path that may visit vertices and edges multiple times):
Find the strongly connected components [SCC] of the graph.
Reduce the graph replacing each SCC with a single "pseudo-vertex" and include the edges connecting the SCCs.
The graph will now contain no cycles (since every cycle would have been part of an SCC), so it will be a directed acyclic graph (DAG), and there must be:
One-or-more root pseudo-vertices (with no inbound edges);
One-or-more leaf pseudo-vertices (with no outbound edges; which can be a leaf and a root vertex at the same time); and
Zero-or-more pseudo-vertices which are part of a branch (with both inbound and outbound edges).
Trivially, each leaf and branch pseudo-vertex will be reachable from some ancestor pseudo-vertex (since it has inbound edges), but it is not possible to reach one branch from a parallel branch, so we only need to consider whether the resulting graph is a simple path with no branches.
Count the number of root vertices:
If there is a single root pseudo-vertex (SCC) then any vertex contained in that SCC can reach all other vertices in the graph;
If there is more than 1 root pseudo-vertex then there is no vertex that has a path to all other vertices (since you cannot reach one root from a different root).
If the singular root pseudo-vertex and each subsequent descendant pseudo-vertex has only a single outbound edge (i.e. there are no branches) until the leaf is reached, then the resulting graph contains a path that can reach all vertices.
Examples:
After reducing the SCCs to pseudo-vertices if the graph is of the form:
(1) -> (2) -> ... (n-1) -> (n)
Then there is a path that can visit all vertices.
If it is of the form:
(1_a) --\
+--> (2) -> ... (n-1) -> (n)
(1_b) --/
Then the vertex (1_a) is not reachable from (1_b) and vice-versa so there is no path that can reach all vertices.
Similarly:
/-> (n_a)
(1) -> (2) -> ... -+
\-> n_b
Then the vertex (n_a) is not reachable from (n_b) and vice-versa so there is no path that can reach all vertices.
And finally, if it is the form:
/-> (x_a) -\
(1) -> (2) -> ... -+ +-> ... (n-1) -> (n)
\-> (x_b) -/
Then there is no path that can reach both (x_a) and (x_b).
I don't know why some answers seem so unnecessarily complicated. In fact, once the strongly connected components are condensed the graph is a DAG, and you can just take its topological sort and check whether there is an edge connecting each node to the next one in the order.
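A minimal sketch of that check, assuming the condensed graph is already given as the adjacency list of a DAG:

#include <algorithm>
#include <queue>
#include <vector>

// Kahn's algorithm gives a topological order; the DAG has a single path that
// visits every node iff each node in the order has an edge to its successor.
bool hasHamiltonianPath(const std::vector<std::vector<int>>& dag) {
    std::vector<int> indeg(dag.size(), 0), order;
    for (const auto& out : dag)
        for (int v : out) ++indeg[v];
    std::queue<int> q;
    for (std::size_t u = 0; u < dag.size(); ++u)
        if (indeg[u] == 0) q.push(static_cast<int>(u));
    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);
        for (int v : dag[u])
            if (--indeg[v] == 0) q.push(v);
    }
    if (order.size() != dag.size()) return false; // cycle: input was not a DAG
    for (std::size_t i = 0; i + 1 < order.size(); ++i) {
        const auto& out = dag[order[i]];
        if (std::find(out.begin(), out.end(), order[i + 1]) == out.end())
            return false; // consecutive nodes not connected: no single path
    }
    return true;
}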

Algorithm to produce a difference of two collections of intervals

Problem
Suppose I have two collections of intervals, named A and B. How would I find a difference (a relative complement) in a most time- and memory-efficient way?
Picture for illustration:
Interval endpoints are integers (≤ 2^128 − 1) and the intervals are always both 2^n long and aligned on the m·2^n lattice (so you can make a binary tree out of them).
Intervals can overlap in the input, but this does not affect the output (the result, if flattened, would be the same).
The problem is that there are MANY intervals in both collections (up to 100,000,000), so naïve implementations will probably be slow.
The input is read from two files and it is sorted in such a way that smaller sub-intervals (if overlapping) come immediately after their parents in order of size. For example:
[0,7]
[0,3]
[4,7]
[4,5]
[8,15]
...
What have I tried?
So far, I've been working on an implementation that generates a binary search tree and, while doing so, aggregates neighbouring intervals ([0,3],[4,7] => [0,7]) from both collections; it then traverses the second tree and "bumps out" the intervals that are present in both (subdividing the larger intervals in the first tree if necessary).
While this appears to be working for small collections, it requires more and more RAM to hold the tree itself, not to mention the time it needs to complete the insertion and removal from the tree.
I figured that since intervals come pre-sorted, I could use some dynamic algorithm and finish in one pass. I am not sure if this is possible, however.
So, how would I go about solving this problem in an efficient way?
Disclaimer: This is not a homework but a modification/generalization of an actual real-life problem I am facing. I am programming in C++ but I can accept an algorithm in any [imperative] language.
Recall one of the first programming exercises we all had back in school - writing a calculator program. Taking an arithmetic expression from the input line, parsing it and evaluating. Remember keeping track of the parentheses depth? So here we go.
Analogy: interval start points are opening parentheses, end points are closing parentheses. We keep track of the parentheses depth (nesting). A depth of two means an intersection of intervals; a depth of one means a difference.
Algorithm:
No need to distinguish between A and B, just sort all start points and end points in the ascending order
Set the parentheses depth counter to zero
Iterate through the endpoints, starting from the smallest one. If it is a start point, increment the depth counter; if it is an end point, decrement the counter.
Keep track of the intervals where the depth is 1; those are the intervals of the A and B difference (strictly, the symmetric difference; to get the relative complement A∖B, also note which collection each depth-1 region belongs to). The intervals where the depth is two are the A∩B intersections.
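A minimal sketch of that counter-based sweep, assuming both collections are already flattened (no overlaps within A or within B) and treating intervals as half-open [start, end) for simplicity; it tracks A's and B's depths separately so it can emit the relative complement A∖B:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Interval = std::pair<std::uint64_t, std::uint64_t>; // half-open [start, end)

// Parenthesis-depth sweep over both collections at once; emits the regions
// where A covers the point (depth 1 in A) and B does not (depth 0 in B).
std::vector<Interval> difference(const std::vector<Interval>& a,
                                 const std::vector<Interval>& b) {
    struct Event { std::uint64_t x; int da, db; };
    std::vector<Event> ev;
    for (auto [s, e] : a) { ev.push_back({s, +1, 0}); ev.push_back({e, -1, 0}); }
    for (auto [s, e] : b) { ev.push_back({s, 0, +1}); ev.push_back({e, 0, -1}); }
    std::sort(ev.begin(), ev.end(),
              [](const Event& l, const Event& r) { return l.x < r.x; });

    std::vector<Interval> out;
    int depthA = 0, depthB = 0;
    std::uint64_t openAt = 0;
    bool open = false;
    for (std::size_t i = 0; i < ev.size();) {
        const std::uint64_t x = ev[i].x;
        for (; i < ev.size() && ev[i].x == x; ++i) { // apply all events at x
            depthA += ev[i].da;
            depthB += ev[i].db;
        }
        const bool inDiff = (depthA > 0 && depthB == 0);
        if (inDiff && !open) { openAt = x; open = true; }
        else if (!inDiff && open) { out.push_back({openAt, x}); open = false; }
    }
    return out;
}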
Your intervals are sorted which is great. You can do this in linear time with almost no memory.
Start by "flattening" your two sets. That is for set A, start from the lowest interval, and combine any overlapping intervals until you have an interval set that has no overlaps. Then do that for B.
Now take your two sets and start with the first two intervals. We'll call these the interval indices for A and B, Ai and Bi.
Ai indexes the first interval in A, Bi the first interval in B.
While there are intervals to process do the following:
Consider the start points of both intervals. Are they the same? If so, advance the start point of both intervals to the end point of the smaller interval and emit nothing to your output; then advance the index of the smaller interval to the next interval (that is, if Ai ends before Bi, then Ai advances to the next interval). If both intervals end in the same place, advance both Ai and Bi and emit nothing.
Is one start point earlier than the other? If so, emit the interval from the earlier start point to either a) the start of the later interval, or b) the end of the earlier interval, whichever comes first. If you chose option b, advance the index of the earlier interval.
So for example if the interval at Ai starts first, you emit the interval from the start of Ai to the start of Bi, or to the end of Ai, whichever is smaller. If Ai ended before the start of Bi, you advance Ai.
Repeat until all intervals are consumed.
P.S. I assume you don't have spare memory to flatten the two interval sets into separate buffers. Do this with two functions: a "get next interval" function that advances the interval indices and flattens as necessary, feeding flattened data to the differencing function.
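A minimal sketch of the flattening step itself, assuming intervals arrive sorted by start point (as in the question's input):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Interval = std::pair<std::uint64_t, std::uint64_t>; // closed [start, end]

// Merge overlapping or touching intervals in one pass. Assumes the input is
// sorted by start point; the output is non-overlapping and still sorted.
std::vector<Interval> flatten(const std::vector<Interval>& sorted) {
    std::vector<Interval> out;
    for (auto [s, e] : sorted) {
        if (!out.empty() && s <= out.back().second + 1)
            out.back().second = std::max(out.back().second, e); // extend last
        else
            out.push_back({s, e}); // gap: start a new interval
    }
    return out;
}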
What you are looking for is a Sweep line algorithm.
Simple logic tells you when the sweep line is intersecting an interval in both A and B and where it intersects only one set.
This is very similar to this problem. Just consider that you have a set of vertical lines passing through the end points of the B's segments.
This algorithm's complexity is O((m+n) log(m+n)), which is the cost of the initial sort; the sweep itself takes O(m+n) on the sorted set.
I think you should use boost.icl (Interval Container Library)
http://www.boost.org/doc/libs/1_50_0/libs/icl/doc/html/index.html
#include <iostream>
#include <boost/icl/interval_set.hpp>

using namespace boost::icl;

int main()
{
    typedef interval_set<int> TIntervalSet;

    TIntervalSet intSetA;
    TIntervalSet intSetB;

    intSetA += discrete_interval<int>::closed( 0, 2);
    intSetA += discrete_interval<int>::closed( 9,15);
    intSetA += discrete_interval<int>::closed(12,15);

    intSetB += discrete_interval<int>::closed( 1, 2);
    intSetB += discrete_interval<int>::closed( 4, 7);
    intSetB += discrete_interval<int>::closed( 9,10);
    intSetB += discrete_interval<int>::closed(12,13);

    std::cout << intSetA << std::endl;
    std::cout << intSetB << std::endl;
    std::cout << intSetA - intSetB << std::endl;

    return 0;
}
This prints:
{[0,2][9,15]}
{[1,2][4,7][9,10][12,13]}
{[0,1)(10,12)(13,15]}

Extracting segments from a list of 8-connected pixels

Current situation: I'm trying to extract segments from an image. Thanks to openCV's findContours() method, I now have a list of 8-connected point for every contours. However, these lists are not directly usable, because they contain a lot of duplicates.
The problem: Given a list of 8-connected points, which can contain duplicates, extract segments from it.
Possible solutions:
At first, I used openCV's approxPolyDP() method. However, the results are pretty bad... Here are the zoomed contours:
Here is the result of approxPolyDP(): (9 segments! Some overlap)
but what I want is more like:
It's bad because approxPolyDP() can convert something that "looks like several segments" into "several segments". However, what I have is a list of points that tends to iterate several times over itself.
For example, if my points are:
0 1 2 3 4 5 6 7 8
9
Then the list of points will be 0 1 2 3 4 5 6 7 8 7 6 5 4 3 2 1 9... And if the number of points becomes large (>100), the segments extracted by approxPolyDP() are unfortunately not duplicates (i.e. they overlap each other but are not strictly equal, so I can't just say "remove duplicates", as opposed to pixels for example).
Perhaps I've got a solution, but it's pretty long (though interesting). First of all, for each 8-connected list, I create a sparse matrix (for efficiency) and set the matrix values to 1 if the pixel belongs to the list. Then I create a graph, with nodes corresponding to pixels and edges between neighbouring pixels. This also means that I add all the missing edges between pixels (small complexity, possible because of the sparse matrix). Then I remove all possible "squares" (4 neighbouring nodes), which is possible because I am already working on pretty thin contours. Then I can launch a minimal spanning tree algorithm, and finally I can approximate every branch of the tree with openCV's approxPolyDP()
To sum up: I've got a tedious method, that I've not yet implemented as it seems error-prone. However, I ask you, people at Stack Overflow: are there other existing methods, possibly with good implementations?
Edit: To clarify: once I have a tree, I can extract "branches" (branches start at leaves or at nodes linked to 3 or more other nodes). The algorithm in openCV's approxPolyDP() is the Ramer–Douglas–Peucker algorithm, and here is the Wikipedia picture of what it does:
With this picture, it is easy to understand why it fails when points may be duplicates of each other
Another edit: In my method, there is something that may be interesting to note. When you consider points located in a grid (like pixels), then generally, the minimal spanning tree algorithm is not useful because there are many possible minimal trees
X-X-X-X
|
X-X-X-X
is fundamentally very different from
X-X-X-X
| | | |
X X X X
but both are minimal spanning trees
However, in my case, my nodes rarely form clusters because they are supposed to be contours, and a thinning algorithm has already run beforehand inside findContours().
Answer to Tomalak's comment:
If the DP algorithm returned 4 segments (the segment from point 2 to the center being there twice) I would be happy! Of course, with good parameters, I can get to a state where "by chance" I have identical segments, and I can remove duplicates. However, clearly, the algorithm is not designed for that.
Here is a real example with far too many segments:
Using Mathematica 8, I created a morphological graph from the list of white pixels in the image. It is working fine on your first image:
Create the morphological graph:
graph = MorphologicalGraph[binaryimage];
Then you can query the graph properties that are of interest to you.
This gives the names of the vertices in the graph:
vertex = VertexList[graph]
The list of the edges:
EdgeList[graph]
And this gives the positions of the vertices:
pos = PropertyValue[{graph, #}, VertexCoordinates] & /@ vertex
This is what the results look like for the first image:
In[21]:= vertex = VertexList[graph]
Out[21]= {1, 3, 2, 4, 5, 6, 7, 9, 8, 10}
In[22]:= EdgeList[graph]
Out[22]= {1 \[UndirectedEdge] 3, 2 \[UndirectedEdge] 4, 3 \[UndirectedEdge] 4,
3 \[UndirectedEdge] 5, 4 \[UndirectedEdge] 6, 6 \[UndirectedEdge] 7,
6 \[UndirectedEdge] 9, 8 \[UndirectedEdge] 9, 9 \[UndirectedEdge] 10}
In[26]:= pos = PropertyValue[{graph, #}, VertexCoordinates] & /@ vertex
Out[26]= {{54.5, 191.5}, {98.5, 149.5}, {42.5, 185.5},
{91.5, 138.5}, {132.5, 119.5}, {157.5, 72.5},
{168.5, 65.5}, {125.5, 52.5}, {114.5, 53.5},
{120.5, 29.5}}
Given the documentation, http://reference.wolfram.com/mathematica/ref/MorphologicalGraph.html, the command MorphologicalGraph first computes the skeleton by morphological thinning:
skeleton = Thinning[binaryimage, Method -> "Morphological"]
Then the vertices are detected; they are the branch points and the end points:
verteximage = ImageAdd[
MorphologicalTransform[skeleton, "SkeletonEndPoints"],
MorphologicalTransform[skeleton, "SkeletonBranchPoints"]]
And then the vertices are linked after an analysis of their connectivity.
For example, one could start by breaking the structure around the vertex and then look for the connected components, revealing the edges of the graph:
comp = MorphologicalComponents[
  ImageSubtract[
    skeleton,
    Dilation[verteximage, CrossMatrix[1]]]];
Colorize[comp]
The devil is in the details, but that sounds like a solid starting point if you wish to develop your own implementation.
Try mathematical morphology. First you need to dilate or close your image to fill holes.
cvDilate(pimg, pimg, NULL, 3);
cvErode(pimg, pimg, NULL);
I got this image
The next step should be applying thinning algorithm. Unfortunately it's not implemented in OpenCV (MATLAB has bwmorph with thin argument). For example with MATLAB I refined the image to this one:
However OpenCV has all needed basic morphological operations to implement thinning (cvMorphologyEx, cvCreateStructuringElementEx, etc).
Another idea.
They say that the distance transform is very useful in such tasks. Maybe so.
Consider the cvDistTransform function. It creates an image like this:
Then using something like cvAdaptiveThreshold:
That's a skeleton. I guess you can iterate over all connected white pixels, find curves, and filter out small segments.
I've implemented a similar algorithm before, and I did it in a sort of incremental least-squares fashion. It worked fairly well. The pseudocode is somewhat like:
L = empty set of line segments
for each white pixel p
    line = new line containing only p
    C = empty set of points
    P = set of all neighboring pixels of p
    while P is not empty
        n = first point in P
        add n to C
        remove n from P
        line' = line with n added to it
        perform a least squares fit of line'
        if MSE(line') < max_mse and d(line, n) < max_distance
            line = line'
            add all neighbors of n that are not in C to P
    if size(line) > min_num_points
        add line to L
where MSE(line) is the mean-square error of the line (the sum, over all points in the line, of the squared distance to the best-fitting line) and d(line, n) is the distance from point n to the line. Good values for max_distance seem to be a pixel or so; max_mse seems to be much less and will depend on the average size of the line segments in your image. 0.1 or 0.2 pixels have worked for me in fairly large images.
I had been using this on actual images pre-processed with the Canny operator, so the only results I have are of that. Here's the result of the above algorithm on an image:
It's possible to make the algorithm fast, too. The C++ implementation I have (closed source enforced by my job, sorry, else I would give it to you) processed the above image in about 20 milliseconds. That includes application of the Canny operator for edge detection, so it should be even faster in your case.
You can start by extracting straight lines from your contours image using HoughLinesP, which is provided with openCV:
HoughLinesP(InputArray image, OutputArray lines, double rho, double theta, int threshold, double minLineLength = 0, double maxLineGap = 0)
If you choose threshold = 1 and minLineLength small, you can even obtain all single elements. Be careful though, since it yields many results when you have many edge pixels.
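A minimal usage sketch (the file name and parameter values are placeholders chosen for illustration):

#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

int main() {
    // Load the binary contour image (non-zero pixels are edge pixels).
    cv::Mat edges = cv::imread("contours.png", cv::IMREAD_GRAYSCALE);

    // Each detected segment is stored as (x1, y1, x2, y2).
    std::vector<cv::Vec4i> lines;
    cv::HoughLinesP(edges, lines,
                    1,           // rho: distance resolution, in pixels
                    CV_PI / 180, // theta: angle resolution, in radians
                    10,          // threshold: minimum votes for a line
                    5,           // minLineLength
                    2);          // maxLineGap
    for (const cv::Vec4i& l : lines) {
        // ... use the segment from (l[0], l[1]) to (l[2], l[3])
    }
    return 0;
}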