A node with massive degree in a graph makes taking distinct() edges difficult - mapreduce

I have a graph in which around 75% of the connectivity comes from a single node,
e.g. if the sum of the degrees of all nodes is 100, this node's degree is 75.
After some manipulations,
a massive number of duplicate edges involving this node exist.
Assume node 1 is this kind of node:
1,2
1,2
1,2
1,2
1,2
1,2
1,3
1,3
1,3
However, there are too many duplicate keys for distinct() to handle when deduplicating the edges.
I have tried repartitioning before calling distinct(), but it still fails because of the duplicate keys. For now, writing the edges to disk and then calling distinct() works around the problem.
Is there a better way to handle this kind of extreme skew?
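One possible way to reduce the skew, sketched below in plain Python rather than in any particular MapReduce framework, is to deduplicate inside each partition first (a combiner-style step) and to shuffle on the whole (src, dst) pair instead of on the source node alone, so that the edges of node 1 are spread over several reducers. The function names and the partition layout are hypothetical, and whether this helps in practice depends on how the framework's distinct() is implemented.

```python
from collections import defaultdict

def local_distinct(partition):
    """Map-side step: drop duplicates inside one partition before any shuffle."""
    return set(partition)

def shuffle_by_edge(partitions, num_reducers):
    """Route each edge by a hash of the whole (src, dst) pair, not of src alone."""
    buckets = defaultdict(set)
    for part in partitions:
        for edge in local_distinct(part):
            # Each bucket is a set, so the reducer-side dedup happens here.
            buckets[hash(edge) % num_reducers].add(edge)
    return buckets

def distinct_edges(partitions, num_reducers=4):
    """Two-stage distinct: local dedup, then dedup per reducer bucket."""
    buckets = shuffle_by_edge(partitions, num_reducers)
    return sorted(edge for bucket in buckets.values() for edge in bucket)

# Two input partitions holding the duplicated edges from the question.
partitions = [
    [(1, 2)] * 3 + [(1, 3)],
    [(1, 2)] * 3 + [(1, 3)] * 2,
]
print(distinct_edges(partitions))  # [(1, 2), (1, 3)]
```

After the local step, each partition sends at most one copy of every edge, so the number of records shuffled for node 1 is bounded by its number of distinct neighbours times the number of partitions.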

Related

MST from an array of 2D triangles, but with a little twist

Here's an illustration of the steps taken thus far:
1. Pseudo-random rectangle generation
2. "Central node" insertion, rect separation, and final node selections
3. Delaunay triangulation (shown with previously selected nodes)
4. Rendering of triangle edges
At this point (Step 5), I would like to use this data to form a Minimum Spanning Tree, but there's a slight catch...
Somewhere in the graph (likely near the center, but not always) will be a node that requires between 3 and 5 connections to it from other unique nodes. This complicates things, since every other node should only contain a single connection, and the data structures being used make it difficult to determine "what's connected to what" in a solid, traversable format.
So, given an array of triangles in the above format, and a random vertex to use as the "root node", how could I properly traverse the network to create an MST where there are at least 3 connections to our "central node", but no more than 5 connections to it? Is this possible?
Since it's rare to have a vertex in a Delaunay triangulation have much more than 6 edges, you can use brute force: there are only 20+15+6 ways to select 3, 4, or 5 edges out of 6 (respectively), so try all of them. For each of the 41 (up to 336 for degree 9) small trees (the root and a few edges) thus created, run either Kruskal's algorithm or Prim's algorithm starting with that tree already "found" to be part of the MST. (Ignore the root's other edges so as not to increase its degree further.) Then just pick the best one (including the cost of the seed tree).
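The brute-force idea above could be sketched roughly as follows. This is only an illustration under assumptions not in the original: vertices are numbered 0..n-1, edges are (weight, u, v) tuples in a simple graph, Kruskal's algorithm with a small union-find is used, and the name constrained_mst is hypothetical.

```python
from itertools import combinations

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def constrained_mst(n, edges, root, lo=3, hi=5):
    """Cheapest spanning tree in which `root` has between lo and hi edges.

    edges: list of (weight, u, v). Returns (cost, tree_edges), or None if no
    such tree exists.
    """
    root_edges = [e for e in edges if root in (e[1], e[2])]
    # The root's other edges are ignored so its degree cannot grow further.
    other_edges = sorted(e for e in edges if root not in (e[1], e[2]))
    best = None
    for k in range(lo, min(hi, len(root_edges)) + 1):
        for seed in combinations(root_edges, k):
            parent = list(range(n))
            tree = list(seed)
            for _, u, v in seed:  # the seed tree counts as already found
                parent[find(parent, u)] = find(parent, v)
            for w, u, v in other_edges:  # ordinary Kruskal on the rest
                ru, rv = find(parent, u), find(parent, v)
                if ru != rv:
                    parent[ru] = rv
                    tree.append((w, u, v))
            if len(tree) == n - 1:  # only keep seeds that span every vertex
                cost = sum(w for w, _, _ in tree)
                if best is None or cost < best[0]:
                    best = (cost, tree)
    return best
```

The check on len(tree) matters because dropping the root's unused edges can leave some vertices unreachable for a given seed; such seeds are simply skipped.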
As for the general neighbor information problem, it seems you just need to build a standard graph representation first. For example, you can make an adjacency list by scanning over all your Edge objects and storing each endpoint in a list associated with the other (in a map<Vector2<T>,vector<Vector2<T>>> or an equivalent based on whatever identifiers you use for your vertices).
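As a rough Python equivalent of that adjacency map, assuming each triangle is given as a tuple of three hashable vertices such as (x, y) pairs (the function name is hypothetical):

```python
from collections import defaultdict

def adjacency_from_triangles(triangles):
    """Map each vertex to the set of vertices it shares a triangle edge with."""
    adjacency = defaultdict(set)
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            # Sets absorb the duplicate edges shared by neighbouring triangles.
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency
```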
I've taken a workaround approach...
After step 3 of my algorithm, I simply remove all edges which connect to the "central node", keeping track of which edges form the "ring" (aka "edge loop") around it, and run the MST on all remaining edges.
For the MST, I went with the boost graph library.
That made it easy to loop through the triangles I had, adding each of its three edges to an adjacency_list. Then a simple call to whichever boost-provided MST algorithm took care of the rest.
Finally, I re-add the edges that were previously taken out. The shortest path is whatever it was in the previous step, plus the length of the shortest re-added edge that connects the "central node" to the "ring".
I can then add (or remove) an arbitrary number of the previous edges to ensure there are between 3 and 5 edges connecting the edge loop to the "central node".
Doing things in this order also lets me know as early as step 3 whether we'll even have a valid number of edges, so we don't waste cycles running an MST.

Is there any way to quickly find all the edges that are part of a cycle (back edges) in an undirected/directed graph?

I've got a minimum spanning tree. I add an edge to it, so a cycle is certainly formed. I need to find all the edges that are part of that cycle, i.e. all the back edges. How quickly can that be done? My solution:
For example, if the new edge is (1,4), insert 4 into Adj(1) at every possible position and run DFS each time. E.g. if Adj(1) had 2,3,5, first add 4 before 2 and run DFS. I'll get a back edge. Then add 4 between 2 and 3 and run DFS. I get another back edge. Then between 3 and 5, and so on. Is there any faster way to do this?
In a tree you have a single (simple) path between any pair of vertices. If you are adding an edge (i,j), first find the path in the tree between i and j, and then you will have your cycle: it consists of all the vertices on that path (which turns into a cycle once you add (i,j) as an edge).
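A minimal sketch of this path-based approach, assuming the tree is stored as a dict of adjacency lists and is connected (the function name is hypothetical):

```python
from collections import deque

def cycle_after_adding(tree_adj, i, j):
    """Edges of the unique cycle created by adding edge (i, j) to a tree."""
    parent = {i: None}
    queue = deque([i])
    while queue:  # breadth-first search from i until j is reached
        u = queue.popleft()
        if u == j:
            break
        for v in tree_adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path = [j]  # walk the parent pointers back from j to i
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    cycle = list(zip(path, path[1:]))
    cycle.append((i, j))  # the new edge closes the cycle
    return cycle

tree = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1]}
print(cycle_after_adding(tree, 1, 4))  # [(4, 3), (3, 2), (2, 1), (1, 4)]
```

This takes a single traversal of the tree, so it runs in time linear in the number of vertices instead of one DFS per position in the adjacency list.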
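For a directed graph, you are looking for the strongly connected components of the graph, which can be found using Tarjan's algorithm (among others). For an undirected graph, the edges that lie on a cycle are exactly those that are not bridges.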

Minimal spanning tree using prim's algorithm, don't know what is wrong

First of all I'll state that I'm not asking for any code or complete solutions.
I'll describe the problem:
You are given number of rooms in a building and number of hallways between them. Every hallway connects 2 rooms and is given a weight. It is always possible to get to any room. You are supposed to reduce the complete weight of all hallways by removing them, printing out the weight reduced.
Are these assumptions correct?
The building is a graph, rooms are vertices, hallways are edges connecting them. Therefore this is an undirected connected graph.
You can solve this by getting the weight of graph's minimal spanning tree, then doing complete weight minus the weight of MST - the result is sum of weights of hallways that can be removed.
I have implemented Prim's algorithm for the MST and the result is correct for the example case and for all other MST cases that I found on the internet. However, the grading server still gives me "wrong answer" with no other information. I don't know what's wrong. There are no more than 100 vertices and 5000 edges in the input, so the ranges should not be a problem. The weights are integers <=200. I'm using an adjacency matrix for the MST. Example input:
5 7
1 2 50
2 3 40
3 4 20
4 5 10
1 4 40
3 5 30
In this case the program prints 80. The complete weight is 190 and the minimal weight is 110, so we can remove 190 - 110 = 80.
My questions are:
Are there any obvious mistakes that come to mind? Things to watch out for, reasons why it would work for the example input but not for others, etc.
Are there any medium sized test cases for MST on the internet that I could use to find the problem?
Is there any other way to solve this problem? I would happily try anything with the grading server.
I'm completely new to graphs so I may be missing something.
The building is a graph, rooms are vertices, hallways are edges connecting them. Therefore this is an undirected connected graph.
You can solve this by getting the weight of graph's minimal spanning tree, then doing complete weight minus the weight of MST - the result is sum of weights of hallways that can be removed.
Yes, both of these are correct (modulo the nitpick that the building is not a graph with rooms as vertices and hallways as edges, but can be viewed as such). And if you view it thus, the difference between the total weight of the original graph and the total weight of a minimum spanning tree is the maximal possible reduction of weight without making some rooms unreachable from others (i.e. making the graph disconnected).
So I see two possibilities:
1. You have a subtle bug in your implementation of Prim's algorithm that is triggered by a testcase on the grading server but not by the testcases you checked.
2. The grading server has a wrong answer.
Without any further information, I would consider 1 more likely.
Is there any other way to solve this problem? I would happily try anything with the grading server.
Since you need to find the weight of an MST, I don't see how you could do it without finding an MST. So the other ways are different algorithms to find an MST. Kruskal's algorithm comes to mind.
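A minimal Kruskal sketch for cross-checking a Prim implementation, assuming rooms are numbered 1..num_rooms and hallways are (a, b, weight) tuples (the function name is hypothetical). Because it works on an edge list, it also copes with several hallways between the same pair of rooms, which an adjacency matrix would overwrite unless it keeps the minimum; whether the grader's input contains such cases is only a guess.

```python
def removable_weight(num_rooms, hallways):
    """Total hallway weight minus the weight of a minimum spanning tree."""
    parent = list(range(num_rooms + 1))  # index 0 is unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total = sum(weight for _, _, weight in hallways)
    mst = 0
    for a, b, weight in sorted(hallways, key=lambda h: h[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the hallway joins two separate groups of rooms
            parent[root_a] = root_b
            mst += weight
    return total - mst

# The six hallways listed in the example input.
example = [(1, 2, 50), (2, 3, 40), (3, 4, 20), (4, 5, 10), (1, 4, 40), (3, 5, 30)]
print(removable_weight(5, example))  # 80
```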

Efficient algorithm to deal with big-data network files for computing n nearest nodes

Problem:
I have two network files (say NET1 and NET2). Each has a set of nodes, with a unique ID and geographic X and Y coordinates for each node. Each node in NET2 is to have n connections to NET1, and the IDs of those n nodes will be determined by the minimum straight-line distance. The output will have three fields: the ID of the node in NET1, the ID of the node in NET2, and the distance between them. All the files are in tab-delimited format.
One way forward..
One way to implement this is, for each node in NET2, to loop through every node in NET1 and compute all NET1-NET2 distance combinations, then sort them by NET2 node ID and by distance and write out the first four records for each node. But the problem is that there are close to 2 million nodes in NET1 and 2000 nodes in NET2. That is 4 billion distances to be calculated and written in the first step of this algorithm... and the runtime is quite forbidding!
Request:
I was curious whether any of you folks out there have faced a similar issue. I would love to hear from y'all about any algorithms and data structures that can be used to speed up the processing. I know that the scope of this question is very broad, but I hope someone can point me the right way, as I have very limited experience optimizing code for data of this scale.
Languages:
I am trying in C++, Python and R.
Please pitch in with ideas! Help greatly appreciated!
A kd-tree is one option. It allows you to find the nearest neighbor (or a set of nearest neighbors) in reasonable time. Of course, you have to build the tree at the beginning, and that takes some time. But generally, a kd-tree is suitable if you don't have to add or remove nodes at runtime, which seems to be your case. It also performs better in lower dimensions (in your case the dimension is 2).
Another possible data structure is an octree (a quadtree in 2D). It's a simpler data structure (quite easy to implement), but a kd-tree can be more efficient.
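To make the kd-tree suggestion concrete, here is a small pure-Python sketch for the 2D case. It assumes each NET1 node is an (x, y, node_id) tuple; build_kdtree and k_nearest are hypothetical names, and the tree is built once over NET1 and then queried once per NET2 node.

```python
import heapq

def build_kdtree(points, depth=0):
    """points: list of (x, y, node_id). Returns (point, left, right) or None."""
    if not points:
        return None
    axis = depth % 2  # alternate between splitting on x and on y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def k_nearest(tree, query, k):
    """Up to k (distance, node_id) pairs closest to query=(x, y), nearest first."""
    heap = []  # max-heap on squared distance, stored negated

    def visit(node, depth):
        if node is None:
            return
        point, left, right = node
        d2 = (point[0] - query[0]) ** 2 + (point[1] - query[1]) ** 2
        if len(heap) < k:
            heapq.heappush(heap, (-d2, point[2]))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, point[2]))
        diff = query[depth % 2] - point[depth % 2]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near, depth + 1)
        # Cross the splitting line only if a closer point could lie beyond it.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far, depth + 1)

    visit(tree, 0)
    return sorted(((-neg_d2) ** 0.5, node_id) for neg_d2, node_id in heap)
```

For 2 million nodes a compiled implementation would likely be preferable to this sketch; for instance, SciPy provides scipy.spatial.cKDTree, whose query method accepts a k argument.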

How to find closed loops in graph networks

I have an undirected graph network made of streets and crossings, and I would like to know if there is any algorithm to help me find closed loops, i.e. places where I can put buildings.
Any help appreciated, thanks!
Based on comments to my earlier answer:
It seems the graphs are all undirected and planar, i.e. can be embedded in a 2D plane without crossing edges, and one such embedding is given. This embedding will partition the plane. E.g. a figure 8 partitions the plane in three: two "inner" areas and an infinite outer area. An alternative view is that all edges of a node are cyclically ordered. (This is the essential part that allows us to apply graph theory)
A partition is necessarily enclosed by a cycle, but not all cycles may partition a single area. In the trivial case of a figure 8, though, all three areas are directly associated with a distinct cycle.
The input graph can generally be simplified. Some nodes may have only a single edge; they can't contribute to the partitioning and can be removed along with the edge. Other nodes have two edges connecting distinct nodes. Here, the node and the two edges can be replaced by a direct edge connecting the neighbors. I.e. a figure 8 graph can be simplified to two nodes and three edges between them. (This is not a necessary step but helps computation).
Now, each edge will have two areas, one to either side (since the edges are undirected, "left" and "right" aren't obvious distinctions). So, for |E| edges, we need to consider 2 * |E| sides. They're in general not distinct. Two adjacent edges (connected to the same node) may border the same area, if they're also adjacent in the cyclic order of edges of that node. Obviously, for nodes with only two edges, the two edges share both areas (which is why we eliminated them in the previous step). For nodes with three edges, any two edges share at least one area.
So, here's how to enumerate those areas: Assign a sequential number to all edges and vertices. Assign a direction to each edge so it runs from the lower-numbered vertex to the higher. Start with edge 1, right side, and number this area 1. Trace the boundary edges of this area, assigning the same number 1 to the appropriate sides of its boundary edges. You do this by taking at each node the next adjacent edge in counter-cyclical order. When you get back to your starting point, you know all edges bounding area 1.
You then check the left side of the first edge. If it's not part of area 1, then it's area 2, and you apply the same algorithm. Next, check edge 2, right side and left side, etc. Each time you find an edge and a side that's still unnumbered, assign the next area number and trace the edges of the newly found area.
There's a slight problem with determining which area number corresponds to infinity. To see this, take a simple () graph: two edges, two nodes, and two areas (inside and outside). Due to the random numbering of edges and vertices, outside may end up as either 1 or 2. That's unavoidable; in graph theory there's no distinction between inside and outside.
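The tracing procedure can be sketched as follows, assuming the embedding is given as node coordinates, so that the cyclic order of edges at each node can be obtained by sorting neighbours by angle. The names trace_faces and signed_area are hypothetical. Each undirected edge is treated as two directed half-edges, one per side. Using the signed area to recognise the outer region is an addition that relies on having coordinates and a connected graph; it is not part of the purely graph-theoretic description above.

```python
import math

def trace_faces(coords, edges):
    """coords: {node: (x, y)}; edges: iterable of (u, v) pairs.

    Returns a list of faces, each a list of nodes in traversal order.
    """
    neighbors = {n: [] for n in coords}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    for n, nbrs in neighbors.items():
        x, y = coords[n]
        # Counter-clockwise cyclic order of the edges around node n.
        nbrs.sort(key=lambda m: math.atan2(coords[m][1] - y, coords[m][0] - x))

    visited = set()  # half-edges (a, b) whose side is already numbered
    faces = []
    for u in neighbors:
        for v in neighbors[u]:
            if (u, v) in visited:
                continue
            face = []
            a, b = u, v
            while (a, b) not in visited:
                visited.add((a, b))
                face.append(a)
                nbrs = neighbors[b]
                # Leave b along the edge just before a in the cyclic order.
                a, b = b, nbrs[nbrs.index(a) - 1]
            faces.append(face)
    return faces

def signed_area(coords, face):
    """Shoelace formula: positive for counter-clockwise faces."""
    pts = [coords[n] for n in face]
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]))
```

With this traversal rule the bounded areas come out counter-clockwise (positive area) and the outer region clockwise (negative area), which gives one way to tell them apart when coordinates are available.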
It's a standard function in the Boost Graph library. See this previous answer for details.