Optimal way to find shared elements between combination pairs

Optimal way to find shared elements between combination pairs - c++

I have a list of ordered items of type A, who each contain a subset from a list of items B. For each pair of items in A, I would like to find the number of items B that they share (intersect).
For example, if I have this data:
A1 : B1
A2 : B1 B2 B3
A3 : B1
Then I would get the following result:
A1, A2 : 1
A1, A3 : 1
A2, A3 : 1
The problem I'm having is making the algorithm efficient. The size of my dataset is about 8.4K items of type A. This means 8.4K choose 2 = 35275800 combinations. The algorithm I'm using is simply going through each combination pair and doing a set intersection.
The gist of what I have so far is below. I am storing the counts as a key in a map, with the value as a vector of A pairs. I'm using a graph data structure to store the data, but the only 'graph' operation I'm using is get_neighbors() which returns the B subset for an item from A. I happen to know that the elements in the graph are ordered from index 0 to 8.4K.
void get_overlap(Graph& g, map<int, vector<A_pair> >& overlap) {
map<int, vector<A_pair> >::iterator it;
EdgeList el_i, el_j;
set<int> intersect;
size_t i, j;
VertexList vl = g.vertices();
for (i = 0; i < vl.size()-1; i++) {
el_i = g.get_neighbors(i);
for (j = i+1; j < vl.size(); j++) {
el_j = g.get_neighbors(j);
set_intersection(el_i.begin(), el_i.end(), el_j.begin(), el_j.end(), inserter(intersect, intersect.begin()));
int num_overlap = intersect.size();
it = overlap.find(num_overlap);
if (it == overlap.end()) {
vector<A_pair> temp;
temp.push_back(A_pair(i, j));
overlap.insert(pair<int, vector<A_pair> >(num_overlap, temp));
}
else {
vector<A_pair> temp = it->second;
temp.push_back(A_pair(i, j));
overlap[num_overlap] = temp;
}
}
}
}
I have been running this program for nearly 24 hours, and the ith element in the for loop has reached iteration 250 (I'm printing each i to a log file). This, of course, is a long way from 8.4K (although I know as iterations go on, the number of comparisons will shorten since j = i +1). Is there a more optimal approach?
Edit: To be clear, the goal here is ultimately to find the top k overlapped pairs.
Edit 2: Thanks to #Beta and others for pointing out optimizations. In particular, updating the map directly (instead of copying its contents and resetting the map value) drastically improved the performance. It now runs in a matter of seconds.

I think you may be able to make things faster by pre-computing a reverse (edge-to-vertex) map. This would allow you to avoid the set_intersection call, which performs a bunch of costly set insertions. I am missing some declarations to make fully functional code, but hopefully you will get the idea. I am assuming that EdgeList is some sort of int vector:
void get_overlap(Graph& g, map<int, vector<A_pair> >& overlap) {
map<int, vector<A_pair> >::iterator it;
EdgeList el_i, el_j;
set<int> intersect;
size_t i, j;
VertexList vl = g.vertices();
// compute reverse map
map<int, set<int>> reverseMap;
for (i = 0; i < vl.size()-1; i++) {
el_i = g.get_neighbors(i);
for (auto e : el_i) {
const auto findIt = reverseMap.find(e);
if (end(reverseMap) == findIt) {
reverseMap.emplace(e, set<int>({i})));
} else {
findIt->second.insert(i);
}
}
}
for (i = 0; i < vl.size()-1; i++) {
el_i = g.get_neighbors(i);
for (j = i+1; j < vl.size(); j++) {
el_j = g.get_neighbors(j);
int num_overlap = 0;
for (auto e: el_i) {
auto findIt = reverseMap.find(e);
if (end(reverseMap) != findIt) {
if (findIt->second.count(j) > 0) {
++num_overlap;
}
}
}
it = overlap.find(num_overlap);
if (it == overlap.end()) {
overlap.emplace(num_overlap, vector<A_pair>({ A_pair(i, j) }));
}
else {
it->second.push_back(A_pair(i,j));
}
}
}
I didn't do the precise performance analysis, but inside the double loop, you replace "At most 4N comparisons" + some costly set insertions (from set_intersection) with N*log(M)*log(E) comparisons, where N is the average number of edge per vertex, and M is the average number of vertex per edge, and E is the number of edges, so it could be beneficial depending on your data set.
Also, if your edge indexes are compact, then you can use a simplae vector rather than a map to represent the reverse map, which removed the log(E) performance cost.
One question, though. Since you're talking about vertices and edges, don't you have the additional constraint that edges always have 2 vertices ? This could simplify some computations.

Related

MO's Algorithm to find number of elements present in both array

I have 2 arrays, before[N+1](1 indexed) and after[] (subarray of before[]). Now for M Queries, I need to find how many elements of after[] are present in before[] for the given range l,r.
For example:
N = 5
Before: (2, 1, 3, 4, 5)
After: (1, 3, 4, 5)
M = 2
L = 1, R = 5 → 4 elements (1, 3, 4, 5) of after[] are present in between before[1] and before[5]
L = 2, R = 4 → 3 elements (1, 3, 4) of after[] are present in between before[2] and before[4]
I am trying to use MO's algorithm to find this.Following is my code :
using namespace std;
int N, Q;
// Variables, that hold current "state" of computation
long long current_answer;
long long cnt[100500];
// Array to store answers (because the order we achieve them is messed up)
long long answers[100500];
int BLOCK_SIZE;
// We will represent each query as three numbers: L, R, idx. Idx is
// the position (in original order) of this query.
pair< pair<int, int>, int> queries[100500];
// Essential part of Mo's algorithm: comparator, which we will
// use with std::sort. It is a function, which must return True
// if query x must come earlier than query y, and False otherwise.
inline bool mo_cmp(const pair< pair<int, int>, int> &x,
const pair< pair<int, int>, int> &y)
{
int block_x = x.first.first / BLOCK_SIZE;
int block_y = y.first.first / BLOCK_SIZE;
if(block_x != block_y)
return block_x < block_y;
return x.first.second < y.first.second;
}
// When adding a number, we first nullify it's effect on current
// answer, then update cnt array, then account for it's effect again.
inline void add(int x)
{
current_answer -= cnt[x] * cnt[x] * x;
cnt[x]++;
current_answer += cnt[x] * cnt[x] * x;
}
// Removing is much like adding.
inline void remove(int x)
{
current_answer -= cnt[x] * cnt[x] * x;
cnt[x]--;
current_answer += cnt[x] * cnt[x] * x;
}
int main()
{
cin.sync_with_stdio(false);
cin >> N >> Q; // Q- number of queries
BLOCK_SIZE = static_cast<int>(sqrt(N));
long long int before[N+1]; // 1 indexed
long long int after[] // subarray
// Read input queries, which are 0-indexed. Store each query's
// original position. We will use it when printing answer.
for(long long int i = 0; i < Q; i++) {
cin >> queries[i].first.first >> queries[i].first.second;
queries[i].second = i;
}
// Sort queries using Mo's special comparator we defined.
sort(queries, queries + Q, mo_cmp);
// Set up current segment [mo_left, mo_right].
int mo_left = 0, mo_right = -1;
for(long long int i = 0; i < Q; i++) {
// [left, right] is what query we must answer now.
int left = queries[i].first.first;
int right = queries[i].first.second;
// Usual part of applying Mo's algorithm: moving mo_left
// and mo_right.
while(mo_right < right) {
mo_right++;
add(after[mo_right]);
}
while(mo_right > right) {
remove(after[mo_right]);
mo_right--;
}
while(mo_left < left) {
remove(after[mo_left]);
mo_left++;
}
while(mo_left > left) {
mo_left--;
add(after[mo_left]);
}
// Store the answer into required position.
answers[queries[i].second] = current_answer;
}
// We output answers *after* we process all queries.
for(long long int i = 0; i < Q; i++)
cout << answers[i] << "\n";
Now the problem is I can't figure out how to define add function and remove function.
Can someone help me out with these functions ?

Note: I'll denote the given arrays as a and b.
Let's learn how to add a new position (move right by one). If a[r] is already there, you can just ignore it. Otherwise, we need to add a[r] and add the number of occurrences of b[r] in a so far to the answer. Finally, if b[r] is already in a, we need to add one to the answer. Note that we need two count to arrays to do that: one for the first array and one for the second.
We know how to add one position in O(1), so we're almost there. How do we handle deletions?
Let's assume that we want to remove a subsegment. We can easily modify the count arrays. But how do we restore the answer? Well, we don't. Your solution goes like this:
save the current answer
add a subsegment
answer the query
remove it (we take care about the count arrays and ignore the answer)
restore the saved answer
That's it. It would require rebuilding the structure when we move the left pointer to the next block, but it still requires O(N sqrt(N)) time in the worst case.
Note: it might be possible to recompute the answer directly using count arrays when we remove one position, but the way I showed above looks easier too me.

Create Minimum Spanning Tree from Adjacency Matrix using Prims Algorithm

I want to implement Prims algorithm to find the minimal spanning tree of a graph. I have written some code to start with what I think is the way to do it, but Im kind of stuck on how to complete this.
Right now, I have a matrix stored in matrix[i][j], which is stored as a vector>. I have also a list of IP address stored in the variable ip. (This becomes the labels of each column/row in the graph)
int n = 0;
for(int i = 0; i<ip.size();i++) // column
{
for(int j = ip.size()-1; j>n;j--)
{
if(matrix[i][j] > 0)
{
edgef test;
test.ip1 = ip[i];
test.ip2 = ip[j];
test.w = matrix[i][j];
add(test);
}
}
n++;
}
At the moment, this code will look into one column, and add all the weights associated with that column to a binary min heap. What I want to do is, dequeue an item from the heap and store it somewhere if it is the minimum edge weight.
void entry::add(edgef x)
{
int current, temp;
current = heap.size();
heap.push_back(x);
if(heap.size() > 1)
{
while(heap[current].w < heap[current/2].w) // if child is less than parent, min heap style
{
edgef temp = heap[current/2]; // swap
heap[current/2] = heap[current];
heap[current] = temp;
current = current/2;
}
}
}

Creating random undirected graph in C++

The issue is I need to create a random undirected graph to test the benchmark of Dijkstra's algorithm using an array and heap to store vertices. AFAIK a heap implementation shall be faster than an array when running on sparse and average graphs, however when it comes to dense graphs, the heap should became less efficient than an array.
I tried to write code that will produce a graph based on the input - number of vertices and total number of edges (maximum number of edges in undirected graph is n(n-1)/2).
On the entrance I divide the total number of edges by the number of vertices so that I have a const number of edges coming out from every single vertex. The graph is represented by an adjacency list. Here is what I came up with:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <list>
#include <set>
#define MAX 1000
#define MIN 1
class Vertex
{
public:
int Number;
int Distance;
Vertex(void);
Vertex(int, int);
~Vertex(void);
};
Vertex::Vertex(void)
{
Number = 0;
Distance = 0;
}
Vertex::Vertex(int C, int D)
{
Number = C;
Distance = D;
}
Vertex::~Vertex(void)
{
}
int main()
{
int VertexNumber, EdgeNumber;
while(scanf("%d %d", &VertexNumber, &EdgeNumber) > 0)
{
int EdgesFromVertex = (EdgeNumber/VertexNumber);
std::list<Vertex>* Graph = new std::list<Vertex> [VertexNumber];
srand(time(NULL));
int Distance, Neighbour;
bool Exist, First;
std::set<std::pair<int, int>> Added;
for(int i = 0; i < VertexNumber; i++)
{
for(int j = 0; j < EdgesFromVertex; j++)
{
First = true;
Exist = true;
while(First || Exist)
{
Neighbour = rand() % (VertexNumber - 1) + 0;
if(!Added.count(std::pair<int, int>(i, Neighbour)))
{
Added.insert(std::pair<int, int>(i, Neighbour));
Exist = false;
}
First = false;
}
}
First = true;
std::set<std::pair<int, int>>::iterator next = Added.begin();
for(std::set<std::pair<int, int>>::iterator it = Added.begin(); it != Added.end();)
{
if(!First)
Added.erase(next);
Distance = rand() % MAX + MIN;
Graph[it->first].push_back(Vertex(it->second, Distance));
Graph[it->second].push_back(Vertex(it->first, Distance));
std::set<std::pair<int, int>>::iterator next = it;
First = false;
}
}
// Dijkstra's implementation
}
return 0;
}
I get an error:
set iterator not dereferencable" when trying to create graph from set data.
I know it has something to do with erasing set elements on the fly, however I need to erase them asap to diminish memory usage.
Maybe there's a better way to create some undirectioned graph? Mine is pretty raw, but that's the best I came up with. I was thinking about making a directed graph which is easier task, but it doesn't ensure that every two vertices will be connected.
I would be grateful for any tips and solutions!

Piotry had basically the same idea I did, but he left off a step.
Only read half the matrix, and ignore you diagonal for writing values to. If you always want a node to have an edge to itself, add a one at the diagonal. If you always do not want a node to have an edge to itself, leave it as a zero.
You can read the other half of your matrix for a second graph for testing your implementation.

Look at the description of std::set::erase :
Iterator validity
Iterators, pointers and references referring to elements removed by
the function are invalidated.
All other iterators, pointers and
references keep their validity.
In your code, if next is equal to it, and you erase element of std::set by next, you can't use it. In this case you must (at least) change it and only after this keep using of it.

Optimizing a dijkstra implementation

QUESTION EDITED, now I only want to know if a queue can be used to improve the algorithm.
I have found this implementation of a mix cost max flow algorithm, which uses dijkstra: http://www.stanford.edu/~liszt90/acm/notebook.html#file2
Gonna paste it here in case it gets lost in the internet void:
// Implementation of min cost max flow algorithm using adjacency
// matrix (Edmonds and Karp 1972). This implementation keeps track of
// forward and reverse edges separately (so you can set cap[i][j] !=
// cap[j][i]). For a regular max flow, set all edge costs to 0.
//
// Running time, O(|V|^2) cost per augmentation
// max flow: O(|V|^3) augmentations
// min cost max flow: O(|V|^4 * MAX_EDGE_COST) augmentations
//
// INPUT:
// - graph, constructed using AddEdge()
// - source
// - sink
//
// OUTPUT:
// - (maximum flow value, minimum cost value)
// - To obtain the actual flow, look at positive values only.
#include <cmath>
#include <vector>
#include <iostream>
using namespace std;
typedef vector<int> VI;
typedef vector<VI> VVI;
typedef long long L;
typedef vector<L> VL;
typedef vector<VL> VVL;
typedef pair<int, int> PII;
typedef vector<PII> VPII;
const L INF = numeric_limits<L>::max() / 4;
struct MinCostMaxFlow {
int N;
VVL cap, flow, cost;
VI found;
VL dist, pi, width;
VPII dad;
MinCostMaxFlow(int N) :
N(N), cap(N, VL(N)), flow(N, VL(N)), cost(N, VL(N)),
found(N), dist(N), pi(N), width(N), dad(N) {}
void AddEdge(int from, int to, L cap, L cost) {
this->cap[from][to] = cap;
this->cost[from][to] = cost;
}
void Relax(int s, int k, L cap, L cost, int dir) {
L val = dist[s] + pi[s] - pi[k] + cost;
if (cap && val < dist[k]) {
dist[k] = val;
dad[k] = make_pair(s, dir);
width[k] = min(cap, width[s]);
}
}
L Dijkstra(int s, int t) {
fill(found.begin(), found.end(), false);
fill(dist.begin(), dist.end(), INF);
fill(width.begin(), width.end(), 0);
dist[s] = 0;
width[s] = INF;
while (s != -1) {
int best = -1;
found[s] = true;
for (int k = 0; k < N; k++) {
if (found[k]) continue;
Relax(s, k, cap[s][k] - flow[s][k], cost[s][k], 1);
Relax(s, k, flow[k][s], -cost[k][s], -1);
if (best == -1 || dist[k] < dist[best]) best = k;
}
s = best;
}
for (int k = 0; k < N; k++)
pi[k] = min(pi[k] + dist[k], INF);
return width[t];
}
pair<L, L> GetMaxFlow(int s, int t) {
L totflow = 0, totcost = 0;
while (L amt = Dijkstra(s, t)) {
totflow += amt;
for (int x = t; x != s; x = dad[x].first) {
if (dad[x].second == 1) {
flow[dad[x].first][x] += amt;
totcost += amt * cost[dad[x].first][x];
} else {
flow[x][dad[x].first] -= amt;
totcost -= amt * cost[x][dad[x].first];
}
}
}
return make_pair(totflow, totcost);
}
};
My question is if it can be improved by using a priority queue inside of Dijkstra(). I tried but I couldn't get it to work properly.
Actually I suspect that in Dijkstra it should be looping over adjacent nodes, not all nodes...
Thanks a lot.

Surely Dijkstra's algorithm can be improved by using minheap. After we put a vertex into shortest-path tree and process (i.e. label) all adjacent vertices, our next step is to select the vertex with smallest label, not yet in the tree.
This is where minheap comes to mind. Rather than sequentially scan through all vertices, we extract the min element from heap and restructure it, which takes O(logn) time vs O(n). Note that the heap is going to keep only those vertices that are not yet in the shortest-path tree. However we should be able to somehow modify vertices in the heap, if we update their labels.

I am not so sure using a priority queue to implement Dijkstra's algorithm will actually improve the run time because, while using a priority queue decreases the amount of time needed to find the vertex with minimum distance from the source (O(log V) with a priority queue vs. O(V) in the naive implementation), it also increases the amount of time needed to process a new edge (O(log V) with a priority queue vs. O(1) in the naive implementation).
Thus, for the naive implementation, the running time is O(V^2+E).
However, for the priority queue implementation, the running time is O(V log V+E log V).
For very dense graphs, E could be O(V^2), which means the naive implementation would have running time O(V^2+V^2)=O(V^2) while the priority queue implementation would have running time O(V log V+V^2 log V)=O(V^2 log V). Thus, as you can see, the priority queue implementation actually has a worse worst-case run time in the case of dense graphs.
Given the fact that the people writing the above implementation stored the edges as an adjacency matrix rather than using adjacency lists, it looks like the people who wrote this code were expecting the graph to be a dense graph with O(V^2) edges, so it makes sense that they would use the naive implementation over the priority queue implementation here.
For more info about running time of Dijkstra's algorithm, read up on this Wikipedia page.

C++ Mark for contiguous sections in a 3D array of objects

If we have a 3x3x3 array of objects, which contain two members: a boolean, and an integer; can anyone suggest an efficient way of marking this array in to contiguous chunks, based on the boolean value.
For example, if we picture it as a Rubix cube, and a middle slice was missing (everything on 1,x,x == false), could we mark the two outer slices as separate groups, by way of a unique group identifier on the int member.
The same needs to apply if the "slice" goes through 90 degrees, leaving an L shape and a strip.
Could it be done with very large 3D arrays using recursion? Could it be threaded.
I've hit the ground typing a few times so far but have ended up in a few dead ends and stack overflows.
Very grateful for any help, thanks.

It could be done that way:
struct A {int m_i; bool m_b;};
enum {ELimit = 3};
int neighbour_offsets_positive[3] = {1, ELimit, ELimit*ELimit};
A cube[ELimit][ELimit][ELimit];
A * first = &cube[0][0][0];
A * last = &cube[ELimit-1][ELimit-1][ELimit-1];
// Init 'cube'.
for(A * it = first; it <= last; ++it)
it->m_i = 0, it->m_b = true;
// Slice.
for(int i = 0; i != ELimit; ++i)
for(int j = 0; j != ELimit; ++j)
cube[1][i][j].m_b = false;
// Assign unique ids to coherent parts.
int id = 0;
for(A * it = first; it <= last; ++it)
{
if (it->m_b == false)
continue;
if (it->m_i == 0)
it->m_i = ++id;
for (int k = 0; k != 3; ++k)
{
A * neighbour = it + neighbour_offsets_positive[k];
if (neighbour <= last)
if (neighbour->m_b == true)
neighbour->m_i = it->m_i;
}
}

If I understand the term "contiguous chunk" correctly, i.e the maximal set of all those array elements for which there is a path from each vertex to all other vertices and they all share the same boolean value, then this is a problem of finding connected components in a graph which can be done with a simple DFS. Imagine that each array element is a vertex, and two vertices are connected if and only if 1) they share the same boolean value 2) they differ only by one coordinate and that difference is 1 by absolute value (i.e. they are adjacent)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Optimal way to find shared elements between combination pairs - c++

Related

MO's Algorithm to find number of elements present in both array

Create Minimum Spanning Tree from Adjacency Matrix using Prims Algorithm

Creating random undirected graph in C++

Optimizing a dijkstra implementation

C++ Mark for contiguous sections in a 3D array of objects

Categories

Resources