Efficient trick for maximum bipartite matching in a large graph - C++

Problem: We are given two arrays A and B of integers. In each step we are allowed to remove a pair of non-coprime integers, one from each array. We have to find the maximum number of pairs that can be removed by these steps.
Bounds:
length of A, B <= 10^5
every integer <= 10^9
Dinic's algorithm - O(V^2 E)
Edmonds-Karp algorithm - O(V E^2)
Hopcroft–Karp algorithm - O(E sqrt(V))
My approach up till now: This can be modeled as a bipartite matching problem with the two sets A and B, where an edge is created between every non-coprime pair of integers from the corresponding sets.
But the problem is that there can be O(V^2) edges in the graph, and most bipartite matching and max-flow algorithms will be super slow for such large graphs.
I am looking for some problem-specific or mathematical optimization that can solve the problem in reasonable time. To pass the test cases I need at most an O(V log V) or O(V sqrt(V)) algorithm.
Thanks in advance.

You could try making a graph with vertices for:
A source
Every element in A
Every prime present in any number in A
Every element in B
A destination
Add directed edges with capacity 1 from source to elements in A, and from elements in B to destination.
Add directed edges with capacity 1 from each element x in A to every distinct prime in the prime factorisation of x.
Add directed edges with capacity 1 from each prime p to every element x in B where p divides x.
Then solve for max flow from source to destination.
The numbers will have a small number of distinct prime factors (at most 9, because 2·3·5·7·11·13·17·19·23·29 is bigger than 10^9), so you will have at most 1,800,000 edges in the middle.
This is much fewer than the 10,000,000,000 edges you could have had before (e.g. if all 100,000 entries in A and B were all even) so perhaps your max flow algorithm has a chance of meeting the time limit.
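Here is a sketch of that construction (the max-flow step itself is left to whichever implementation you already have; Dinic's is a good fit since every capacity is 1). The names build_flow_graph, distinct_prime_factors and the Edge struct are mine, not from the answer:

#include <unordered_map>
#include <vector>

struct Edge { int from, to, cap; };

// Distinct prime factors of x (x <= 1e9), by trial division with the primes <= sqrt(1e9).
std::vector<long long> distinct_prime_factors(long long x, const std::vector<int>& small_primes) {
    std::vector<long long> fs;
    for (int p : small_primes) {
        if ((long long)p * p > x) break;
        if (x % p == 0) {
            fs.push_back(p);
            while (x % p == 0) x /= p;
        }
    }
    if (x > 1) fs.push_back(x);            // leftover factor is a prime > sqrt(original x)
    return fs;
}

// Builds the capacity-1 edge list of the network. Node numbering:
// 0 = source, 1..|A| = elements of A, then one node per distinct prime occurring in A,
// then |B| nodes for the elements of B, then the sink.
std::vector<Edge> build_flow_graph(const std::vector<long long>& A,
                                   const std::vector<long long>& B,
                                   int& source, int& sink, int& num_nodes) {
    const int LIM = 31623;                 // about sqrt(1e9)
    std::vector<bool> composite(LIM + 1, false);
    std::vector<int> small_primes;
    for (int i = 2; i <= LIM; ++i) {
        if (composite[i]) continue;
        small_primes.push_back(i);
        for (int j = 2 * i; j <= LIM; j += i) composite[j] = true;
    }

    std::vector<std::vector<long long>> fa(A.size()), fb(B.size());
    std::unordered_map<long long, int> prime_node;          // prime value -> node id
    for (int i = 0; i < (int)A.size(); ++i) {
        fa[i] = distinct_prime_factors(A[i], small_primes);
        for (long long p : fa[i]) prime_node.emplace(p, 0);  // register primes seen in A
    }
    for (int i = 0; i < (int)B.size(); ++i)
        fb[i] = distinct_prime_factors(B[i], small_primes);

    source = 0;
    int next_id = 1 + (int)A.size();
    for (auto& kv : prime_node) kv.second = next_id++;       // assign the prime node ids
    const int b_base = next_id;
    sink = b_base + (int)B.size();
    num_nodes = sink + 1;

    std::vector<Edge> edges;
    for (int i = 0; i < (int)A.size(); ++i) {
        edges.push_back({source, 1 + i, 1});                 // source -> a_i
        for (long long p : fa[i])
            edges.push_back({1 + i, prime_node[p], 1});      // a_i -> prime
    }
    for (int i = 0; i < (int)B.size(); ++i) {
        edges.push_back({b_base + i, sink, 1});              // b_i -> sink
        for (long long p : fb[i]) {
            auto it = prime_node.find(p);                    // only primes that also occur in A
            if (it != prime_node.end())
                edges.push_back({it->second, b_base + i, 1}); // prime -> b_i
        }
    }
    return edges;
}

Running any standard max-flow from source to sink over this edge list gives the answer to the original problem.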


Fast generation of random derangements

I am looking to generate derangements uniformly at random. In other words: shuffle a vector so that no element stays in its original place.
Requirements:
uniform sampling (each derangement is generated with equal probability)
a practical implementation that is faster than the rejection method (i.e. keep generating random permutations until we find a derangement)
None of the answers I found so far are satisfactory, in that they either don't sample uniformly (or fail to prove uniformity) or do not make a practical comparison with the rejection method. About 1/e ≈ 37% of permutations are derangements, which gives a clue about what performance one might expect at best relative to the rejection method.
The only reference I found which makes a practical comparison is in this thesis which benchmarks 7.76 s for their proposed algorithm vs 8.25 s for the rejection method (see page 73). That's a speedup by a factor of only 1.06. I am wondering if something significantly better (> 1.5) is possible.
I could implement and verify various algorithms proposed in papers, and benchmark them. Doing this correctly would take quite a bit of time. I am hoping that someone has done it, and can give me a reference.
Here is an idea for an algorithm that may work for you. Generate the derangement in cycle notation. So (1 2) (3 4 5) represents the derangement 2 1 4 5 3. (That is (1 2) is a cycle and so is (3 4 5).)
Put the first element in the first place (in cycle notation you can always do this) and take a random permutation of the rest. Now we just need to find out where the parentheses go for the cycle lengths.
As https://mathoverflow.net/questions/130457/the-distribution-of-cycle-length-in-random-derangement notes, in a random permutation the length of the cycle containing a given element is uniformly distributed. Cycle lengths are not uniformly distributed in derangements. But the number of derangements of length m is m!/e rounded up for even m and down for odd m. So what we can do is pick a length uniformly distributed in the range 2..n and accept it with the probability that the remaining elements would, proceeding randomly, be a derangement (if the chosen length is k and m elements remain, that is the number of derangements of m-k elements divided by (m-k)!, which is roughly 1/e). This cycle length will then be correctly distributed. And once we have the first cycle length, we repeat for the next until we are done.
The procedure done the way I described is simpler to implement but mathematically equivalent to taking a random derangement (by rejection), and writing down the first cycle only. Then repeating. It is therefore possible to prove that this produces all derangements with equal probability.
With this approach done naively, we will be taking an average of 3 rolls before accepting a length. However we then cut the problem in half on average. So the number of random numbers we need to generate for placing the parentheses is O(log(n)). Compared with the O(n) random numbers for constructing the permutation, this is a rounding error. However it can be optimized by noting that the highest probability for accepting is 0.5. So if we accept with twice the probability of randomly getting a derangement if we proceeded, our ratios will still be correct and we get rid of most of our rejections of cycle lengths.
If most of the time is spent in the random number generator, for large n this should run at approximately 3x the rate of the rejection method. In practice it won't be as good because switching from one representation to another is not actually free. But you should get speedups of the order of magnitude that you wanted.
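A rough sketch of this construction, assuming the acceptance rule is read as: with m elements remaining, accept a proposed cycle length k with probability (number of derangements of m-k elements) / (m-k)!. The names (derangement_by_cycles, q) are mine, and the "accept with twice the probability" speedup from the last paragraph is not included:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Random derangement of {0, ..., n-1} built in cycle notation (n >= 2 assumed).
std::vector<int> derangement_by_cycles(int n, std::mt19937_64& rng) {
    // q[j] = D(j)/j! = sum_{i=0..j} (-1)^i / i!, so q[0] = 1, q[1] = 0, q[j] -> 1/e.
    std::vector<double> q(n + 1);
    double term = 1.0;
    q[0] = 1.0;
    for (int j = 1; j <= n; ++j) { term *= -1.0 / j; q[j] = q[j - 1] + term; }

    // Element 0 first, then a random permutation of the remaining elements.
    std::vector<int> seq(n);
    std::iota(seq.begin(), seq.end(), 0);
    std::shuffle(seq.begin() + 1, seq.end(), rng);

    // Decide where the "parentheses" go: sample cycle lengths by rejection.
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<int> perm(n);
    int start = 0;                      // start of the current cycle inside seq
    int remaining = n;
    while (remaining > 0) {
        int k;
        do {
            std::uniform_int_distribution<int> pick(2, remaining);
            k = pick(rng);              // proposed cycle length
        } while (coin(rng) >= q[remaining - k]);   // accept if the rest could still derange
        // Close the cycle (seq[start] ... seq[start+k-1]).
        for (int i = 0; i < k - 1; ++i) perm[seq[start + i]] = seq[start + i + 1];
        perm[seq[start + k - 1]] = seq[start];
        start += k;
        remaining -= k;
    }
    return perm;
}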
This is just an idea, but I think it can produce uniformly distributed derangements.
But you need a helper buffer with a maximum of around N/2 elements, where N is the number of items to be arranged.
First, choose a random(1, N) position for value 1.
Note: 1 to N instead of 0 to N-1 for simplicity.
Then for value 2, the position will be random(1, N-1) if 1 fell on position 2, and random(1, N-2) otherwise.
The algorithm walks the list and counts only the not-yet-used positions until it reaches the chosen random position for value 2; position 2 itself is of course skipped.
For value 3 the algorithm checks whether position 3 is already used: if it is used, pos3 = random(1, N-2); if not, pos3 = random(1, N-3).
Again, the algorithm walks the list and counts only the not-yet-used positions until the count reaches pos3, and then places value 3 there.
This goes on for the next values until all the values have been placed in positions.
And that will generate derangements with uniform probability.
The optimization should be focused on how the algorithm reaches pos# quickly.
Instead of walking the list to count the not-yet-used positions, the algorithm can use a somewhat heap-like search for the positions not yet used, instead of counting and checking positions one by one (or any other method besides heap-like searching). This is a separate problem to be solved: how to reach an unused item given its position-count in a list of unused items.
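That last sub-problem (jump to the k-th not-yet-used position without walking the list) is commonly handled with a Fenwick / binary indexed tree that stores a 1 for every unused position; the k-th unused position is then found in O(log N) by descending the tree. A sketch with illustrative names, not from the answer above:

#include <vector>

struct FenwickKth {
    int n, log2n;
    std::vector<int> bit;                       // 1-based Fenwick tree of 0/1 flags
    explicit FenwickKth(int n) : n(n), bit(n + 1, 0) {
        log2n = 1;
        while ((1 << log2n) <= n) ++log2n;
        for (int i = 1; i <= n; ++i) add(i, +1); // every position starts out unused
    }
    void add(int i, int delta) {
        for (; i <= n; i += i & -i) bit[i] += delta;
    }
    // The k-th unused position: the smallest index whose prefix sum of flags is k,
    // found by descending the implicit binary structure of the tree.
    int kth_unused(int k) {
        int pos = 0;
        for (int step = 1 << log2n; step > 0; step >>= 1)
            if (pos + step <= n && bit[pos + step] < k) {
                pos += step;
                k -= bit[pos];
            }
        return pos + 1;
    }
    void mark_used(int p) { add(p, -1); }
};

Each placement in the procedure above then becomes: draw the random count, call kth_unused to find the actual position, place the value there, and call mark_used on that position.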
I'm curious ... and mathematically uninformed. So I ask innocently, why wouldn't a "simple shuffle" be sufficient?
for i from array_size-1 downto 1:   # assume zero-based arrays
    j = random(0, i-1)
    swap_elements(i, j)
Since the random function will never produce a value equal to i it will never leave an element where it started. Every element will be moved "somewhere else."
Let d(n) be the number of derangements of an array A of length n.
d(n) = (n-1) * (d(n-1) + d(n-2))
The d(n) arrangements are achieved by:
1. First, swapping A[0] with one of the remaining n-1 elements
2. Next, either deranging all n-1 remaining elements, or deranging the n-2 remaining elements that exclude the index that received A[0] from the initial array.
How can we generate a derangement uniformly at random?
1. Perform the swap of step 1 above.
2. Randomly decide which path we're taking in step 2,
with probability d(n-1)/(d(n-1)+d(n-2)) of deranging all remaining elements.
3. Recurse down to derangements of sizes 2 and 3, which are both precomputed.
Wikipedia has d(n) = floor(n!/e + 0.5) (exactly). You can use this to calculate the probability of step 2 exactly in constant time for small n. For larger n the factorial can be slow, but all you need is the ratio. It's approximately (n-1)/n. You can live with the approximation, or precompute and store the ratios up to the max n you're considering.
Note that the exact ratio converges to (n-1)/n very quickly.
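A sketch of this recursion in C++, written iteratively over a pool of unsettled positions; the bookkeeping and names are mine, and it uses double-precision ratios as suggested above rather than exact d(n) values:

#include <numeric>
#include <random>
#include <utility>
#include <vector>

// p[m] = d(m-1) / (d(m-1) + d(m-2)), computed through the ratio r(m) = d(m-1)/d(m)
// so that d(m) itself (which overflows quickly) is never needed.
std::vector<double> branch_probs(int n) {
    std::vector<double> r(n + 1, 0.0), p(n + 1, 0.0);      // r[2] = d(1)/d(2) = 0
    for (int m = 3; m <= n; ++m)
        r[m] = 1.0 / ((m - 1) * (1.0 + r[m - 1]));         // from d(m) = (m-1)(d(m-1)+d(m-2))
    for (int m = 2; m <= n; ++m)
        p[m] = (m - 1) * r[m];
    return p;
}

// Uniformly random derangement of {0, ..., n-1}, n >= 2.
std::vector<int> random_derangement(int n, std::mt19937_64& rng) {
    const std::vector<double> p = branch_probs(n);
    std::vector<int> A(n), pos(n);
    std::iota(A.begin(), A.end(), 0);       // A[i] = value currently sitting at position i
    std::iota(pos.begin(), pos.end(), 0);   // pos[lo..n-1] are the not-yet-settled positions
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    int lo = 0;
    while (n - lo >= 2) {
        const int m = n - lo;                              // number of unsettled positions
        std::uniform_int_distribution<int> pick(lo + 1, n - 1);
        const int k = pick(rng);                           // "the index that received A[0]"
        std::swap(A[pos[lo]], A[pos[k]]);
        bool settle_partner;
        if (m == 2)      settle_partner = true;            // only the d(0) branch remains
        else if (m == 3) settle_partner = false;           // d(1) = 0, the partner must move again
        else             settle_partner = (coin(rng) >= p[m]);   // d(m-2) branch
        if (settle_partner) {                              // partner keeps the swapped-in value
            std::swap(pos[k], pos[lo + 1]);
            lo += 2;
        } else {                                           // derange all m-1 remaining positions
            lo += 1;
        }
    }
    return A;
}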

Finding All Intervals That Overlap a Point

Consider a large set of floating-point intervals in 1-dimension,
e.g.
[1.0, 2.5], 1.0 |---------------|2.5
[1.5, 3.6], 1.5|---------------------|3.6
.....
It is desired to find all intervals that contain a given point. For example, given point = 1.2, the algorithm should return the first interval, and given point = 2.0, it should return the first two intervals in the above example.
In the problem I am dealing with, this operation needs to be repeated a large number of times for a large number of intervals. Therefore a brute-force search is not desirable and performance is an important factor.
After searching, I saw that this problem is addressed using interval skip lists in the context of computational geometry. I was wondering if there is any simple, efficient C++ implementation available.
EDIT: To be more precise about the problem: there are N intervals, and for each of M points it should be determined which intervals contain that point. N and M are large numbers, where M is larger than N.
Suggest using CGAL range trees:
Wikipedia says interval trees (1-dimensional range trees) can "efficiently find all intervals that overlap with any given interval or point".
If your distribution of intervals allows it, it may be worth considering a gridding approach: choose some grid cell size s and create an array of lists, where the k-th list enumerates the intervals that overlap the "cell" [k·s, (k+1)·s).
Then a query amounts to finding the cell that contains the query point (in O(1)) and reporting all intervals in that cell's list that actually contain it (in O(K)).
Both preprocessing time and storage are O(I·L + G), where I is the number of intervals, L the average interval length in terms of the grid size, and G the total number of grid cells. The cell size s must be chosen carefully.
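A minimal sketch of that gridding idea, assuming the intervals and query points lie in a known range [range_lo, range_hi] and that a cell size has already been chosen (the struct and names are illustrative, not a reference implementation):

#include <cmath>
#include <vector>

struct Interval { double lo, hi; };

struct IntervalGrid {
    double lo, s;                               // grid origin and cell size
    std::vector<std::vector<int>> cells;        // interval indices stored per cell
    std::vector<Interval> ivs;

    IntervalGrid(std::vector<Interval> intervals, double range_lo, double range_hi, double cell_size)
        : lo(range_lo), s(cell_size), ivs(std::move(intervals)) {
        const int ncells = (int)std::ceil((range_hi - range_lo) / cell_size) + 1;
        cells.resize(ncells);
        for (int i = 0; i < (int)ivs.size(); ++i) {
            const int first = (int)((ivs[i].lo - lo) / s);   // assumes intervals fit in the range
            const int last  = (int)((ivs[i].hi - lo) / s);
            for (int c = first; c <= last; ++c) cells[c].push_back(i);
        }
    }

    // Indices of all intervals that contain x (endpoints inclusive).
    std::vector<int> query(double x) const {
        std::vector<int> out;
        const int c = (int)((x - lo) / s);
        if (c < 0 || c >= (int)cells.size()) return out;
        for (int i : cells[c])
            if (ivs[i].lo <= x && x <= ivs[i].hi) out.push_back(i);
        return out;
    }
};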

Minimum Mean Weight Cycle - Intuitive Explanation

In a directed graph, we are looking for the cycle with the lowest average edge weight. For instance, a graph with nodes 1 and 2, an edge from 1 to 2 of weight 2 and an edge from 2 to 1 of weight 4, would have a minimum mean cycle of 3.
I am not looking for a complicated method (Karp's), but for a simple backtracking-with-pruning solution. An explanation is given as "Solvable with backtracking with important pruning when current running mean is greater than the best found mean weight cycle cost."
However, why does this method work? If we are halfway through a cycle and the weight is more than the best found mean, isn't it possible that with small weight edges we can reach a situation where our current cycle can go lower than the best found mean?
Edit: Here is a sample problem: http://uva.onlinejudge.org/index.php?option=onlinejudge&page=show_problem&problem=2031
Let the optimal solution for the given graph be a cycle with average edge weight X.
There is some optimal cycle with edges e_1, e_2 ... e_n, such that avg(e_i) = X.
For my proof, I assume all indexes modulo n, so e_(n + 1) is e_1.
Let's say that our heuristic can't find this solution; that means: for each i (whichever edge we take first) there exists a j (we have followed the edges from i to j so far) such that the average edge weight of the sequence e_i ... e_j is greater than X (the heuristic prunes this solution).
Then we can show that the average edge weight cannot be equal to X. Let's take a longest contiguous subsequence that is not pruned by the heuristic (i.e. having average edge weight not greater than X at every prefix). At least one e_i <= X, so such a subsequence exists. For the first element e_k of this subsequence, there is a p such that avg(e_k ... e_p) > X; we take the first such p. Now let k' = p + 1 and get another p'. We repeat this process until we hit our initial k again. The final p may not overrun the initial k: that would mean the final subsequence contains the initial [e_k, e_{p-1}], which contradicts our construction for e_k. Now our sequence e_1 ... e_n is completely covered by non-overlapping subsequences e_k ... e_p, e_k' ... e_p', etc., each of which has average edge weight greater than X. So we have a contradiction with avg(e_i) = X.
As for your question:
If we are halfway through a cycle and the weight is more than the best
found mean, isn't it possible that with small weight edges we can
reach a situation where our current cycle can go lower than the best
found mean?
Of course it is. But we can safely prune this solution, as later we will discover the same cycle starting from another edge, which will not be pruned. My proof states that if we consider every possible cycle in the graph, sooner or later we will find an optimal cycle.
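For concreteness, here is a sketch of that backtracking search with the pruning rule; the adjacency-list format and names are mine, and it restricts itself to simple cycles, which is safe because a minimum-mean cycle can always be taken to be simple:

#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<std::pair<int, double>>>;   // adj[u] = {(v, weight), ...}

// Extend a simple path from `start`; prune as soon as the running mean exceeds `best`.
void search(const Graph& g, int start, int u, double sum, int len,
            std::vector<char>& on_path, double& best) {
    for (const auto& e : g[u]) {
        const int v = e.first;
        const double s = sum + e.second;
        const int l = len + 1;
        if (s / l > best) continue;              // prune: running mean already worse than best
        if (v == start) {
            best = std::min(best, s / l);        // closed a cycle back to the start vertex
        } else if (!on_path[v]) {
            on_path[v] = 1;
            search(g, start, v, s, l, on_path, best);
            on_path[v] = 0;
        }
    }
}

// Minimum mean edge weight over all cycles; +infinity if the graph is acyclic.
double min_mean_cycle(const Graph& g) {
    double best = std::numeric_limits<double>::infinity();
    std::vector<char> on_path(g.size(), 0);
    for (int s = 0; s < (int)g.size(); ++s) {    // try every start, so every cycle is reachable
        on_path[s] = 1;
        search(g, s, s, 0.0, 0, on_path, best);
        on_path[s] = 0;
    }
    return best;
}

Trying every start vertex is what guarantees that some rotation of the optimal cycle survives the pruning, as argued above.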

Efficient maximum independent set within a graph with a maximum degree of 2

The limits are 100,000 (10^5) nodes and 2 or fewer edges per node.
How could we get a maximum independent set for this graph in O(n) or O(n log n) time? Otherwise it exceeds the time limit. By the way, I just need to know the number of points making up the set, not necessarily the set of points itself.
I know of the greedy approximation that runs in O(n): pick the node with the lowest degree, add it to our set, then remove all of its neighbors, and repeat until the graph is empty. This approximation works for many cases. The thing is, with these restrictions, isn't there an algorithm that always works?
On that class of graphs, if you greedily choose the node with the lowest degree and delete it and its neighbors, then you'll get an optimal solution, in linear time.
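Since every vertex has degree at most 2, each connected component is a simple path or a simple cycle, so the count that the greedy above produces can also be computed directly: ceil(k/2) for a path on k nodes and floor(k/2) for a cycle on k nodes. A sketch of that counting (the adjacency-list format and names are mine; it assumes a simple graph):

#include <vector>

int max_independent_set_size(const std::vector<std::vector<int>>& adj) {
    const int n = (int)adj.size();
    std::vector<char> seen(n, 0);
    int total = 0;
    for (int s = 0; s < n; ++s) {
        if (seen[s]) continue;
        // Walk the whole component iteratively, counting its vertices and edge endpoints.
        std::vector<int> stack = {s};
        seen[s] = 1;
        int vertices = 0, edge_ends = 0;
        while (!stack.empty()) {
            const int u = stack.back();
            stack.pop_back();
            ++vertices;
            edge_ends += (int)adj[u].size();
            for (int v : adj[u])
                if (!seen[v]) { seen[v] = 1; stack.push_back(v); }
        }
        const int edges = edge_ends / 2;
        // A path on k vertices has k-1 edges; a cycle on k vertices has k edges.
        if (edges == vertices)
            total += vertices / 2;          // cycle component: floor(k/2)
        else
            total += (vertices + 1) / 2;    // path component (or isolated vertex): ceil(k/2)
    }
    return total;
}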

Counting Number of Paths (Length N) in Bipartite Graph

I am currently counting the number of paths of length $n$ in a bipartite graph by doing a depth first search (up to 10 levels). However, my implementation of this takes 5+ minutes to count 7 million paths of length 5 from a bipartite graph with 3000+ elements. I am looking for a more efficient way to do this counting problem, and I am wondering if there is any such algorithm in the literature.
These are undirected bipartite graphs, so there can be cycles in the paths.
My goal here is to count the number of paths of length $n$ in a bipartite graph of 1 million elements in under a minute.
Thank you in advance for any suggested answers.
I agree with the first idea, but it's not quite a BFS. In a BFS you go through each node once; here you can visit a node a large number of times.
You have to keep 2 arrays (let's call them Cnt1 and Cnt2: Cnt1 is the number of times you have reached an element with a path of length i, and Cnt2 is the same but for length i + 1). At the start, all the elements are 0 in Cnt2 and 1 in Cnt1 (because you have one path of length zero starting at each node).
Repeat N times:
Go through all the nodes.
For the current node, go through all of its connected nodes and, for each of them, add to its position in Cnt2 the number of times you reached the current node in Cnt1.
When you have finished all the nodes, copy Cnt2 into Cnt1 and set Cnt2 to zero.
At the end, add up all the numbers in Cnt1 and that is the answer.
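A sketch of that procedure, with cnt1/cnt2 playing the roles of Cnt1/Cnt2 (the counts overflow 64-bit integers very quickly for large graphs, so in practice you would count modulo something or with big integers):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Number of walks of length `len` in an undirected graph given as adjacency lists.
// cnt1[v] = number of walks of the current length ending at v; cnt2 builds the next length.
std::uint64_t count_walks(const std::vector<std::vector<int>>& adj, int len) {
    const int n = (int)adj.size();
    std::vector<std::uint64_t> cnt1(n, 1), cnt2(n, 0);   // one length-0 walk per node
    for (int step = 0; step < len; ++step) {
        for (int u = 0; u < n; ++u)
            for (int v : adj[u])
                cnt2[v] += cnt1[u];                      // extend every walk ending at u by (u, v)
        cnt1.swap(cnt2);
        std::fill(cnt2.begin(), cnt2.end(), 0);
    }
    return std::accumulate(cnt1.begin(), cnt1.end(), std::uint64_t{0});
}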
Convert to a breadth-first search, and whenever you have 2 paths that lead to the same node at the same length, just keep track of how many such ways there are and not how you got there.
This will avoid a lot of repeated work and should provide a significant speedup. (If n is not small, there are better speedups, read on.)
My goal here is to count the number of paths of length n in a bipartite graph of 1 million elements under a minute.
Um, good luck?
An alternate approach to look into: if you take the adjacency matrix of the graph and raise it to the n'th power, each entry of the resulting matrix is the number of paths of length n starting in one place and ending in another. So you can take shortcuts like repeated squaring. Convenient, isn't it?
Unfortunately a million-element graph gives rise to an adjacency matrix with 10^12 entries. Multiplying two such matrices with a naive algorithm would require 10^18 operations. Of course we have better matrix multiplication algorithms, but you're still not getting below, say, 10^15 operations, which will most assuredly not complete in 1 minute. (If your matrix is sparse enough you might have a chance, but you should do some research on the topic.)
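For graphs small enough to hold a dense adjacency matrix, the repeated-squaring idea looks roughly like this (names are mine; plain uint64_t arithmetic wraps on overflow, so a real version would work modulo something or with big integers):

#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<std::uint64_t>>;

// c = a * b for square matrices of equal size.
Matrix multiply(const Matrix& a, const Matrix& b) {
    const int n = (int)a.size();
    Matrix c(n, std::vector<std::uint64_t>(n, 0));
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            if (a[i][k])
                for (int j = 0; j < n; ++j)
                    c[i][j] += a[i][k] * b[k][j];
    return c;
}

// a^e by repeated squaring: entry (i, j) of adjacency^e counts length-e walks from i to j.
Matrix matrix_power(Matrix a, unsigned long long e) {
    const int n = (int)a.size();
    Matrix result(n, std::vector<std::uint64_t>(n, 0));
    for (int i = 0; i < n; ++i) result[i][i] = 1;        // start from the identity matrix
    while (e > 0) {
        if (e & 1) result = multiply(result, a);
        a = multiply(a, a);
        e >>= 1;
    }
    return result;
}

Summing all the entries of matrix_power(adjacency, n) gives the total number of length-n paths.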