Finding most similar Range in an Array

I am trying to find the range A[i..j] that is most similar to B.
Here calcSimilarity is a function that returns the similarity of two arrays.
Similarity is calculated as
Rather than a brute-force search, I want to know what kind of data structure and algorithm would be efficient for this range search.
SAMPLE input/output
input: A: [(10,1), (20,1), (-200,2), (33,1), (42,1), (58,1)] B:[(20,1), (30,1), (1000,2)]
output: most similar Range is [1, 3]
match [20, 33] => [20, 30]
This is brute force search code.
struct object{
int type, value;
}A[10000],B[100];
int N, M;
int calcSimilarity(object X[], int n, object Y[], int m){
if(n > m) return calcSimilarity(Y, m, X, n);
for(all possible match){//match is (i, link[i])
int minDif = 0x7fffff; // same "infinity" sentinel as used below
int count = 0;
for( i = 0; i< n; i++){
int j = link[i];
int similar = similar(X[i], Y[j]);
minDif = min(similar, minDif);
}
}
if(count == 0) return 0x7fffff;
return minDif/pow(count,3);
}
void find_most_similar_range(){
int minSimilar = 0x7fffff, minI = -1, minJ = -1;
for(int i = 0; i < N; i ++){
for(int j = i+1; j < N; j ++){
int similarity = calcSimilarity(A + i, j-i, B, M);
if (similarity < minSimilar)
{
minSimilar = similarity;
minI= i;
minJ = j;
}
}
}
printf("most similar Range is [%d, %d]", minI, minJ);
}

It will take O((N^M) * (N^2)).
It looks like the Big-O of the similarity calculation itself is N^2, given the pairwise comparison of each element.
So it looks more like
The pairwise comparison is M*(M-1). Each list has to be tested against each other list or about M^2.
This is a problem which has been solved for clustering, and there are data structures (e.g. a metric tree) which allow the distances between similar objects to be stored in a tree.
When looking for the N closest neighbours, searching this tree limits the number of pairwise comparisons needed and results in roughly O(ln(M)) behaviour.
The downside of this particular tree is that the similarity measure needs to be a metric: the distance between A and B and the distance between B and C must allow inferences to be made about the range of the distance between A and C (the triangle inequality).
If your similarity measure is not a metric, then this can't be done.
Jaccard distance is a distance metric, which allows it to be used in a metric tree.
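As a small illustration of a measure that does satisfy the metric requirement, here is a sketch of the Jaccard distance between two sets. The function name jaccardDistance and the choice of std::set<int> as the element container are assumptions made for this example; they do not come from the question's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Jaccard distance: 1 - |X intersect Y| / |X union Y|.
// Defined here as 0 when both sets are empty (an assumption for the edge case).
double jaccardDistance(const std::set<int>& x, const std::set<int>& y) {
    if (x.empty() && y.empty()) return 0.0;
    std::vector<int> common;
    std::set_intersection(x.begin(), x.end(), y.begin(), y.end(),
                          std::back_inserter(common));
    const std::size_t unionSize = x.size() + y.size() - common.size();
    return 1.0 - static_cast<double>(common.size()) / unionSize;
}
```

Identical sets give a distance of 0 and disjoint sets give 1, so the value can be fed directly to a structure that expects a bounded metric.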

Related

Multiplying Matrices with two for loops in C++ [duplicate]

I came up with this algorithm for matrix multiplication. I read somewhere that matrix multiplication has a time complexity of O(n^2).
But I think this algorithm of mine will give O(n^3).
I don't know how to calculate the time complexity of nested loops, so please correct me.
for i=1 to n
for j=1 to n
c[i][j]=0
for k=1 to n
c[i][j] = c[i][j]+a[i][k]*b[k][j]

Using linear algebra, there exist algorithms that achieve better complexity than the naive O(n^3). Strassen's algorithm achieves a complexity of O(n^2.807) by reducing the number of multiplications required for each 2x2 block product from 8 to 7.
The asymptotically fastest known algorithms are the Coppersmith-Winograd algorithm and its refinements, with a complexity of roughly O(n^2.37). Unless the matrix is huge, these algorithms do not result in a vast difference in computation time. In practice, it is easier and faster to use parallel algorithms for matrix multiplication.

The naive algorithm, which is what you've got once you correct it as noted in comments, is O(n^3).
There do exist algorithms that reduce this somewhat, but you're not likely to find an O(n^2) implementation. I believe the question of the most efficient implementation is still open.
See this wikipedia article on Matrix Multiplication for more information.

The standard way of multiplying an m-by-n matrix by an n-by-p matrix has complexity O(mnp). If all of those are "n" to you, it's O(n^3), not O(n^2). EDIT: it will not be O(n^2) in the general case. But there are faster algorithms for particular types of matrices -- if you know more you may be able to do better.
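To make the O(mnp) count concrete, here is a minimal sketch of the standard method that also counts the scalar multiplications it performs. The names Matrix and multiplyNaive, and the optional counter parameter, are invented for this illustration; the sketch assumes every row of the first matrix has as many entries as the second matrix has rows.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Multiplies an m-by-n matrix by an n-by-p matrix with the schoolbook method.
// If `multiplications` is non-null it receives the number of scalar products,
// which is exactly m * n * p.
Matrix multiplyNaive(const Matrix& a, const Matrix& b,
                     std::size_t* multiplications = nullptr) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    const std::size_t p = b.empty() ? 0 : b[0].size();
    Matrix c(m, std::vector<long long>(p, 0));
    std::size_t count = 0;
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
                ++count;
            }
    if (multiplications) *multiplications = count;
    return c;
}
```

For a 2-by-3 matrix times a 3-by-2 matrix the counter ends at 2 * 3 * 2 = 12, and with m = n = p the same loops perform n^3 multiplications.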

In matrix multiplication we use three nested for loops, and each loop runs n times, contributing a factor of O(n). So for three nested loops it becomes O(n^3).

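I recently had a matrix multiplication problem in my college assignment; this is how I solved it with only two explicit for loops. Note that the inner loop is restarted (j = -1) for every column of B, so the total work is still O(n^3), not O(n^2).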
import java.util.Scanner;
public class q10 {
public static int[][] multiplyMatrices(int[][] A, int[][] B) {
int ra = A.length; // rows in A
int ca = A[0].length; // columns in A
int rb = B.length; // rows in B
int cb = B[0].length; // columns in B
// if columns of A is not equal to rows of B, then the two matrices,
// cannot be multiplied.
if (ca != rb) {
System.out.println("Incorrect order, multiplication cannot be performed");
return A;
} else {
// AB is the product of A and B, and it will have rows
// equal to rows in A and columns equal to columns in B
int[][] AB = new int[ra][cb];
int k = 0; // column number of matrix B, while multiplying
int entry; // = Aij, value in ith row and at jth index
for (int i = 0; i < A.length; i++) {
entry = 0;
k = 0;
for (int j = 0; j < A[i].length; j++) {
// to evaluate a new Aij, clear the earlier entry
if (j == 0) {
entry = 0;
}
int currA = A[i][j]; // number selected in matrix A
int currB = B[j][k]; // number selected in matrix B
entry += currA * currB; // adding to the current entry
// if we are done with all the columns for this entry,
// reset the loop for next one.
if (j + 1 == ca) {
j = -1;
// put the evaluated value at its position
AB[i][k] = entry;
// increase the column number of matrix B as we are done with this one
k++;
}
// if this row is done break this loop,
// move to next row.
if (k == cb) {
j = A[i].length;
}
}
}
return AB;
}
}
@SuppressWarnings({ "resource" })
public static void main(String[] args) {
Scanner ip = new Scanner(System.in);
System.out.println("Input order of first matrix (r x c):");
int ra = ip.nextInt();
int ca = ip.nextInt();
System.out.println("Input order of second matrix (r x c):");
int rb = ip.nextInt();
int cb = ip.nextInt();
int[][] A = new int[ra][ca];
int[][] B = new int[rb][cb];
System.out.println("Enter values in first matrix:");
for (int i = 0; i < ra; i++) {
for (int j = 0; j < ca; j++) {
A[i][j] = ip.nextInt();
}
}
System.out.println("Enter values in second matrix:");
for (int i = 0; i < rb; i++) {
for (int j = 0; j < cb; j++) {
B[i][j] = ip.nextInt();
}
}
int[][] AB = multiplyMatrices(A, B);
System.out.println("The product of first and second matrix is:");
for (int i = 0; i < AB.length; i++) {
for (int j = 0; j < AB[i].length; j++) {
System.out.print(AB[i][j] + " ");
}
System.out.println();
}
}
}

C++ algorithm optimization: find K combination from N elements

I am pretty noobie with C++ and am trying to do some HackerRank challenges as a way to work on that.
Right now I am trying to solve Angry Children problem: https://www.hackerrank.com/challenges/angry-children
Basically, it asks to create a program that, given a set of N integers, finds the smallest possible "unfairness" for a K-length subset of that set. Unfairness is defined as the difference between the max and min of a K-length subset.
The way I'm going about it now is to find all K-length subsets and calculate their unfairness, keeping track of the smallest unfairness.
I wrote the following C++ program that seems to solve the problem correctly:
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <iostream>
using namespace std;
int unfairness = -1;
int N, K, minc, maxc, ufair;
int *candies, *subset;
void check() {
ufair = 0;
minc = subset[0];
maxc = subset[0];
for (int i = 0; i < K; i++) {
minc = min(minc,subset[i]);
maxc = max(maxc, subset[i]);
}
ufair = maxc - minc;
if (ufair < unfairness || unfairness == -1) {
unfairness = ufair;
}
}
void process(int subsetSize, int nextIndex) {
if (subsetSize == K) {
check();
} else {
for (int j = nextIndex; j < N; j++) {
subset[subsetSize] = candies[j];
process(subsetSize + 1, j + 1);
}
}
}
int main() {
cin >> N >> K;
candies = new int[N];
subset = new int[K];
for (int i = 0; i < N; i++)
cin >> candies[i];
process(0, 0);
cout << unfairness << endl;
return 0;
}
The problem is that HackerRank requires the program to come up with a solution within 3 seconds and that my program takes longer than that to find the solution for 12/16 of the test cases. For example, one of the test cases has N = 50 and K = 8; the program takes 8 seconds to find the solution on my machine. What can I do to optimize my algorithm? I am not very experienced with C++.

All you have to do is to sort all the numbers in ascending order and then get minimal a[i + K - 1] - a[i] for all i from 0 to N - K inclusively.
This works because, in an optimal subset, all the numbers are consecutive elements of the sorted array.

One suggestion I'd give is to sort the integer list before selecting subsets. This will dramatically reduce the number of subsets you need to examine. In fact, you don't even need to create subsets: simply look at the elements at index i (starting at 0) and i+k-1, and the lowest difference over all such pairs [in valid bounds] is your answer. So now instead of n choose k subsets (factorial runtime I believe) you just have to look at ~n subsets (linear runtime), and sorting (n log n) becomes your bottleneck in performance.
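The sort-then-scan idea from the answers above can be sketched as follows. The function name minUnfairness is made up for this example, and the sketch assumes 1 <= k <= values.size(); reading N, K and the candies from input is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Smallest (max - min) over all k-element subsets of `values`.
// Assumes 1 <= k <= values.size(). The vector is taken by value
// because it is sorted inside the function.
int minUnfairness(std::vector<int> values, std::size_t k) {
    std::sort(values.begin(), values.end());
    int best = values[k - 1] - values[0];
    // Slide a window of k consecutive sorted elements over the array.
    for (std::size_t i = 1; i + k <= values.size(); ++i)
        best = std::min(best, values[i + k - 1] - values[i]);
    return best;
}
```

The sort costs O(N log N) and the scan O(N), so N = 50 and K = 8 is handled immediately instead of enumerating every 8-element subset.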

Extract the n lowest sums from combinations of elements from m arrays for huge datasets

Let's say you have a number of unsorted arrays containing integers. Your job is to make sums of the arrays. The sums have to contain exactly one value from each array, i.e. (for 3 arrays)
sum = array1[2]+array2[12]+array3[4];
Goal: You should output the 20 combinations that generate the lowest possible sums.
The solution below is off-limits as the algorithm needs to be able to handle 10 arrays that can contain a huge number of integers. The following solution is way too slow for larger number of arrays:
//You already have int array1, array2 and array3
int top[20];
for(int i=0; i<20; i++)
top[i] = INT_MAX; // from <climits>; 1e99 would not fit in an int
int sum = 0;
for(int i=0; i<array1.size(); i++) //One for loop per array is trouble for
for(int j=0; j<array2.size(); j++) //increasing numbers of arrays
for(int k=0; k<array3.size(); k++)
{
sum = array1[i] + array2[j] + array3[k];
if (sum < top[19])
swapFunction(sum, top); //Function that adds sum to top
//and sorts top in increasing order
}
printResults(top); // Outputs top 20 lowest sums in increasing order
What would you do to achieve correct results more efficiently (with a lower Big O notation)?

The answer can be found by considering how to find the absolute lowest sum, and how to find the 2nd lowest sum and so on.
As you only need 20 sums at most, you only need the lowest 20 values from each array at most. I would recommend using std::partial_sort for this.
The rest can be accomplished with a priority_queue in which each element contains the current sum and the indices into the arrays for this sum. Simply take each index in indices and increase it by one, calculate the new sum and add that to the priority queue. The topmost item of the queue should always be the one with the lowest sum. Remove the lowest sum, generate the next possibilities, and then repeat until you have enough answers.
Assuming that the number of answers needed (k) is much less than the array size (N), the Big O should be dominated by the cost of partial_sort, roughly O(N*log(k)) per array, multiplied by the number of arrays.
Here's some basic code to demonstrate the idea. There are very likely ways of improving on this. For example, I'm sure that with some work, you could avoid adding the same set of indices multiple times, and thereby eliminate the need for the do-while pop.
for (size_t i = 0; i < arrays.size(); i++)
{
auto b = arrays[i].begin();
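// Clamp the middle iterator so arrays shorter than numAnswers are not overrun.
partial_sort(b, b + min<size_t>(numAnswers, arrays[i].size()), arrays[i].end());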
}
struct answer
{
answer(int s, vector<int> i)
: sum(s), indices(i)
{
}
int sum;
vector<int> indices;
bool operator <(const answer &o) const
{
return sum > o.sum;
}
};
auto getSum =[&arrays](const vector<int> &indices) {
auto retval = 0;
for (size_t i = 0; i < arrays.size(); i++)
{
retval += arrays[i][indices[i]];
}
return retval;
};
vector<int> initalIndices(arrays.size());
priority_queue<answer> q;
q.emplace(getSum(initalIndices), initalIndices );
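for (auto n = 0; n < numAnswers && !q.empty(); n++)
{
auto ans = q.top();
cout << ans.sum << endl;
// Pop every queued copy of this index set so it is reported only once.
do
{
q.pop();
} while (!q.empty() && q.top().indices == ans.indices);
for (size_t i = 0; i < ans.indices.size(); i++)
{
auto nextIndices = ans.indices;
nextIndices[i]++;
// Skip moves that would step past the end of an array.
if (static_cast<size_t>(nextIndices[i]) >= arrays[i].size())
continue;
q.emplace(getSum(nextIndices), nextIndices);
}
}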
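One possible way to avoid queuing the same index set more than once, as the answer suggests, is to remember every index tuple that has already been pushed. The sketch below is an assumption about how that could look, not the answer's own code: lowestSums, Indices and Entry are names invented here, the arrays are assumed to be non-empty, and a std::set is used as the "already seen" record.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Returns up to `count` lowest sums that use exactly one value from each array.
// Assumes every array is non-empty. Arrays are taken by value because they are
// partially sorted and truncated inside the function.
std::vector<long long> lowestSums(std::vector<std::vector<int>> arrays,
                                  std::size_t count) {
    if (count == 0) return {};
    for (auto& a : arrays) {
        const std::size_t keep = std::min(count, a.size());
        std::partial_sort(a.begin(), a.begin() + keep, a.end());
        a.resize(keep);  // only the `count` smallest values can ever be needed
    }
    using Indices = std::vector<std::size_t>;
    using Entry = std::pair<long long, Indices>;  // (sum, index into each array)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue;
    std::set<Indices> seen;  // index tuples that were already pushed

    Indices start(arrays.size(), 0);
    long long startSum = 0;
    for (const auto& a : arrays) startSum += a[0];
    queue.emplace(startSum, start);
    seen.insert(start);

    std::vector<long long> result;
    while (result.size() < count && !queue.empty()) {
        Entry current = queue.top();
        queue.pop();
        result.push_back(current.first);
        for (std::size_t i = 0; i < arrays.size(); ++i) {
            Indices next = current.second;
            if (++next[i] >= arrays[i].size()) continue;  // ran off this array
            if (!seen.insert(next).second) continue;      // already queued
            const long long sum =
                current.first - arrays[i][next[i] - 1] + arrays[i][next[i]];
            queue.emplace(sum, std::move(next));
        }
    }
    return result;
}
```

Because each index tuple enters the queue at most once, every pop yields a distinct combination and the do-while pop is no longer needed. The price is the extra memory for the set, which holds at most count * (number of arrays) + 1 tuples.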

min average of multiple array segments

Given an array A, I want to find the first index of the segment whose average is the minimum among all segments.
Example: A = {1, 1, 3, 4, 5, 6, 7}
segment (1,2) = {1,1} ==> avg = (1+1)/2 = 1
segment (1,3) = {1,1,3} ==> avg = (1+1+3)/3 = 1.67
etc..
________________________________________________
input: {1, 1, 3, 4, 5, 6, 7}
output: 1
Explanation: the min avg is 1 hence the output should be the first index of that segment (1,2) which is: 1 in this case.
My current code looks like this:
int getIndex(vector<int> A)
{
if (A.size() <= 2) return 0; /*if array is a single segment then index:0 is the answer.*/
vector<int> psums; psums.push_back(A[0]);
for(size_t i =1; i< A.size(); i++)
{
psums.push_back(psums[i-1] + A[i]);
}
float min = 1111111111; /*assuming this is a max possible numb*/
int index;
float avg;
for(size_t i =1; i< psums.size(); i++)
{
for(size_t j = 0; j < i; j++)
{
avg = (float)((psums[i] - psums[j]) / (i-j+1));
if (min > std::min(min, avg))
{
min = std::min(min, avg);
index = j;
}
}
}
return index;
}
This code returns incorrect value. Thoughts?

Ok, I have some time, so here is the code you are hopefully looking for (and hopefully it compiles as well -- I have no chance to test that before I posted it, so give me a few minutes to check afterwards -- ok, compiles now with gcc-4.7.2):
#include<vector>
#include<tuple>
#include<functional>
#include<algorithm>
#include<iostream>
#include <numeric>
size_t getIndex(const std::vector<int>& A) //pass by const ref instead of by value!
{
size_t n=A.size();
//define vector to store the averages and bounds
std::vector<std::tuple<size_t, size_t, double> > store;
for(size_t iend=0;iend<n;++iend)
for(size_t ibegin=0;ibegin<iend;++ibegin) //changed from <= to < as segments need to have length >=2
{
//compute average: sum all elements from ibegin to iend inclusive (accumulating into the double 0.0) and divide by the number of elements
double average=std::accumulate(A.begin()+ibegin, A.begin()+iend+1,0.0)/(iend-ibegin+1);
//store the result with bounds
store.push_back(std::make_tuple(ibegin,iend,average));
}
//define lambda which compares the tuple according to the third component
auto compare_third_element=[](const std::tuple<size_t,size_t,double> &t1, const std::tuple<size_t,size_t,double> &t2){return std::get<2>(t1)<std::get<2>(t2);};
//now find the iterator to the minimum element
auto it=std::min_element(store.begin(),store.end(),compare_third_element);
//print range
//std::cout<<"("<<std::get<0>(*it)<<","<<std::get<1>(*it)<<")"<<std::endl;
return std::get<0>(*it);
}
int main()
{
std::vector<int> A={1,1,2,3,4,5,6,7};
std::cout << getIndex(A);
return 0;
}
Caution: there might be more than a single segment which yields the minimum average. For the example above, the function prints the segment (0,0) on the screen, since it contains the minimum element 1. If you want to obtain the range you are looking for, either use std::nth_element to access the next entries, or change the comparison function (e.g. give longer tuples a higher priority).

Both (psums[i] - psums[j]) and (i-j+1) are integers. Division between them is integer division, so you get only the whole part of the result. Cast one of the operands to float or double, like this:
(float)(psums[i] - psums[j])/(i-j+1)
The type of division in a/b depends on the types of a and b, not on the variable you put the result in!
Note: std::min(min, avg) is not required; just use avg instead.
Edit: psums[i]-psums[j]=A[i]+A[i-1]+...+A[j+1], because psums[i]=A[0]+A[1]+A[2]+..+A[i], so
the avg line should look like this:
avg=(float)(psums[i] - psums[j] + A[j])/(i-j+1)
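Putting both corrections together (floating-point division and the missing A[j] term), a prefix-sum version might look like the sketch below. It is an illustration rather than the asker's final code: the name minAverageStart is invented, the prefix array is given a leading zero so that no "+ A[j]" correction is needed, segments are assumed to need at least two elements, and the returned index is 0-based (the question's example counts from 1).

```cpp
#include <cstddef>
#include <vector>

// Returns the 0-based start index of the segment (length >= 2) with the
// smallest average. Ties go to the smallest start index.
// Returns 0 when the array has fewer than 3 elements.
std::size_t minAverageStart(const std::vector<int>& a) {
    const std::size_t n = a.size();
    if (n <= 2) return 0;
    // prefix[i] = a[0] + ... + a[i-1], so sum(a[j..i]) = prefix[i+1] - prefix[j].
    std::vector<long long> prefix(n + 1, 0);
    for (std::size_t i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + a[i];

    std::size_t bestStart = 0;
    double bestAvg = static_cast<double>(prefix[2] - prefix[0]) / 2.0;
    for (std::size_t j = 0; j + 1 < n; ++j)
        for (std::size_t i = j + 1; i < n; ++i) {
            const double avg =
                static_cast<double>(prefix[i + 1] - prefix[j]) / (i - j + 1);
            if (avg < bestAvg) {  // strict comparison keeps the earliest start
                bestAvg = avg;
                bestStart = j;
            }
        }
    return bestStart;
}
```

For the question's input {1, 1, 3, 4, 5, 6, 7} this returns 0, which is index 1 in the question's 1-based numbering.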

Interesting Problem (Currency arbitrage)

Arbitrage is the process of using discrepancies in currency exchange values to earn profit.
Consider a person who starts with some amount of currency X, goes through a series of exchanges and finally ends up with more amount of X(than he initially had).
Given n currencies and a table (nxn) of exchange rates, devise an algorithm that a person should use to avail maximum profit assuming that he doesn't perform one exchange more than once.
I have thought of a solution like this:
Use modified Dijkstra's algorithm to find single source longest product path.
This gives longest product path from source currency to each other currency.
Now, iterate over each other currency and multiply to the maximum product so far, w(curr,source)(weight of edge to source).
Select the maximum of all such paths.
While this appears good, I still doubt the correctness of this algorithm and the complexity class of the problem (i.e. is the problem NP-complete?), as it somewhat resembles the traveling salesman problem.
Looking for your comments and better solutions(if any) for this problem.
Thanks.
EDIT:
A Google search for this topic took me to this here, where arbitrage detection has been addressed but finding the exchanges for maximum arbitrage has not. This may serve as a reference.

Dijkstra's cannot be used here because there is no way to modify Dijkstra's to return the longest path, rather than the shortest. In general, the longest path problem is in fact NP-complete as you suspected, and is related to the Travelling Salesman Problem as you suggested.
What you are looking for (as you know) is a cycle whose product of edge weights is greater than 1, i.e. w1 * w2 * w3 * ... > 1. We can reimagine this problem to change it to a sum instead of a product if we take the logs of both sides:
log (w1 * w2 * w3 ... ) > log(1)
=> log(w1) + log(w2) + log(w3) ... > 0
And if we take the negative log...
=> -log(w1) - log(w2) - log(w3) ... < 0 (note the inequality flipped)
So we are now just looking for a negative cycle in the graph, which can be found using the Bellman-Ford algorithm (or, if you don't need to know the path, the Floyd-Warshall algorithm).
First, we transform the graph:
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
w[i][j] = -log(w[i][j]);
Then we perform a standard Bellman-Ford
double dis[N]; int pre[N]; // pre holds node indices, so it should be int
for (int i = 0; i < N; ++i)
dis[i] = INF, pre[i] = -1;
dis[source] = 0;
for (int k = 0; k < N; ++k)
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
if (dis[i] + w[i][j] < dis[j])
dis[j] = dis[i] + w[i][j], pre[j] = i;
Now we check for negative cycles:
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
if (dis[i] + w[i][j] < dis[j])
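// Node j is on, or reachable from, a negative cycle
You can then use the pre array to find the negative cycle. Start from such a node j, follow pre[] backwards N times to be sure you are on the cycle itself, and then keep following it until a node repeats.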
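The cycle recovery step can be sketched as below. This is a hedged illustration, not the answer's exact code: findArbitrageCycle is a made-up name, the distances all start at 0 (which behaves like a virtual source connected to every currency, so any negative cycle is found, not only those reachable from one source), and a small tolerance of 1e-12 is an assumption added to ignore floating-point noise.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the currencies of one arbitrage cycle in trading order (first and
// last entries are the same currency), or an empty vector if none is found.
// rate[i][j] is the amount of currency j received for one unit of currency i.
std::vector<int> findArbitrageCycle(const std::vector<std::vector<double>>& rate) {
    const int n = static_cast<int>(rate.size());
    std::vector<double> dis(n, 0.0);  // all zeros: virtual source to every node
    std::vector<int> pre(n, -1);
    int last = -1;  // node relaxed most recently in the current pass
    for (int pass = 0; pass < n; ++pass) {
        last = -1;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                const double w = -std::log(rate[i][j]);
                if (dis[i] + w < dis[j] - 1e-12) {
                    dis[j] = dis[i] + w;
                    pre[j] = i;
                    last = j;
                }
            }
        if (last == -1) return {};  // nothing changed: no negative cycle
    }
    // A relaxation in the n-th pass means `last` is on, or reachable from, a
    // negative cycle. Stepping back n times lands on the cycle itself.
    for (int step = 0; step < n; ++step) last = pre[last];
    std::vector<int> cycle;
    for (int v = last;; v = pre[v]) {
        cycle.push_back(v);
        if (v == last && cycle.size() > 1) break;
    }
    std::reverse(cycle.begin(), cycle.end());  // pre[] points backwards
    return cycle;
}
```

Multiplying the rates along the returned sequence gives a product greater than 1. Note that this finds some profitable cycle, not necessarily the most profitable one, which is the harder problem the question asks about.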

The fact that it is an NP-hard problem doesn't really matter when there are only about 150 currencies currently in existence, and I suspect your FX broker will only let you trade at most 20 pairs anyway. My algorithm for n currencies is therefore:
1. Make a tree of depth n and branching factor n. The nodes of the tree are currencies and the root of the tree is your starting currency X. Each link between two nodes (currencies) has weight w, where w is the FX rate between the two currencies.
2. At each node you should also store the cumulative FX rate (calculated by multiplying all the FX rates above it in the tree together). This is the FX rate between the root (currency X) and the currency of this node.
3. Iterate through all the nodes in the tree that represent currency X (maybe you should keep a list of pointers to these nodes to speed up this stage of the algorithm). There will only be n^n of these (very inefficient in terms of big-O notation, but remember your n is about 20). The one with the highest cumulative FX rate is your best FX rate and (if it is greater than 1) the path through the tree between these nodes represents an arbitrage cycle starting and ending at currency X.
Note that you can prune the tree (and so reduce the complexity from O(n^n) to O(n)) by following these rules when generating the tree in step 1:
- If you get to a node for currency X, don't generate any child nodes.
- To reduce the branching factor from n to 1, at each node generate all n child nodes and only add the child node with the greatest cumulative FX rate (when converted back to currency X).

Imho, there is a simple mathematical structure to this problem that lends itself to a very simple O(N^3) Algorithm. Given a NxN table of currency pairs, the reduced row echelon form of the table should yield just 1 linearly independent row (i.e. all the other rows are multiples/linear combinations of the first row) if no arbitrage is possible.
We can just perform Gaussian elimination and check whether we get just 1 linearly independent row. If not, the extra linearly independent rows will give information about the number of currency pairs available for arbitrage.

Take the log of the conversion rates. Then you are trying to find the cycle starting at X with the largest sum in a graph with positive, negative or zero-weighted edges. This is an NP-hard problem, as the simpler problem of finding the largest cycle in an unweighted graph is NP-hard.

Unless I totally messed this up, I believe my implementation works using Bellman-Ford algorithm:
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
std::vector<std::vector<double>> transform_matrix(std::vector<std::vector<double>>& matrix)
{
int n = matrix.size();
int m = matrix[0].size();
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < m; ++j)
{
// Negate the log so that a cycle with product > 1 becomes a negative cycle.
matrix[i][j] = -log(matrix[i][j]);
}
}
return matrix;
}
bool is_arbitrage(std::vector<std::vector<double>>& currencies)
{
std::vector<std::vector<double>> tm = transform_matrix(currencies);
// Bellman-ford algorithm
int src = 0;
int n = tm.size();
std::vector<double> min_dist(n, INFINITY);
min_dist[src] = 0.0;
for (int i = 0; i < n - 1; ++i)
{
for (int j = 0; j < n; ++j)
{
for (int k = 0; k < n; ++k)
{
if (min_dist[k] > min_dist[j] + tm[j][k])
min_dist[k] = min_dist[j] + tm[j][k];
}
}
}
for (int j = 0; j < n; ++j)
{
for (int k = 0; k < n; ++k)
{
if (min_dist[k] > min_dist[j] + tm[j][k])
return true;
}
}
return false;
}
int main()
{
std::vector<std::vector<double>> currencies = { {1, 1.30, 1.6}, {.68, 1, 1.1}, {.6, .9, 1} };
if (is_arbitrage(currencies))
std::cout << "There exists an arbitrage!" << "\n";
else
std::cout << "There does not exist an arbitrage!" << "\n";
std::cin.get();
}
