KNN search, growing set, arbitrary norm


Consider the following problem in E3 with an arbitrary norm. The L1 norm is used here as an example; other metrics (Hamming, Karlsruhe, geodesic, ...) are equally applicable.
namespace {
typedef boost::tuple<double, double, double> Point;
double nL1(const Point& p1, const Point& p2) { //L1 norm, example
double dx = p1.get<0>() - p2.get<0>();
double dy = p1.get<1>() - p2.get<1>();
double dz = p1.get<2>() - p2.get<2>();
return std::fabs(dx) + std::fabs(dy) + std::fabs(dz); // std::fabs (from <cmath>) avoids the integer abs overload
}
}
There are two sets, A and B. A is initialized with n random points, where n is relatively large (n = 1*10^7 to 1*10^9), and B is empty; see the sample code:
int main() {
using namespace std;
using namespace boost::lambda;
vector<Point> A, B;
int n = 10000000;
for (int i = 0; i < n; i++) //Create random points
A.push_back(boost::make_tuple(rand(), rand(), rand()));
Initially, we copy a random point from A into B. In this simplified example, the first point A[0] is used:
B.push_back(A[0]);
Subsequently, for i = 1, ..., n - 1 we repeat these steps:
Find the point in B nearest to A[i] according to the given norm:
B* = argmin_{b in B} |A[i] - b|
If |B* - A[i]| < eps, then B.push_back(A[i]). In other words, add A[i] to B when it is sufficiently close (here, eps = 1.0*10^4 is used).
For the nearest-neighbor search, I am using std::partial_sort.
const int k = 1;
for (int i = 1; i < A.size(); i++) {
partial_sort(B.begin(), B.begin() + k, B.end(), bind(less<double>(), bind(nL1, _1, A[i]), bind(nL1, _2, A[i])));
if (nL1(*B.begin(), A[i]) < 1e4) B.push_back(A[i]); //Some threshold, eps=1.0*10^4
}
}
B grows continuously, so each search becomes more expensive. Repeated inside the loop, it is too slow even for small sets (n = 1*10^6). Here, the partial sort is inefficient.
Can the speed be improved significantly? A naive linear scan can of course be used, but it is not faster.
How can the nearest-neighbor search be sped up?
Another problem appears when the second nearest point is also required.
Existing k-NN search libraries cannot be used because of the arbitrary norm (the problem can be solved on the sphere). I tried nanoflann, but it does not support some specific norms.
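The procedure described above only needs the distance to the nearest neighbor, so it can be sketched as a single pass over B that never reorders it. This is an illustration, not a sub-linear search structure: each query still costs O(|B|), and whether it beats the posted loop would have to be measured. `Point3`, `l1_distance`, `nearest_distance` and `grow_set` are hypothetical names introduced here, and a plain struct stands in for the `boost::tuple` of the question.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical plain struct standing in for the boost::tuple Point of the question.
struct Point3 { double x, y, z; };

// L1 distance; std::fabs guarantees the floating-point overload is used.
double l1_distance(const Point3& p, const Point3& q) {
    return std::fabs(p.x - q.x) + std::fabs(p.y - q.y) + std::fabs(p.z - q.z);
}

// Distance from `query` to its nearest neighbor in `set` under `norm`.
// One pass, no reordering of `set`; returns +infinity for an empty set.
template <typename Norm>
double nearest_distance(const std::vector<Point3>& set, const Point3& query, Norm norm) {
    double best = std::numeric_limits<double>::infinity();
    for (const Point3& candidate : set) {
        const double d = norm(candidate, query);
        if (d < best) best = d;
    }
    return best;
}

// Builds B from A as described: start with A[0], then append every A[i]
// whose nearest neighbor in the current B is closer than eps.
template <typename Norm>
std::vector<Point3> grow_set(const std::vector<Point3>& A, double eps, Norm norm) {
    std::vector<Point3> B;
    if (A.empty()) return B;
    B.push_back(A[0]);
    for (std::size_t i = 1; i < A.size(); ++i) {
        if (nearest_distance(B, A[i], norm) < eps) B.push_back(A[i]);
    }
    return B;
}
```

Any callable taking two points and returning a double can be passed as `norm`, for example `grow_set(A, 1e4, l1_distance)`, so the arbitrary-norm requirement is kept.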

Related

Composite Simpson's Rule in C++

I've been trying to write a function to approximate the value of an integral using the Composite Simpson's Rule.
template <typename func_type>
double simp_rule(double a, double b, int n, func_type f){
int i = 1; double area = 0;
double n2 = n;
double h = (b-a)/(n2-1), x=a;
while(i <= n){
area = area + f(x)*pow(2,i%2 + 1)*h/3;
x+=h;
i++;
}
area -= (f(a) * h/3);
area -= (f(b) * h/3);
return area;
}
What I do is multiply each value of the function by either 2 or 4 (and h/3) with pow(2,i%2 + 1) and subtract off the edges as these should only have a weight of 1.
At first, I thought it worked just fine, however, when I compared it to my Trapezoidal Method function it was way more inaccurate which shouldn't be the case.
This is a simpler version of code I previously wrote, which had the same problem. I thought that if I cleaned it up a little the problem would go away, but alas. From another post, I get the idea that something is going on with the types and the operations I'm doing on them that results in a loss of precision, but I just don't see it.
Edit:
For completeness, I was running it for e^x from 0 to 1
// function to be approximated
double f(double x){ double a = exp(x); return a; }
int main() {
int n = 11; //this method works best for odd values of n
double e = exp(1);
double exact = e-1; //value of integral of e^x from 0 to 1
cout << simp_rule(0,1,n,f) - exact;
}

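Simpson's rule uses this approximation to estimate a definite integral:
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(x_0) + 4\sum_{i=1,3,5,\dots}^{n-1} f(x_i) + 2\sum_{i=2,4,6,\dots}^{n-2} f(x_i) + f(x_n)\right]$$
where
$$h = \frac{b - a}{n}$$
and
$$x_i = a + i\,h$$
so that there are n + 1 equally spaced sample points x_i.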
In the posted code, the parameter n passed to the function appears to be the number of points where the function is sampled (while in the previous formula n is the number of intervals, that's not a problem).
The (constant) distance between the points is calculated correctly
double h = (b - a) / (n - 1);
The while loop used to sum the weighted contributions of all the points iterates from x = a up to a point with an abscissa close to b, but probably not exactly b, due to rounding errors. This implies that the last calculated value of f, f(x_n), may be slightly different from the expected f(b).
This is nothing, though, compared to the error caused by the fact that those end points are summed inside the loop with the starting weight of 4 and then subtracted after the loop with weight 1, while all the inner points have their weights switched. As a matter of fact, this is what the code calculates for an odd number of sample points, as in the posted example (n here is the number of intervals, i.e. the posted parameter minus one):
$$\frac{h}{3}\left[3 f(x_0) + 2\sum_{i=1,3,5,\dots}^{n-1} f(x_i) + 4\sum_{i=2,4,6,\dots}^{n-2} f(x_i) + 3 f(x_n)\right]$$
Also, using
pow(2, i%2 + 1)
to generate the sequence 4, 2, 4, 2, ..., 4 is a waste in terms of efficiency, and may add (depending on the implementation) other unnecessary rounding errors.
The following algorithm shows how to obtain the same (fixed) result, without a call to that library function.
template <typename func_type>
double simpson_rule(double a, double b,
int n, // Number of intervals
func_type f)
{
double h = (b - a) / n;
// Internal sample points, there should be n - 1 of them
double sum_odds = 0.0;
for (int i = 1; i < n; i += 2)
{
sum_odds += f(a + i * h);
}
double sum_evens = 0.0;
for (int i = 2; i < n; i += 2)
{
sum_evens += f(a + i * h);
}
return (f(a) + f(b) + 2 * sum_evens + 4 * sum_odds) * h / 3;
}
Note that this function requires the number of intervals (e.g. use 10 instead of 11 to obtain the same results of OP's function) to be passed, not the number of points.
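Composite Simpson's rule needs an even number of intervals, so a caller passing 11 by habit would silently get a wrong weighting. The sketch below is a variant of the function above with that precondition checked, plus a small helper that measures the error for e^x on [0, 1]. The names `simpson_checked` and `simpson_error_exp` and the choice to throw `std::invalid_argument` are assumptions made for this illustration.

```cpp
#include <cmath>
#include <stdexcept>

// Composite Simpson's rule over [a, b] with n intervals.
// n must be positive and even; otherwise std::invalid_argument is thrown.
template <typename Func>
double simpson_checked(double a, double b, int n, Func f) {
    if (n <= 0 || n % 2 != 0)
        throw std::invalid_argument("Simpson's rule needs a positive, even number of intervals");
    const double h = (b - a) / n;
    double sum_odds = 0.0, sum_evens = 0.0;
    for (int i = 1; i < n; i += 2) sum_odds += f(a + i * h);   // weight 4
    for (int i = 2; i < n; i += 2) sum_evens += f(a + i * h);  // weight 2
    return (f(a) + f(b) + 2 * sum_evens + 4 * sum_odds) * h / 3;
}

// Absolute error for the integral of e^x over [0, 1], whose exact value is e - 1.
double simpson_error_exp(int n) {
    const double exact = std::exp(1.0) - 1.0;
    const double approx = simpson_checked(0.0, 1.0, n, [](double x) { return std::exp(x); });
    return std::fabs(approx - exact);
}
```

Calling `simpson_error_exp(10)` and then `simpson_error_exp(20)` should show the error shrinking as the number of intervals doubles, which is a quick sanity check against the trapezoidal comparison mentioned in the question.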

The above excellent and accepted solution could benefit from liberal use of std::fma() and from being templated on the floating-point type.
https://en.cppreference.com/w/cpp/numeric/math/fma
#include <cmath>
template <typename fptype, typename func_type>
fptype simpson_rule(fptype a, fptype b,
int n, // Number of intervals
func_type f)
{
fptype h = (b - a) / n;
// Internal sample points, there should be n - 1 of them
fptype sum_odds = 0.0;
for (int i = 1; i < n; i += 2)
{
sum_odds += f(std::fma(i,h,a));
}
fptype sum_evens = 0.0;
for (int i = 2; i < n; i += 2)
{
sum_evens += f(std::fma(i,h,a));
}
return (std::fma(2,sum_evens,f(a)) +
std::fma(4,sum_odds,f(b))) * h / 3;
}

Sum of partial group of {x^0, x^1, ..., x^y}

I want to write a program whose inputs are the integer values x and y,
and then:
Let s be the set {x^0, x^1, …, x^y}; store it in an array.
Repeat:
Partition the set s into two subsets: s1 and s2.
Find the sum of each of the two subsets and store them in variables such as sum1 and sum2.
Calculate the product sum1 * sum2.
The program ends after going over all the partitions that can be formed, and then prints the maximum value of the product sum1 * sum2.
Example: suppose x = 2 and y = 3, so s = {1, 2, 4, 8}. One of the partitions is s1 = {1, 4}, s2 = {2, 8}, giving sum1 = 5, sum2 = 10 and a product of 50. That is compared to the other products calculated in the same way, such as s1 = {1}, s2 = {2, 4, 8} with sum1 = 1, sum2 = 14 and a product of 14, and so on.
My code so far:
#include <iostream>
using namespace std;
int main ()
{
int a[10000]; // Max value expected.
int x;
int y;
cin >> x;
cin >> y;
int xexpy = 1;
int k;
for (int i = 0; i <= y; i++)
{
xexpy = 1;
k = i;
while(k > 0)
{
xexpy = xexpy * x;
k--;
}
cout << "\n" << xexpy;
a[i] = xexpy;
}
return 0;
}

This is not a programming problem, it is a combinatorics problem with a theoretical rather than an empirical approach to its solution. You can simply print the correct solution and not bother iterating over any partitions.
Why is that?
Let
$$z = \frac{\mathrm{sum}(s_1)}{\mathrm{sum}(s)}$$
i.e. z is the fraction of the sum of all elements of s that is in s1. It holds that
$$\frac{\mathrm{sum}(s_2)}{\mathrm{sum}(s)} = 1 - z$$
and thus the product of the two sums satisfies:
$$\mathrm{sum}(s_1) \cdot \mathrm{sum}(s_2) = z\,(1 - z) \cdot \mathrm{sum}(s)^2$$
As a function of z (not of x and y), this is a parabola that takes its maximum at z = 1/2, and there are no other local maxima, i.e. getting closer to 1/2 necessarily increases the product. Thus what you want to do is partition the full set so that the sums of s1 and s2 are each as close as possible to half the sum of all elements.
In general, you might have had to use programming to consider multiple subsets, but your elements are given by a formula, the formula of a geometric sequence, and that makes a search unnecessary.
First, let's assume x >= 2 and y >= 2, otherwise this is not an interesting problem.
Now, for x >= 2, we know that
$$\sum_{i=0}^{y-1} x^i = \frac{x^y - 1}{x - 1}$$
(the sum of a geometric sequence), and thus
$$\sum_{i=0}^{y-1} x^i \le x^y - 1 < x^y$$
i.e. the last element always outweighs all the other elements put together. That's why you always want to choose {x^y} as s1 and all the other elements as s2. No need to run any program. You can then also easily calculate the optimum product of sums.
Note: If we make no assumptions about the elements of s, except that they're non-negative integers, finding the optimum solution is an optimization version of the Partition problem, which is NP-complete. That means, very roughly, that no known solution is fundamentally much more efficient than just trying all possible combinations.
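The argument above can be checked on small inputs by comparing the closed-form answer, x^y times (x^0 + ... + x^(y-1)), against an exhaustive search over all subsets. This is only a sketch: the function names are hypothetical, and it assumes x >= 2, y >= 1 and values small enough that nothing overflows 64 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Closed form: s1 = {x^y}, s2 = {x^0, ..., x^(y-1)}.
// Assumes x >= 2, y >= 1, and that the result fits in 64 bits.
std::uint64_t best_product_closed_form(std::uint64_t x, unsigned y) {
    std::uint64_t rest = 0;   // accumulates x^0 + ... + x^(y-1)
    std::uint64_t power = 1;  // current x^i
    for (unsigned i = 0; i < y; ++i) {
        rest += power;
        power *= x;
    }
    return power * rest;  // power is now x^y
}

// Brute force over all 2^(y+1) subsets, for cross-checking small inputs only.
std::uint64_t best_product_brute_force(std::uint64_t x, unsigned y) {
    std::vector<std::uint64_t> s;
    std::uint64_t power = 1, total = 0;
    for (unsigned i = 0; i <= y; ++i) {
        s.push_back(power);
        total += power;
        power *= x;
    }
    std::uint64_t best = 0;
    const std::uint64_t subsets = std::uint64_t{1} << s.size();
    for (std::uint64_t mask = 0; mask < subsets; ++mask) {
        std::uint64_t sum1 = 0;  // sum of the elements selected by mask
        for (std::size_t i = 0; i < s.size(); ++i)
            if (mask & (std::uint64_t{1} << i)) sum1 += s[i];
        const std::uint64_t product = sum1 * (total - sum1);
        if (product > best) best = product;
    }
    return best;
}
```

For the example in the question (x = 2, y = 3) both functions give 8 * 7 = 56, which beats the 50 and 14 computed there by hand.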

Here's a cheesy all-combinations-of-supplied-arguments generator, provided without comment or explanation because I think this is homework, and the exercise of understanding how and why this does what it does is the point here.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main(int c, const char **v)
{
basic_string<const char *> options(v);
auto N(options.length());
for (auto n = 1u; n < N; ++n) {
vector<char> pick(N);
fill_n(pick.rbegin(), n, 1);
do for (auto j=1u; j<N; ++j)
if (pick[j])
cout << options[j]<<' ';
while (cout<<'\n', next_permutation(begin(pick)+1, end(pick)));
}
}

My code is very slow when I set a large number n. I do not know how to set up the loops.

c++ compile very slow for 2d vector
std::vector< vector<double> > V(n, vector<double> (n));
double sum2=0;
for(int i=0; i<n; i++)
{
double xai=xa1+i*dxa;
double dxr=(double)(xr2-xr1)/n;
double sum1=0;
for(int j=0; j<n; j++){
double xri=xr1+dxr*j;
V[i][j]=fun(xri,xai);
double rect1=V[i][j]*dxr;
sum1+=rect1;
}
double rect2=sum1*dxa;
sum2+=rect2;
}
return sum2;

This code integrates the two-dimensional function (1/(2*pi))*exp(-xr^2/2)*exp(-xa^2/2).
The integral of this function equals 1 over infinite limits, so in C++ we have to widen the limits and increase n to get a result equal to 1, as the theory predicts.
If we apply Newton–Cotes quadrature to an integral over an infinite range, we need to cut off the lower and upper bounds of the integral.
The integrand must be negligibly small at the cut-off points.
Which values did you select?
The integrand in your problem is Gaussian and decreases rapidly, for example
exp(-10*10/2) ~ 1.93 * 10^(-22)
which is negligible in the present integration.
Thus, if we cut off the lower and upper bounds at -10 and +10, respectively, and use enough points in this range, we should get a precise result.
I actually got a quite precise result with 100x100 points using the following trapezoidal quadrature.
This quadrature is the simplest one.
My test code is below.
1 dimensional integration:
template<typename F>
double integrate_trapezoidal(F func, std::size_t n, double lowerBnd, double upperBnd)
{
if(lowerBnd == upperBnd){
return 0.0;
}
auto integral = 0.0;
auto x = lowerBnd;
const auto dx = (upperBnd - lowerBnd)/n;
auto left = func(x);
for(std::size_t i = 0; i<n; ++i)
{
x += dx;
const auto right = func(x);
integral += (left + right);
left = right;
}
integral *= (0.5*dx);
return integral;
}
2 dimensional integration:
template<typename F>
double integrate_trapezoidal_2dim(
F func_2dim,
std::size_t n,
double x_lowerBnd, double x_upperBnd,
double y_lowerBnd, double y_upperBnd)
{
auto func = [&](double x)
{
return integrate_trapezoidal(
std::bind(func_2dim, x, std::placeholders::_1),
n, y_lowerBnd, y_upperBnd);
};
return integrate_trapezoidal(func, n, x_lowerBnd, x_upperBnd);
}
I am worried that you set finite but very large upper and lower bounds.
In that case, you need many more points overall to keep enough points in the range -10 < x < +10.
Finally, there are various other quadratures for numerical integration.
If you insert some function into this Gaussian integrand, then Gauss–Hermite quadrature or the fast Gauss transform (FGT) would be recommended.
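As a self-contained illustration of the truncation argument above, the sketch below integrates the Gaussian from the question over [-10, 10] x [-10, 10] with 100 intervals per axis. It evaluates the integrand on the fly instead of filling an n-by-n table. The names `trapezoid_1d`, `trapezoid_2d`, `gaussian_2d` and `gaussian_integral_demo` are hypothetical, lambdas are used in place of `std::bind`, and n is assumed to be at least 1.

```cpp
#include <cmath>
#include <cstddef>

// Composite trapezoidal rule for a one-argument callable on [lower, upper].
// Assumes n >= 1.
template <typename F>
double trapezoid_1d(F func, std::size_t n, double lower, double upper) {
    const double dx = (upper - lower) / n;
    double sum = 0.5 * (func(lower) + func(upper));  // end points carry half weight
    for (std::size_t i = 1; i < n; ++i) sum += func(lower + i * dx);
    return sum * dx;
}

// Nested trapezoidal rule on a square domain; no n-by-n table of samples is stored.
template <typename F>
double trapezoid_2d(F func, std::size_t n, double lower, double upper) {
    return trapezoid_1d(
        [&](double x) {
            return trapezoid_1d([&](double y) { return func(x, y); }, n, lower, upper);
        },
        n, lower, upper);
}

// The integrand from the question: (1 / (2*pi)) * exp(-xr^2/2) * exp(-xa^2/2).
double gaussian_2d(double xr, double xa) {
    const double two_pi = 6.283185307179586;
    return std::exp(-0.5 * (xr * xr + xa * xa)) / two_pi;
}

// Integral over [-10, 10] x [-10, 10] using 100 intervals along each axis.
double gaussian_integral_demo() {
    return trapezoid_2d(gaussian_2d, 100, -10.0, 10.0);
}
```

The value returned by `gaussian_integral_demo()` should come out very close to 1, and the work is 100 x 100 function evaluations rather than something that grows with ever wider limits.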

Find Middle points(Computational geometry c++)

This program should read an integer N as input, followed by the x and y coordinates of N points,
and return the number of points that are the midpoints of any two other points in the set.
First the program stores the points in an array. Then we loop through the points and calculate the distance between points[i] and every other point. We sort the points by that distance; if any two points have the same distance, we check whether points[i] is aligned with them, and if so we store points[i] in the middles list.
We then remove duplicates from the list and return its size.
I submitted my solution and it doesn't work for all cases. Please help:
#include <iostream>
#include <cmath>
#include <algorithm>
#include <list>
#include <stdio.h>
using namespace std;
struct Point
{
int x;
int y;
int distance;
};
bool PointSort(Point a,Point b);
bool colinear(Point a,Point b,Point c);
bool same_point (Point first, Point second);
int main()
{
list<Point> middles;
int N;scanf("%d", &N);
Point points[N];
Point points2[N];
for(int i=0;i<N;i++)
{ scanf("%d", &points[i].x);
scanf("%d", &points[i].y);
points2[i].x=points[i].x;
points2[i].y=points[i].y;
}
for(int i=0;i<N;i++)
{
for(int j=0;j<N;j++)
{
points2[j]=points[j];
}
for(int j=0;j<N;j++)
{
points2[j].distance=(points[i].x-points2[j].x)*(points[i].x- points2[j].x)+(points[i].y-points2[j].y)*(points[i].y-points2[j].y);
}
sort(points2,points2+N,&PointSort);
for(int j=0;j<N;j++)
{
int k=j+1;
while(points2[j].distance==points2[k].distance)
{
bool coli=colinear(points[i],points2[j],points2[k]);
if(coli){middles.push_back(points2[i]);}
k++;
}
}
}
middles.unique(same_point);
cout<<middles.size();
}
bool PointSort(Point a,Point b)
{
return a.distance<b.distance;
}
bool colinear(Point a,Point b,Point c)
{
return (a.x*(b.y-c.y)+b.x*(c.y-a.y)+c.x*(a.y-b.y))/2.0==0.0;
}
bool same_point (Point first, Point second)
{ return (first.x==second.x && first.y==second.y) ; }

You actually don't need to calculate distances to check whether a point is a midpoint. The coordinates of the midpoint M between A and B are M=(A+B)/2. Or, to keep everything in integers, A+B=2M. Here's a pseudocode solution for the problem:
for ( A=0; A<N-1; A++ ) {
for ( B=A+1; B<N; B++ ) {
M2 = A+B;
for ( C=0; C<N; C++ ) {
if ( C*2 == M2 ) {
// C is the midpoint of A and B
}
}
}
}
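The pseudocode above could be turned into compilable C++ roughly as follows. This is a sketch under a few assumptions that the problem statement may not share: a candidate must be a different input entry from both end points, each input entry is counted at most once, and duplicate coordinates are treated as separate entries. `Pt` and `count_midpoints` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical point type; long long keeps A + B and 2 * C from overflowing int.
struct Pt { long long x, y; };

// Counts the points that are the midpoint of some pair of other points.
// Direct O(N^3) translation of the pseudocode; each entry is counted at most once.
std::size_t count_midpoints(const std::vector<Pt>& pts) {
    const std::size_t n = pts.size();
    std::vector<bool> is_mid(n, false);
    for (std::size_t a = 0; a + 1 < n; ++a) {
        for (std::size_t b = a + 1; b < n; ++b) {
            const long long sx = pts[a].x + pts[b].x;  // 2 * M.x
            const long long sy = pts[a].y + pts[b].y;  // 2 * M.y
            for (std::size_t c = 0; c < n; ++c) {
                if (c == a || c == b) continue;  // must be one of the "other" points
                if (2 * pts[c].x == sx && 2 * pts[c].y == sy) is_mid[c] = true;
            }
        }
    }
    std::size_t count = 0;
    for (bool flag : is_mid)
        if (flag) ++count;
    return count;
}
```

Marking entries in a boolean array avoids having to deduplicate a list afterwards.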

I see the following potential problems with your code:
Your code computes the distance squared (not the distance as stated) between pairs of points. Since the calculation is being done using integer arithmetic, there's a chance of arithmetic overflow.
Your code removes all midpoints found with duplicated x and y coordinates. But, is this what the problem statement requires? If duplicate points actually appear in the input stream, and happen to be midpoints of some other points, should the second and all subsequent duplicates be ignored? Also, if a point is duplicated three (or more) times in the input stream, how many midpoints does that count as? You should carefully check the problem statement to see how duplicates in the input stream should be counted and follow the requirements precisely.
Your check for collinearity looks wrong. You appear to be trying to take a 2d cross of (points[i] - points2[j]) with (points[i] - points2[k]), but this is not the correct way to do it. Here is how to take a 2d cross:
int cross2d(Point a, Point mid, Point c)
{
// Take the 2d cross product (a - mid) X (c - mid).
// 2d cross = (u.x * v.y - u.y * v.x) where u = (a-mid) and v=(c - mid)
int cross = (a.x - mid.x) * (c.y - mid.y) - (a.y - mid.y) * (c.x - mid.x);
return cross;
}
bool collinear(Point a, Point mid, Point c)
{
// Check for the points being collinear (or degenerate, i.e. return true if a == mid or mid == c).
return cross2d(a, mid, c) == 0;
}
Again, integer overflow is a potential problem for point triplets with large coordinates that are nearly perpendicular. And if you were not trying to take a 2d cross, what were you trying to do?
You're trying to create an O(n-squared) algorithm by sorting the points by distance from some prospective midpoint. That's creditable, but since your code isn't working I would start by creating a naive O(n-cubed) algorithm that solves the problem straightforwardly. Then you can use that to unit-test your improved n-squared algorithm.
Adding some spacing into your mathematical expressions makes them easier to read.
So, to start you off, here's the naive n-cubed algorithm. Note that I am preserving duplicates in the input stream while avoiding double-counting of points that are midpoints of multiple pairs of points:
#include <iostream>
#include <cmath>
#include <algorithm>
#include <list>
#include <stdio.h>
using namespace std;
struct Point
{
int x;
int y;
int id;
};
bool is_middle(Point a, Point middle, Point c);
bool same_point_id(Point first, Point second);
int main()
{
list<Point> middles;
int N;
scanf("%d", &N);
// https://stackoverflow.com/questions/25437597/find-middle-pointscomputational-geometry-c
// This program should read for input an integer N then the x and y coordinates of the N points
// and return the number of points that are the middle points of any two other points in the set.
Point *points = new Point[N];
for(int i=0;i<N;i++)
{
scanf("%d", &points[i].x);
scanf("%d", &points[i].y);
points[i].id = i;
}
for(int i=0; i<N-2; i++)
{
for(int j=i+1; j<N-1; j++)
{
for(int k=j+1; k<N; k++)
{
// Check the problem requirement to determine how to count sets of three identical points in the input stream.
if (is_middle(points[i], points[j], points[k]))
middles.push_back(points[j]);
if (is_middle(points[j], points[k], points[i]))
middles.push_back(points[k]);
if (is_middle(points[k], points[i], points[j]))
middles.push_back(points[i]);
}
}
}
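// Prevent the same input point from being counted multiple times.
// list::unique only removes adjacent duplicates, so sort by id first.
middles.sort([](const Point& a, const Point& b) { return a.id < b.id; });
middles.unique(same_point_id);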
cout<<middles.size();
delete [] points;
}
bool is_middle(Point a, Point mid, Point c)
{
if (a.x - c.x != 2*(a.x - mid.x))
return false;
if (a.y - c.y != 2*(a.y - mid.y))
return false;
return true;
}
bool same_point_id(Point first, Point second)
{
return (first.id==second.id);
}
Update: If you do need an n-squared algorithm, then sorting potential endpoints by distance squared from the midpoint isn't a bad idea. If you want to avoid potential arithmetic overflows, you can calculate the distance squared in 64-bit long long ints:
long long distance_squared(Point a, Point b)
{
long long dx = ((long long)a.x - (long long)b.x);
long long dy = ((long long)a.y - (long long)b.y);
return dx*dx + dy*dy;
}
On most platforms these will have more bits than a regular int -- and certainly not fewer.

Explain working of compareX in qsort() library function

I was searching for closest-pair code and I found this, which uses the qsort() library function. I basically didn't understand how its compare parameter works. An explanation related to this particular code would be most appreciated. Thanks.
#include <iostream>
#include <float.h>
#include <stdlib.h>
#include <math.h>
using namespace std;
// A structure to represent a Point in 2D plane
struct Point
{
int x, y;
};
/* Following two functions are needed for library function qsort().
Refer: http://www.cplusplus.com/reference/clibrary/cstdlib/qsort/ */
// Needed to sort array of points according to X coordinate
int compareX(const void* a, const void* b)
{
Point *p1 = (Point *)a, *p2 = (Point *)b;
return (p1->x - p2->x);
}
// Needed to sort array of points according to Y coordinate
int compareY(const void* a, const void* b)
{
Point *p1 = (Point *)a, *p2 = (Point *)b;
return (p1->y - p2->y);
}
// A utility function to find the distance between two points
float dist(Point p1, Point p2)
{
return sqrt( (p1.x - p2.x)*(p1.x - p2.x) +
(p1.y - p2.y)*(p1.y - p2.y)
);
}
// A Brute Force method to return the smallest distance between two points
// in P[] of size n
float bruteForce(Point P[], int n)
{
float min = FLT_MAX;
for (int i = 0; i < n; ++i)
for (int j = i+1; j < n; ++j)
if (dist(P[i], P[j]) < min)
min = dist(P[i], P[j]);
return min;
}
// A utility function to find minimum of two float values
float min(float x, float y)
{
return (x < y)? x : y;
}
// A utility function to find the distance between the closest points of
// strip of given size. All points in strip[] are sorted according to
// y coordinate. They all have an upper bound on minimum distance as d.
// Note that this method seems to be a O(n^2) method, but it's a O(n)
// method as the inner loop runs at most 6 times
float stripClosest(Point strip[], int size, float d)
{
float min = d; // Initialize the minimum distance as d
// Pick all points one by one and try the next points till the difference
// between y coordinates is smaller than d.
// This is a proven fact that this loop runs at most 6 times
for (int i = 0; i < size; ++i)
for (int j = i+1; j < size && (strip[j].y - strip[i].y) < min; ++j)
if (dist(strip[i],strip[j]) < min)
min = dist(strip[i], strip[j]);
return min;
}
// A recursive function to find the smallest distance. The array Px contains
// all points sorted according to x coordinates and Py contains all points
// sorted according to y coordinates
float closestUtil(Point Px[], Point Py[], int n)
{
// If there are 2 or 3 points, then use brute force
if (n <= 3)
return bruteForce(Px, n);
// Find the middle point
int mid = n/2;
Point midPoint = Px[mid];
// Divide points in y sorted array around the vertical line.
// Assumption: All x coordinates are distinct.
Point Pyl[mid+1]; // y sorted points on left of vertical line
Point Pyr[n-mid-1]; // y sorted points on right of vertical line
int li = 0, ri = 0; // indexes of left and right subarrays
for (int i = 0; i < n; i++)
{
if (Py[i].x <= midPoint.x)
Pyl[li++] = Py[i];
else
Pyr[ri++] = Py[i];
}
// Consider the vertical line passing through the middle point
// calculate the smallest distance dl on left of middle point and
// dr on right side
float dl = closestUtil(Px, Pyl, mid);
float dr = closestUtil(Px + mid, Pyr, n-mid);
// Find the smaller of two distances
float d = min(dl, dr);
// Build an array strip[] that contains points close (closer than d)
// to the line passing through the middle point
Point strip[n];
int j = 0;
for (int i = 0; i < n; i++)
if (abs(Py[i].x - midPoint.x) < d)
strip[j] = Py[i], j++;
// Find the closest points in strip. Return the minimum of d and closest
// distance is strip[]
return min(d, stripClosest(strip, j, d) );
}
// The main function that finds the smallest distance
// This method mainly uses closestUtil()
float closest(Point P[], int n)
{
Point Px[n];
Point Py[n];
for (int i = 0; i < n; i++)
{
Px[i] = P[i];
Py[i] = P[i];
}
qsort(Px, n, sizeof(Point), compareX);
qsort(Py, n, sizeof(Point), compareY);
// Use recursive function closestUtil() to find the smallest distance
return closestUtil(Px, Py, n);
}
// Driver program to test above functions
int main()
{
Point P[] = {{2, 3}, {12, 30}, {40, 50}, {5, 1}, {12, 10}, {3, 4}};
int n = sizeof(P) / sizeof(P[0]);
cout << "The smallest distance is " << closest(P, n);
return 0;
}

The last parameter of qsort is a pointer to a function with a specific signature: it must take two const void* pointers and return an int that indicates which of the two passed items is smaller, or whether the two items are equal. The specifics are in the qsort documentation, but in general a positive result indicates that the second item is smaller, a negative result indicates that the first item is smaller, and zero indicates equality.
The implementation of compareX
int compareX(const void* a, const void* b)
{
Point *p1 = (Point *)a, *p2 = (Point *)b;
return (p1->x - p2->x);
}
follows the general pattern for comparison functions. First, it converts the void* pointer to the Point type, because it "knows" that it is used together with an array of Point structures. Then it subtracts the x coordinates of the two points:
p1->x - p2->x
Note that the result of the subtraction is going to be positive if the second point's x is smaller, negative when the second point's x is greater, and zero when the two xs are the same. This is precisely what qsort wants the cmp function to do, so the subtraction operation fulfills the contract of the comparison function.
