How to generate the best curved fit for an unknown set of 2D points in C++

I am trying to get the best fit for an unknown set of 2D points. The points are the center points of rivers, and they don't come in any particular order.
I've tried polynomial regression, but I don't know what the best polynomial order is for different sets of data.
I've also tried cubic splines, but I don't want a curve that passes through every point I have; I want an approximation of the best-fit line through the points.
I would like to get something like this even for lines that have more curves. This example is computed with polynomial regression, and it works fine.
Is there a way to do some smooth or regression algorithm that can get the best fit line even for a set of points like the following?
PolynomialRegression<double> pol;
static int polynomOrder = <whateverPolynomOrderFitsBetter>;
double error = 0.005f;
std::vector<double> coeffs;
pol.fitIt(x, y, polynomOrder, coeffs);

// get fitted values
for (std::size_t i = 0; i < points.size(); i++)
{
    int order = polynomOrder;
    long double yFitted = 0;
    while (order >= 0)
    {
        yFitted += (coeffs[order] * pow(points[i].x, order) + error);
        order--;
    }
    points[i].y = yFitted;
}
In my implementation with a polynomial order of 35, this is all I can get, and raising the polynomial order further turns the coefficients into NaN values.
I'm not sure if this is the best approach I can have.


OpenCV least square (solve) solution accuracy

I am using the OpenCV method solve (https://docs.opencv.org/2.4/modules/core/doc/operations_on_arrays.html#solve) in C++ to fit a curve (degree 3, ax^3 + bx^2 + cx + d) through a set of points. I am solving A * x = B, where A contains the powers of the points' x-coordinates (x^3, x^2, x^1, 1), B contains the y-coordinates of the points, and x (the matrix) contains the parameters a, b, c and d.
I am using the flag DECOMP_QR on cv::solve to fit the curve.
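For reference, here is a minimal sketch of the setup described above, assuming the points are available as a std::vector<cv::Point2f> (the function name and layout are illustrative):
#include <opencv2/core/core.hpp>
#include <vector>

// Fit y = a*x^3 + b*x^2 + c*x + d by least squares: each row of A holds
// (x^3, x^2, x, 1) for one point, and B holds that point's y-coordinate.
cv::Mat fitCubic(const std::vector<cv::Point2f>& points)
{
    const int n = static_cast<int>(points.size());
    cv::Mat A(n, 4, CV_64F);
    cv::Mat B(n, 1, CV_64F);
    for (int i = 0; i < n; ++i) {
        const double x = points[i].x;
        A.at<double>(i, 0) = x * x * x;
        A.at<double>(i, 1) = x * x;
        A.at<double>(i, 2) = x;
        A.at<double>(i, 3) = 1.0;
        B.at<double>(i, 0) = points[i].y;
    }
    cv::Mat coeffs;                         // will receive (a, b, c, d)
    cv::solve(A, B, coeffs, cv::DECOMP_QR);
    return coeffs;
}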
The problem I am facing is that the set of points does not necessarily follow a mathematical function (e.g. the function changes its equation, see picture). So, in order to fit an accurate curve, I need to split the set of points where the curvature changes. In the case of the picture below, I would split the regression at the index where the curve starts. So I need to detect where the curvature changes.
So, if I don't split, I'll get the yellow curve as a result, which is inaccurate. What I want is the blue curve.
Finding curvature changes:
To find out where the curvature changes, I want to use the solution accuracy.
So basically:
int splitIndex = 0;
for (int pointIndex = 0; pointIndex < numberOfPoints; pointIndex += 5) {
    cv::Range rowR = Range(0, pointIndex); //Selected rows to index
    cv::Range colR = Range(0, 3);          //Grade: 3 (x^3)
    cv::Mat x;
    bool res = cv::solve(A(rowR, colR), B(rowR, Range(0, 1)), x, DECOMP_QR);
    if (res == true) {
        //Check for accuracy
        if (accuracy too bad) {
            splitIndex = pointIndex;
            return splitIndex;
        }
    }
}
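As a side note, cv::solve only reports whether it found a solution; one way the placeholder accuracy check above could be implemented (a sketch, with an assumed helper name) is to measure the RMS residual of the solved system:
#include <opencv2/core/core.hpp>
#include <cmath>

// Root-mean-square residual of A * x - B over the rows that were fitted.
double fitResidualRMS(const cv::Mat& A, const cv::Mat& B, const cv::Mat& x)
{
    cv::Mat r = A * x - B;                   // residual vector
    return std::sqrt(cv::mean(r.mul(r))[0]); // mean of squared residuals, then sqrt
}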
My questions are:
- is there a way to get the accuracy / standard deviation from the solve command (efficiently and fast, since this is a real-time application with only around 1 ms of compute time left)?
- is this a good way of finding the curvature change, or does anyone know a better way?
Thanks :)

Subsampling an array of numbers

I have a series of 100 integer values which I need to reduce/subsample to 77 values for the purpose of fitting into a predefined space on screen. This gives a fraction of 77/100 values-per-pixel - not very neat.
Assuming the 77 is fixed and cannot be changed, what are some typical techniques for subsampling 100 numbers down to 77? I get a sense that it will be a jagged mapping, by which I mean the first new value is the average of [0, 1], then the next value is [3], then the average of [4, 5], etc. But how do I approach getting the pattern for this mapping?
I am working in C++, although I'm more interested in the technique than implementation.
Thanks in advance.
Whether you downsample or oversample, you are trying to reconstruct the signal at points that were not sampled, so you have to make some assumptions.
The sampling theorem tells you that if you sample a signal knowing that it has no frequency components above half the sampling frequency, you can continuously and completely recover the signal over the whole time period. There is a way to reconstruct the signal using sinc() functions (that is, sin(x)/x).
sinc() (more precisely sin(M_PI*x/Sampling_period)/(M_PI*x/Sampling_period)) is a function that has the following properties:
Its value is 1 for x == 0.0 and 0 for x == k*Sampling_period with k == +-1, +-2, ...
It has no frequency components above half of the sampling frequency derived from Sampling_period.
So if you consider the sum over all k in your sample of the functions F_k(x) = Y[k]*sinc(x/Sampling_period - k), each of which equals the sample value at position k and 0 at every other sample position, you get the best continuous function that has no components at frequencies above half the sampling frequency and that takes the same values as your sample set.
Having said that, you can resample this function at whatever positions you like, getting the best way to resample your data.
This is, by far, a complicated way of resampling data (it also has the problem of not being causal, so it cannot be implemented in real time), and several simplified interpolation methods have been used in the past. You have to construct the sinc function for each sample point, add them all together, and then resample the resulting function at the new sampling points and return that as the result.
Next is an example of the interpolation method just described. It accepts some input data (in_sz samples) and outputs data interpolated with that method. (I assumed the first and last samples coincide, which leads to the somewhat intricate (in_sz - 1)/(out_sz - 1) calculation in the code; change it to in_sz/out_sz if you want a plain N samples -> M samples conversion.)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* normalized sinc function */
double sinc(double x)
{
    x *= M_PI;
    if (x == 0.0) return 1.0;
    return sin(x)/x;
} /* sinc */

/* interpolate a function made of in samples at point x */
double sinc_approx(double in[], size_t in_sz, double x)
{
    int i;
    double res = 0.0;
    for (i = 0; i < in_sz; i++)
        res += in[i] * sinc(x - i);
    return res;
} /* sinc_approx */

/* do the actual resampling. Change (in_sz - 1)/(out_sz - 1) if you
 * don't want the initial and final samples coincide, as is done here.
 */
void resample_sinc(
    double in[],
    size_t in_sz,
    double out[],
    size_t out_sz)
{
    int i;
    double dx = (double) (in_sz-1) / (out_sz-1);
    for (i = 0; i < out_sz; i++)
        out[i] = sinc_approx(in, in_sz, i*dx);
}

/* test case */
int main()
{
    double in[] = {
        0.0, 1.0, 0.5, 0.2, 0.1, 0.0,
    };
    const size_t in_sz = sizeof in / sizeof in[0];
    const size_t out_sz = 5;
    double out[out_sz];
    int i;
    for (i = 0; i < in_sz; i++)
        printf("in[%d] = %.6f\n", i, in[i]);
    resample_sinc(in, in_sz, out, out_sz);
    for (i = 0; i < out_sz; i++)
        printf("out[%.6f] = %.6f\n", (double) i * (in_sz-1)/(out_sz-1), out[i]);
    return EXIT_SUCCESS;
} /* main */
There are different ways of interpolating (see Wikipedia).
The linear one would be something like:
std::array<int, 77> sampling(const std::array<int, 100>& a)
{
    std::array<int, 77> res;
    for (int i = 0; i != 76; ++i) {
        int index = i * 99 / 76;
        int p = i * 99 % 76;
        res[i] = ((p * a[index + 1]) + ((76 - p) * a[index])) / 76;
    }
    res[76] = a[99]; // done outside of loop to avoid out of bound access (0 * a[100])
    return res;
}
Create 77 new pixels based on the weighted average of their positions.
As a toy example, think about the 3 pixel case which you want to subsample to 2.
Original (denote as multidimensional array original with RGB as [0, 1, 2]):
|----|----|----|
Subsample (denote as multidimensional array subsample with RGB as [0, 1, 2]):
|------|------|
Here, it is intuitive to see that the first subsample seems like 2/3 of the first original pixel and 1/3 of the next.
For the first subsample pixel, subsample[0], you make it the RGB average of the original pixels that overlap it, in this case original[0] and original[1], but in a weighted fashion.
subsample[0][0] = original[0][0] * 2/3 + original[1][0] * 1/3 # for red
subsample[0][1] = original[0][1] * 2/3 + original[1][1] * 1/3 # for green
subsample[0][2] = original[0][2] * 2/3 + original[1][2] * 1/3 # for blue
In this example, original[1][2] is the blue component of the second original pixel.
Keep in mind that for a different subsampling ratio you'll have to determine the set of original cells that contribute to each subsample cell, and then normalize to find the relative weight of each.
There are much more complex graphics techniques, but this one is simple and works.
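Below is a minimal sketch of that weighting in C++ for a one-dimensional series of values (names are illustrative); each output cell averages the input cells it overlaps, weighted by the amount of overlap:
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<double> downsample(const std::vector<double>& in, std::size_t outSize)
{
    std::vector<double> out(outSize, 0.0);
    const double ratio = static_cast<double>(in.size()) / outSize; // input cells per output cell
    for (std::size_t o = 0; o < outSize; ++o) {
        const double start = o * ratio, end = (o + 1) * ratio;
        double acc = 0.0;
        for (std::size_t i = static_cast<std::size_t>(start); i < in.size() && i < end; ++i) {
            // fraction of input cell [i, i+1) covered by output cell [start, end)
            const double overlap = std::min(end, i + 1.0) - std::max(start, static_cast<double>(i));
            acc += in[i] * overlap;
        }
        out[o] = acc / ratio; // normalize by the total weight
    }
    return out;
}
For the 3-pixel to 2-pixel example above this reproduces the 2/3 and 1/3 weights; for RGB you would apply the same weights to each channel separately.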
Everything depends on what you wish to do with the data - how do you want to visualize it.
A very simple approach would be to render to a 100-wide image, and then smooth scale the image down to a narrower size. Whatever graphics/development framework you're using will surely support such an operation.
Say, though, that your goal might be to retain certain qualities of the data, such as minima and maxima. In that case, for each bin you could draw a line of a darker color up to the minimum value and then continue with a lighter color up to the maximum; or, instead of putting a single pixel at the average value, draw a line from the minimum to the maximum (a sketch of this min/max binning follows at the end of this answer).
Finally, you might wish to render as if you had 77 values only - then the goal is to somehow transform the 100 values down to 77. This will imply some kind of an interpolation. Linear or quadratic interpolation is easy, but adds distortions to the signal. Ideally, you'd probably want to throw a sinc interpolator at the problem. A good list of them can be found here. For theoretical background, look here.
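Here is a sketch of the min/max binning mentioned above, using the sizes from the question (the function name is illustrative): each of the 77 bins records the minimum and maximum of the input values that fall into it, so the extremes survive the reduction.
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>

std::array<std::pair<int, int>, 77> minMaxBins(const std::array<int, 100>& in)
{
    std::array<std::pair<int, int>, 77> bins;
    for (std::size_t b = 0; b < bins.size(); ++b) {
        const std::size_t first = b * in.size() / bins.size();
        const std::size_t last = (b + 1) * in.size() / bins.size(); // exclusive
        int lo = in[first], hi = in[first];
        for (std::size_t i = first; i < last; ++i) {
            lo = std::min(lo, in[i]);
            hi = std::max(hi, in[i]);
        }
        bins[b] = { lo, hi };
    }
    return bins;
}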

Total Least Squares algorithm in C/C++

Given a set of points P, I need to find a line L that best approximates these points. I have tried to use the function gsl_fit_linear from the GNU Scientific Library. However, my data set often contains points whose line of best fit has undefined slope (x = c), so gsl_fit_linear returns NaN. It is my understanding that it is best to use total least squares for this sort of thing because it is fast and robust and gives the equation in terms of r and theta (so x = c can still be represented). I can't seem to find any C/C++ code out there for this problem. Does anyone know of a library or something I can use? I've read a few research papers on this, but the topic is still a little fuzzy, so I don't feel confident implementing my own.
Update:
I made a first attempt at programming my own with Armadillo, using the code given on this Wikipedia page. Alas, I have so far been unsuccessful.
This is what I have so far:
void pointsToLine(vector<Point> P)
{
    Row<double> x(P.size());
    Row<double> y(P.size());
    for (int i = 0; i < P.size(); i++)
    {
        x << P[i].x;
        y << P[i].y;
    }
    int m = P.size();
    int n = x.n_cols;
    mat Z = join_rows(x, y);
    mat U;
    vec s;
    mat V;
    svd(U, s, V, Z);
    mat VXY = V(span(0, (n-1)), span(n, (V.n_cols-1)));
    mat VYY = V(span(n, (V.n_rows-1)), span(n, (V.n_cols-1)));
    mat B = (-1*VXY) / VYY;
    cout << B << endl;
}
The output of B is always 0.5504, even when my data set changes. I also thought that the output should be two values, so I'm definitely doing something very wrong.
Thanks!
To find the line that minimises the sum of the squares of the (orthogonal) distances from the line, you can proceed as follows:
The line is the set of points p + r*t, where p and t are vectors to be found and r varies along the line. We restrict t to unit length. While there is another, simpler description in two dimensions, this one works in any dimension.
The steps are
1/ compute the mean p of the points
2/ accumulate the covariance matrix C
C = Sum{ i | (q[i]-p)*(q[i]-p)' } / N
(where you have N points and ' denotes transpose)
3/ diagonalise C and take as t the eigenvector corresponding to the largest eigenvalue.
All this can be justified starting from the squared (orthogonal) distance of a point q from a line represented as above, which is
d2(q) = (q-p)'*(q-p) - ((q-p)'*t)^2
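A minimal sketch of those three steps using Armadillo (the library used in the question), assuming the points are stored one per row of an N x 2 matrix (names are illustrative):
#include <armadillo>

// Total least squares line p + r*t: p is the centroid, t the eigenvector
// of the covariance matrix with the largest eigenvalue.
void fitLineTLS(const arma::mat& pts, arma::rowvec& p, arma::rowvec& t)
{
    p = arma::mean(pts, 0);                                      // 1/ centroid
    arma::mat centered = pts.each_row() - p;                     // q[i] - p
    arma::mat C = centered.t() * centered / double(pts.n_rows);  // 2/ covariance matrix
    arma::vec eigval;
    arma::mat eigvec;
    arma::eig_sym(eigval, eigvec, C);                            // eigenvalues in ascending order
    t = eigvec.col(eigvec.n_cols - 1).t();                       // 3/ eigenvector of the largest eigenvalue
}
Every point on the fitted line can then be written as p + r*t, so vertical lines (x = c) pose no problem.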

C++ - Efficient way to compare vectors

At the moment I'm working with a camera to detect a marker. I use OpenCV and the ArUco library.
I'm stuck on a problem right now: I need to detect whether the distance between 2 markers is less than a specific value. I have a function to calculate the distance and I can compare everything, but I'm looking for the most efficient way to keep track of all the markers (around 5 or 6) and how close they are together.
There is a list with markers, but I can't find an efficient way to compare all of them.
I have a
Vector <Marker>
I also have a function called getDistance.
double getDistance(cv::Point2f punt1, cv::Point2f punt2)
{
    float xd = punt2.x - punt1.x;
    float yd = punt2.y - punt1.y;
    double Distance = sqrtf(xd*xd + yd*yd);
    return Distance;
}
The markers contain a Point2f, so I can compare them easily.
One way to increase performance is to keep all the distances squared and avoid using the square root function. If you square the specific value you are checking against then this should work fine.
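For example (a sketch, assuming the marker positions are cv::Point2f as above; the function name is illustrative):
// Compare the squared distance against the squared threshold; no sqrt needed.
bool closerThan(const cv::Point2f& a, const cv::Point2f& b, float threshold)
{
    const float dx = b.x - a.x;
    const float dy = b.y - a.y;
    return dx * dx + dy * dy < threshold * threshold;
}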
There isn't really a lot to recommend. If I understand the question and I'm counting the pairs correctly, you'll need to calculate 10 distances when you have 5 points, and 15 distances when you have 6 points. If you need to determine all of the distances, then you have no choice but to calculate all of the distances. I don't see any way around that. The only advice I can give is to make sure you calculate the distance between each pair only once (e.g., once you know the distance between points A and B, you don't need to calculate the distance between B and A).
It might be possible to sort the vector in such a way that you can short circuit your loop. For instance, if you sort it correctly and the distance between point A and point B is larger than your threshold, then the distances between A and C and A and D will also be larger than the threshold. But keep in mind that sorting isn't free, and it's likely that for small sets of points it would be faster to just calculate all distances ("Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy. ... For example, binary trees are always faster than splay trees for workaday problems.").
Newer versions of the C and C++ standard library have a hypot function for calculating distance between points:
#include <cmath>

double getDistance(cv::Point2f punt1, cv::Point2f punt2)
{
    return std::hypot(punt2.x - punt1.x, punt2.y - punt1.y);
}
It's not necessarily faster, but it should be implemented in a way that avoids overflow when the points are far apart.
One minor optimization is to simply check if the change in X or change in Y exceeds the threshold. If it does, you can ignore the distance between those two points because the overall distance will also exceed the threshold:
const double threshold = ...;
std::vector<cv::Point2f> points;
// populate points
...
for (auto i = points.begin(); i != points.end(); ++i) {
    for (auto j = i + 1; j != points.end(); ++j) {
        double dx = std::abs(i->x - j->x), dy = std::abs(i->y - j->y);
        if (dx > threshold || dy > threshold) {
            continue;
        }
        double distance = std::hypot(dx, dy);
        if (distance > threshold) {
            continue;
        }
        ...
    }
}
If you're dealing with large amounts of data inside your vector, you may want to consider some multithreading using std::future.
The Vector<Marker> could be split into chunks that are computed asynchronously and stored inside std::future<> objects; combining this with @Sesame's suggestion will also increase your speed.
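A rough sketch of that idea (the Marker layout, names and chunk count are assumptions); each std::async task handles a slice of the outer loop and the partial results are collected from the futures:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <future>
#include <vector>

struct Marker { float x, y; };

std::size_t countClosePairs(const std::vector<Marker>& markers,
                            double threshold, std::size_t chunks = 4)
{
    // One task checks all pairs whose first index lies in [lo, hi).
    auto worker = [&](std::size_t lo, std::size_t hi) {
        std::size_t count = 0;
        for (std::size_t i = lo; i < hi; ++i)
            for (std::size_t j = i + 1; j < markers.size(); ++j)
                if (std::hypot(markers[i].x - markers[j].x,
                               markers[i].y - markers[j].y) < threshold)
                    ++count;
        return count;
    };

    std::vector<std::future<std::size_t>> futures;
    const std::size_t step = (markers.size() + chunks - 1) / chunks;
    for (std::size_t lo = 0; lo < markers.size(); lo += step)
        futures.push_back(std::async(std::launch::async, worker, lo,
                                     std::min(lo + step, markers.size())));

    std::size_t total = 0;
    for (auto& f : futures)
        total += f.get();
    return total;
}
For only five or six markers, though, the plain double loop shown earlier will almost certainly be faster than spawning threads.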

How to optimize interpolation of large set of scattered points?

I am currently working with a set of coordinate points (longitude, latitude; about 60000 of them) and the temperature at each location. I need to interpolate them to compute the values at some points with unknown temperature in order to map certain regions.
To respect the influence the points have on one another, I have converted every (long, lat) point to a point (x, y, z) on the unit sphere.
I started by applying the generalized multidimensional Shepard interpolation from Numerical Recipes, 3rd Edition:
Doub interp(VecDoub_I &pt)
{
    Doub r, w, sum=0., sumw=0.;
    if (pt.size() != dim)
        throw("RBF_interp bad pt size");
    for (Int i=0; i<n; i++)
    {
        if ((r=rad(&pt[0], &pts[i][0])) == 0.)
            return vals[i];
        sum += (w = pow(r, pneg));
        sumw += w*vals[i];
    }
    return sumw/sum;
}

Doub rad(const Doub *p1, const Doub *p2)
{
    Doub sum = 0.;
    for (Int i=0; i<dim; i++)
        sum += SQR(p1[i]-p2[i]);
    return sqrt(sum);
}
As you can see, to interpolate one point the algorithm computes the distance from that point to each of the other points and uses it as a weight in the final value.
Even though this algorithm works, it is much too slow for what I need, since I will be computing a lot of points to map a grid over a certain region.
One way of optimizing this would be to leave out the points beyond a certain radius, but that would pose a problem for areas with few or no points.
Another option would be to avoid recomputing the distance between each pair of points by computing a look-up table once and storing the distances. The problem with this is that it is impossible to store such a large matrix (60000 x 60000).
The grid of temperatures obtained will be used to compute contours for different temperature values.
If anyone knows a way to optimize this algorithm, or can suggest a better one, I would appreciate it.
Radial basis functions with infinite support are probably not what you want to be using if you have a large number of data points and will be taking a large number of interpolation values.
There are variants that use the N nearest neighbours and finite support to reduce the number of points that must be considered for each interpolated value. A variant of this can be found in the first solution mentioned here: Inverse Distance Weighted (IDW) Interpolation with Python (though I have a nagging suspicion that that implementation can be discontinuous under certain conditions; there are certainly variants that are fine).
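A minimal sketch of that kind of variant (names are illustrative; a production version would use a spatial index such as a k-d tree to find the neighbours instead of scanning all samples), where only the k nearest samples contribute to each interpolated value:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Sample { double x, y, z, value; };

// Inverse-distance weighting over the k nearest samples only.
double idwNearest(const std::vector<Sample>& samples,
                  double qx, double qy, double qz,
                  std::size_t k, double power = 2.0)
{
    std::vector<std::pair<double, double>> d2val; // (squared distance, value)
    d2val.reserve(samples.size());
    for (const Sample& s : samples) {
        const double dx = s.x - qx, dy = s.y - qy, dz = s.z - qz;
        d2val.emplace_back(dx * dx + dy * dy + dz * dz, s.value);
    }
    k = std::min(k, d2val.size());
    // Move the k closest samples to the front in O(n) on average.
    std::nth_element(d2val.begin(), d2val.begin() + k, d2val.end());

    double sum = 0.0, sumw = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        const double r = std::sqrt(d2val[i].first);
        if (r == 0.0) return d2val[i].second;   // query coincides with a sample
        const double w = std::pow(r, -power);
        sum += w;
        sumw += w * d2val[i].second;
    }
    return sumw / sum;
}
The distances are still computed to every sample here (that scan is what a spatial index would remove), but the pow() calls and the weighted sum are restricted to the k nearest points.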
Your look-up table doesn't have to store every point in the 60k set, only those which are used repeatedly. You can map any coordinate x to int(x*resolution) to improve the hit rate by lowering the resolution.
A similar look-up table for the power function might also help.