Comparison approximation in C++

Is there a way to change the precision of the predefined <= and < operators when comparing std::vector<double> vectors lexicographically?
I am comparing std::vector<double> vectors lexicographically in many places in my code. The scale of the first component (close to 0) is different from the scale of the other components (between -700 and 700). I want the precision to be 1e-6, i.e. for two components a and b, if abs(a-b) <= 1e-6 then we consider a equal to b, where a and b are double.
Since I use <= and < in many places in the code, defining a new function that replaces <= and < when comparing vectors is risky (I might miss some comparisons), so I am wondering if it is possible to change the precision of <= and < so that the change applies to all the comparisons directly.
An example of the vectors I have: A=(-2.6666666666666936, 33497.435897435964, -300.51282051282101), B=(-2.6666666666666914, 17403.589743589808, -251.28205128205173). Using <=, the result is A<=B because of the first component, but in my case the first components are equal (with epsilon=1e-6), so A>B.

There is no good way to change the precision of the operators.
I'd suggest you write your own function that iterates over the two vectors and does the comparisons directly. Something like:
#include <cassert>
#include <cmath>
#include <vector>

bool approxLessThan(const std::vector<double>& a,
                    const std::vector<double>& b,
                    double tolerance) {
    // feel free to handle differently sized vectors however you need
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); i++) {
        const double dif = a[i] - b[i];
        if (std::abs(dif) > tolerance)
            return dif < 0.0; // change this to <= as needed
    }
    return false; // The vectors are "equal"
}
You can expand this to handle vectors of different sizes if needed.
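For instance, with the vectors from the question (a minimal usage sketch; approxLessThan is the function above), the comparison now matches the asker's expectation that A > B:
#include <iostream>

int main() {
    std::vector<double> A{-2.6666666666666936, 33497.435897435964, -300.51282051282101};
    std::vector<double> B{-2.6666666666666914, 17403.589743589808, -251.28205128205173};

    // The first components differ by ~2e-15 (< 1e-6), so they count as equal;
    // the second components then decide the ordering.
    std::cout << approxLessThan(A, B, 1e-6) << '\n'; // 0 (false): A is not less than B
    std::cout << approxLessThan(B, A, 1e-6) << '\n'; // 1 (true):  B < A, i.e. A > B
}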


What is the most efficient way to repeat elements in a vector and apply a set of different functions across all elements using Eigen?

Say I have a vector containing only positive, real elements defined like this:
Eigen::VectorXd v(5);
v << 1.3876, 8.6983, 5.438, 3.9865, 4.5673;
I want to generate a new vector v2 that has repeated the elements in v some k times. Then I want to apply k different functions to each of the repeated elements in the vector.
For example, if v2 was v repeated 2 times and I applied floor() and ceil() as my two functions, the result based on the above vector would be a column vector with values: [1; 2; 8; 9; 5; 6; 3; 4; 4; 5]. Preserving the order of the original values is important here as well. These values are a simplified example; in practice, I'm generating vectors v with ~100,000 or more elements and would like to make my code as vectorizable as possible.
Since I'm coming to Eigen and C++ from Matlab, the simplest approach I first took was to just convert this Nx1 vector into an Nx2 matrix, apply floor to the first column and ceil to the second column, take the transpose to get a 2xN matrix and then exploit the column-major nature of the matrix and reshape the 2xN matrix into a 2Nx1 vector, yielding the result I want. However, for large vectors, this would be very slow and inefficient.
This response by ggael effectively addresses how I could repeat the elements in the input vector by generating a sequence of indices and indexing the input vector. I could then generate more sequences of indices to apply my functions to the relevant elements of v2 and copy the results back to their respective places. However, is this really the most efficient approach? I don't fully grasp copy-on-write and move semantics, but I think the second set of indexing expressions would be, in a sense, redundant?
If that is true, then my guess is that a solution here would be some sort of nullary or unary expression where I could define an expression that accepts the vector, some index k and k expressions/functions to apply to each element and spits out the vector I'm looking for. I've read the Eigen documentation on the subject, but I'm struggling to build a functional example. Any help would be appreciated!
So, if I understand you correctly, you don't want to replicate (in terms of Eigen methods) the vector, you want to apply different methods to the same elements and store the result for each, correct?
In this case, computing it sequentially once per function is the easiest route. Most CPUs can only do one (vector) memory store per clock cycle, anyway. So for simple unary or binary operations, your gains have an upper bound.
Still, you are correct that one load is technically always better than two and it is a limitation of Eigen that there is no good way of achieving this.
Know that even if you manually write a loop that would generate multiple outputs, you should limit yourself in the number of outputs. CPUs have a limited number of line-fill buffers. IIRC Intel recommended using less than 10 "output streams" in tight loops, otherwise you could stall the CPU on those.
Another aspect is that C++'s weak aliasing restrictions make it hard for compilers to vectorize code with multiple outputs. So it might even be detrimental.
How I would structure this code
Remember that Eigen is column-major, just like Matlab. Therefore use one column per output function. Or just use separate vectors to begin with.
Eigen::VectorXd v = ...;
Eigen::MatrixX2d out(v.size(), 2);
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
Following the KISS principle, this is good enough. You will not gain much if anything by doing something more complicated. A bit of multithreading might gain you something (less than factor 2 I would guess) because a single CPU thread is not enough to max out memory bandwidth but that's about it.
Some benchmarking
This is my baseline:
#include <Eigen/Dense>

int main()
{
    int rows = 100013, repetitions = 100000;
    Eigen::VectorXd v = Eigen::VectorXd::Random(rows);
    Eigen::MatrixX2d out(rows, 2);
    for(int i = 0; i < repetitions; ++i) {
        out.col(0) = v.array().floor();
        out.col(1) = v.array().ceil();
    }
}
Compiled with gcc-11, -O3 -mavx2 -fno-math-errno I get ca. 5.7 seconds.
Inspecting the assembler code finds good vectorization.
Plain old C++ version:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
for(std::ptrdiff_t j = 0; j < rows; ++j) {
    const double vj = inarr[j];
    outfloor[j] = std::floor(vj);
    outceil[j] = std::ceil(vj);
}
40 seconds instead of 5! This version actually does not vectorize because the compiler cannot prove that the arrays don't alias each other.
Next, let's use fixed size Eigen vectors to get the compiler to generate vectorized code:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
std::ptrdiff_t j;
for(j = 0; j + 4 <= rows; j += 4) {
    const Eigen::Vector4d vj = Eigen::Vector4d::Map(inarr + j);
    const auto floorval = vj.array().floor();
    const auto ceilval = vj.array().ceil();
    Eigen::Vector4d::Map(outfloor + j) = floorval;
    Eigen::Vector4d::Map(outceil + j) = ceilval;
}
if(j + 2 <= rows) {
    const Eigen::Vector2d vj = Eigen::Vector2d::MapAligned(inarr + j);
    const auto floorval = vj.array().floor();
    const auto ceilval = vj.array().ceil();
    Eigen::Vector2d::Map(outfloor + j) = floorval;
    Eigen::Vector2d::Map(outceil + j) = ceilval;
    j += 2;
}
if(j < rows) {
    const double vj = inarr[j];
    outfloor[j] = std::floor(vj);
    outceil[j] = std::ceil(vj);
}
7.5 seconds. The assembler looks fine, fully vectorized. I'm not sure why performance is lower. Maybe cache line aliasing?
Last attempt: We don't try to avoid re-reading the vector but we re-read it blockwise so that it will be in cache by the time we read it a second time.
const int blocksize = 64 * 1024 / sizeof(double);
std::ptrdiff_t j;
for(j = 0; j + blocksize <= rows; j += blocksize) {
    const auto& vj = v.segment(j, blocksize);
    auto outj = out.middleRows(j, blocksize);
    outj.col(0) = vj.array().floor();
    outj.col(1) = vj.array().ceil();
}
const auto& vj = v.tail(rows - j);
auto outj = out.bottomRows(rows - j);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
5.4 seconds. So there is some gain here but not nearly enough to justify the added complexity.

Why isn't the read-only [] operator used?

I'm currently writing a Polynomial class in C++, which should represent a polynomial of the following form:
p(x) = a_0 + a_1*x^1 + a_2*x^2 + ... + a_i*x^i
where a_0, ..., a_i are all ints.
The class internally uses a member variable a_ of type std::vector<int> to store the constant factors a_0, ..., a_i. To access the constant factors, operator[] is overloaded in the following way:
Read and write:
int &operator[](int i)
{
    return a_.at(i);
}
This will fail when trying to change one of the factors a_i with:
i > degree of polynomial = a_.size() - 1
Read-only:
int operator[](int i) const
{
    if (i > this->degree()) {
        return 0;
    }
    return a_.at(i);
}
The slightly different implementation allows rather comfortable looping over the factors of two differently sized polynomials (without worrying about the degree of each polynomial).
Sadly I seem to be missing something here, since the operator* overloading (which makes use of this comfortable read-only operator[]) fails.
operator* overloading:
Polynomial operator*(const Polynomial &other) {
    Polynomial res(this->degree() + other.degree());
    for (int i = 0; i <= res.degree(); ++i) {
        for (int k = 0; k <= i; ++k) {
            res[i] += (*this)[k] * other[i-k];
        }
    }
    return res;
}
Don't mind the math involved. The important point is that i is always in the range
0 <= i < res.a_.size()
and thus writing to res[i] is valid. However, (*this)[k] and other[i-k] try to read from indices which don't necessarily lie in the range [0, (*this).a_.size() - 1].
This should be fine with our read-only implementation of operator[], right? I still get an error trying to access a_ at invalid indices. What could cause the compiler to use the read-write implementation in the line
res[i] += (*this)[k] * other[i-k];
especially for the part on the right-hand side of the assignment?
I'm certain the error is caused by the "wrong" use of the read-and-write operator[], because adding an additional check fixes the invalid access:
if (k <= this->degree() && i-k <= other.degree()) {
    res[i] += (*this)[k] * other[i-k];
}
What am I missing with the use of the operator[]-overloading? Why isn't the read-only-operator[] used here?
(*this)[k] uses the non-const this because the function containing it (your operator*) is not const.
Hence the non-const overload of [] is preferred by the compiler.
You could get around this using an ugly const_cast, but really you ought to keep the behaviour of the two versions of the [] operator as similar as possible. Besides, std::vector's operator[] doesn't insist on the index being bounds-checked, as opposed to at(), which must be. Your code deviates from this and could therefore confuse readers of your code.
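One further route, not mentioned above but following directly from the explanation: if the multiplication operator itself is declared const, then *this is const inside it and (*this)[k] resolves to the read-only overload. A sketch, reusing the asker's class layout:
// Marking the member function const makes *this const inside it,
// so (*this)[k] now resolves to the bounds-tolerant const operator[].
// res is a non-const local, so res[i] still uses the read-write overload.
Polynomial operator*(const Polynomial &other) const {
    Polynomial res(this->degree() + other.degree());
    for (int i = 0; i <= res.degree(); ++i)
        for (int k = 0; k <= i; ++k)
            res[i] += (*this)[k] * other[i - k];
    return res;
}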

Number of parallelograms on a NxM grid

I have to solve a problem where, given a grid of size N x M, I have to find the number of parallelograms that "can be put in it", such that every coordinate is an integer.
Here is my code:
/*
    ~Keep It Simple!~
*/
#include <cstdio>

#define MaxN 2005

int N, M;
long long Paras[MaxN][MaxN]; // Number of parallelograms of Height i and Width j
long long Rects;             // Final Number of Parallelograms

int cmmdc(int a, int b)
{
    while(b)
    {
        int aux = b;
        b = a - (( a/b ) * b);
        a = aux;
    }
    return a;
}

int main()
{
    freopen("paralelograme.in", "r", stdin);
    freopen("paralelograme.out", "w", stdout);
    scanf("%d%d", &N, &M);
    for(int i = 2; i <= N+1; i++)
        for(int j = 2; j <= M+1; j++)
        {
            if(!Paras[i][j])
                Paras[i][j] = Paras[j][i] = 1LL*(i-2)*(j-2) + i*j - cmmdc(i-1,j-1) - 2; // number of parallelograms with all edges on the grid + number of parallelograms with only 2 edges on the grid
            Rects += 1LL*(M-j+2)*(N-i+2) * Paras[j][i]; // each parallelogram can be moved to (M-j+2)(N-i+2) places
        }
    printf("%lld", Rects);
}
Example : For a 2x2 grid we have 22 possible parallelograms.
My algorithm works and is correct, but I need to make it a little bit faster, and I want to know how that is possible.
P.S. I've heard that I should pre-compute the greatest common divisors and save them in an array, which would reduce the run-time to O(n*m), but I'm not sure how to do that without using the cmmdc (greatest common divisor) function.
Make sure N is not smaller than M:
if( N < M ){ swap( N, M ); }
Leverage the symmetry in your loops; you only need to run j from 2 to i:
for(int j=2; j<=min( i, M+1); j++)
You don't need the extra array Paras; drop it and use a temporary variable instead:
long long temparas = 1LL*(i-2)*(j-2) + i*j - cmmdc(i-1,j-1) - 2;
long long t1 = temparas * (M-j+2)*(N-i+2);
Rects += t1;
// check if the inverse case i <-> j must be considered
if( i != j && i <= M+1 ) // j <= N+1 is always true because of j <= i <= N+1
    Rects += temparas * (M-i+2)*(N-j+2); // note the swapped (i, j) factors for the mirrored case
Replace this line: b = a -(( a/b ) * b); using the remainder operator:
b = a % b;
Caching the cmmdc results would probably be possible; you can initialize the array using a sort of sieve algorithm: create a 2D array indexed by a and b, put a "2" at each position where a and b are multiples of 2, then put a "3" at each position where a and b are multiples of 3, and so on, roughly like this:
int gcd_cache[MaxN][MaxN]; // the dimension must be a compile-time constant, e.g. MaxN from the original code

void init_cache(){
    for (int u = 1; u < MaxN; ++u){
        for (int i = u; i < MaxN; i += u)
            for (int k = u; k < MaxN; k += u)
                gcd_cache[i][k] = u;
    }
}
Not sure if it helps a lot though.
The first comment in your code states "Keep It Simple!", so, in light of that, why not try solving the problem mathematically and printing the result?
If you select two lines of length N from your grid, you would find the number of parallelograms in the following way:
Select two points next to each other in both lines: there are (N-1)^2 ways of doing this, since you can position the two points at N-1 positions on each of the lines.
Select two points with one space between them in both lines: there are (N-2)^2 ways of doing this.
Select two points with two, three, and up to N-2 spaces between them.
The resulting number of combinations would be (N-1)^2+(N-2)^2+(N-3)^2+...+1.
By solving the sum, we get the formula: 1/6*N*(2*N^2-3*N+1). Check WolframAlpha to verify.
Now that you have a solution for two lines, you simply need to multiply it by the number of combinations of order 2 of M, which is M!/(2*(M-2)!).
Thus, the whole formula would be: 1/12*N*(2*N^2-3*N+1)*M!/(M-2)!, where the ! mark denotes factorial, and the ^ denotes a power operator (note that the same sign is not the power operator in C++, but the bitwise XOR operator).
This calculation requires fewer operations than iterating through the matrix.
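As a sketch, the formula above translates directly into constant-time code; M!/(2*(M-2)!) is evaluated as M*(M-1)/2 to avoid computing factorials, and the division by 6 is exact because N*(2*N^2-3*N+1) = (N-1)*N*(2*N-1):
#include <cstdio>

int main()
{
    long long N, M;
    std::scanf("%lld %lld", &N, &M);
    // sum of squares 1^2 + ... + (N-1)^2 == 1/6 * N * (2*N^2 - 3*N + 1)
    long long perPairOfLines = (N - 1) * N * (2 * N - 1) / 6;
    // number of ways to choose two of the M lines: M!/(2*(M-2)!) == M*(M-1)/2
    long long linePairs = M * (M - 1) / 2;
    std::printf("%lld\n", perPairOfLines * linePairs);
    return 0;
}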

Extracting lowest complex number from a vector

The vector holds pointers to the corner points of a rectangle. I want to be able to take the lowest and highest complex number and assign each to a different complex number. I've only tried finding the lowest with the code below, but it doesn't seem to find the lowest and just returns the first element.
vector< complex<double>* > xs;
typedef typename vector< complex<double>* >::iterator Iter;
xs.push_back(&pointa);
xs.push_back(&pointb);
xs.push_back(&pointc);
xs.push_back(&pointd);
for (Iter p = xs.begin(); p != xs.end(); ++p)
{
    if((*p)->real() < (*p+1)->real() && (*p)->imag() < (*p+1)->imag())
    {
        double a = (*p)->real();
        double b = (*p)->imag();
        complex<double> botleft_1(a,b);
    }
    else
    {
    }
}
Any suggestions?
The immediate bug in your code is that *p+1 means (*p)+1.
The next bug after you fix that to *(p+1) will be that you go one element off the end of the vector. You should be comparing each value with the lowest/highest so far, not with the next value in the vector.
There is in any case no such thing as the "lowest" or "highest" complex number - the complex numbers are not an ordered field (theorem). You can define any comparison operator you like on complex numbers, but it will be pretty arbitrary, for example it won't necessarily have ordered field properties such as a < b && c < d => a+c < b+d. The comparison you have defined does have that property, but is not a strict weak order, so it might not behave the way you expect once you start doing comparisons among 3 or more values.
For example, consider the values complex<double> a(1,1), b(0,3), c(2,2); and the comparison function lt(complex<double> lhs, complex<double> rhs) { return lhs.real() < rhs.real() && lhs.imag() < rhs.imag(); }.
Then lt(a,b) and lt(b,a) are both false. This means a and b are equivalent as far as the ordering is concerned.
Likewise, lt(b,c) and lt(c,b) are both false. This means b and c are equivalent as far as the ordering is concerned.
However, lt(a,c) is true. This means a and c are not equivalent as far as the ordering is concerned.
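A small sketch reproducing the three checks above in code:
#include <complex>
#include <iostream>

bool lt(std::complex<double> lhs, std::complex<double> rhs) {
    return lhs.real() < rhs.real() && lhs.imag() < rhs.imag();
}

int main() {
    std::complex<double> a(1, 1), b(0, 3), c(2, 2);
    std::cout << lt(a, b) << lt(b, a) << '\n'; // 00 -> a and b are "equivalent"
    std::cout << lt(b, c) << lt(c, b) << '\n'; // 00 -> b and c are "equivalent"
    std::cout << lt(a, c) << '\n';             // 1  -> yet a compares less than c
}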
In the loop you do not compare with the lowest number so far, only with the next number. Try something like this:
complex<double> *lowest = *xs.begin();
for (Iter p = xs.begin() + 1; p != xs.end(); ++p) {
    if ((*p)->real() < lowest->real() && (*p)->imag() < lowest->imag())
        lowest = *p;
}
After the loop, the variable lowest will point to the one you want.
Also, in your version of the loop, you compare with p + 1, which will be xs.end() on the last item, and dereferencing that is not valid.
Use boost::minmax_element
std::pair<Iter, Iter> pairit = boost::minmax_element( xs.begin(), xs.end(),
    []( complex<double>* pcomplexA, complex<double>* pcomplexB ) {
        // Suitable comparison predicate (see Steve Jessop's answer)
        return std::abs(*pcomplexA) < std::abs(*pcomplexB); // |a| < |b|
    });
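To get from the iterator pair back to the two complex values the question wants (a small follow-up sketch; pairit is the result of the call above, and the elements of xs are pointers, hence the double dereference):
// pairit.first points at the element that compares lowest under the predicate,
// pairit.second at the one that compares highest.
complex<double> lowest  = **pairit.first;
complex<double> highest = **pairit.second;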

Long array performance issue

I have an array of 175,000 char pointers. Each pointer points to a C-string of length 100, in which each character is either '1' or '0'. I need to compute the differences between the strings.
char* arr[175000];
So far, I have two for loops where I compare every string with every other string. The comparison function basically takes two C-strings and returns an integer, which is the number of positions at which the strings differ.
This is taking really long on my 4-core machine. Last time I left it to run for 45 min and it never finished executing. Please advise a faster solution or some optimizations.
Example:
000010
000001
have a difference of 2 since the last two bits do not match.
After I calculate the difference, I store the value in another array:
int holder;
for(int x = 0; x < UsedTableSpace; x++){
    int min = 10000000;
    for(int y = 0; y < UsedTableSpace; y++){
        if(x != y){
            // compr calculates difference between two c-string arrays
            int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
            if(tempDiff < min){
                min = tempDiff;
                holder = y;
            }
        }
    }
    similarity[holder]->inbound++;
}
With more information, we could probably give you better advice, but based on what I understand of the question, here are some ideas:
1. Since you're using each character to represent a 1 or a 0, you're using several times more memory than you need to, which creates a big performance impact when it comes to caching and such. Instead, represent your data using numeric values that you can think of as a series of bits.
2. Once you've implemented #1, you can grab an entire integer or long at a time and do a bitwise XOR operation to end up with a number that has a 1 in every place where the two numbers didn't have the same values. Then you can use one of the standard bit-counting tricks (a population-count intrinsic, for instance) to count these bits speedily; see the sketch after this list.
3. Work on "unrolling" your loops somewhat to reduce the number of jumps necessary. For example, the following code:
total = total + array[i];
total = total + array[i + 1];
total = total + array[i + 2];
... will work faster than just looping over total = total + array[i] three times. Jumps are expensive and interfere with the processor's pipelining. Update: I should mention that your compiler may be doing some of this for you already--you can check the compiled code to see.
4. Break your overall data set into chunks that will allow you to take full advantage of caching. Think of your problem as a "square" with the i index on one axis and the j index on the other. If you start with one i and iterate across all 175,000 j values, the first j values you visit will be gone from the cache by the time you get to the end of the line. On the other hand, if you take the top-left corner and go from j=0 to 256, most of the values on the j axis will still be in a low-level cache as you loop around to compare them with i=0, 1, 2, etc.
Lastly, although this should go without saying, I guess it's worth mentioning: make sure your compiler is set to optimize!
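A minimal sketch of points 1 and 2, assuming each 100-character row is repacked into two 64-bit words and using the GCC/Clang __builtin_popcountll intrinsic (std::popcount from <bit> works too if you have C++20):
#include <cstdint>

// Pack a 100-character '0'/'1' string into two 64-bit words.
void pack(const char* bits, std::uint64_t out[2]) {
    out[0] = out[1] = 0;
    for (int i = 0; i < 100; ++i)
        if (bits[i] == '1')
            out[i / 64] |= std::uint64_t(1) << (i % 64);
}

// Hamming distance between two packed rows: XOR, then count the set bits.
int packedDistance(const std::uint64_t a[2], const std::uint64_t b[2]) {
    return __builtin_popcountll(a[0] ^ b[0]) + __builtin_popcountll(a[1] ^ b[1]);
}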
One simple optimization is to compare each pair of strings only once. If the difference between A and B is 12, the difference between B and A is also 12. Your running time will drop by almost half.
In code:
int compr(const char* a, const char* b) {
    int d = 0, i;
    for (i = 0; i < 100; ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

void main_function(...) {
    int holder = 0; // declared here, as in your original code
    for(int x = 0; x < UsedTableSpace; x++){
        int min = 10000000;
        for(int y = x + 1; y < UsedTableSpace; y++){
            // compr calculates difference between two c-string arrays
            int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
            if(tempDiff < min){
                min = tempDiff;
                holder = y;
            }
        }
        similarity[holder]->inbound++;
    }
}
Notice the second for loop; I've changed the start index.
Another optimization is running the comparison loop on separate threads to take advantage of your 4 cores.
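For the threading suggestion, here is a sketch with OpenMP (my choice of library, not specified above), parallelizing the original full double loop since each x is independent there:
// Compile with -fopenmp. Each outer iteration only writes to its own
// min/holder, so the loop parallelizes cleanly; the shared counter
// increment is protected with an atomic.
#pragma omp parallel for
for (int x = 0; x < UsedTableSpace; x++) {
    int min = 10000000, holder = 0;
    for (int y = 0; y < UsedTableSpace; y++) {
        if (x == y) continue;
        int tempDiff = compr(similarity[x]->matrix, similarity[y]->matrix);
        if (tempDiff < min) {
            min = tempDiff;
            holder = y;
        }
    }
    #pragma omp atomic
    similarity[holder]->inbound++;
}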
What is your goal, i.e. what do you want to do with the Hamming Distances (which is what they are) after you've got them? For example, if you are looking for the closest pair, or most distant pair, you probably can get an O(n ln n) algorithm instead of the O(n^2) methods suggested so far. (At n=175000, n^2 is 15000 times larger than n ln n.)
For example, you could characterize each 100-bit number m by eight 4-bit numbers, each being the count of bits set in one of eight segments of m, and sort the resulting 32-bit signatures into ascending order. Signatures of the closest pair are likely to be nearby in the sorted list. It is easy to lower-bound the distance between two numbers if their signatures differ, giving an effective branch-and-bound process as less-distant numbers are found.
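A sketch of the signature idea, building on a packed two-word representation of each 100-bit number (the segment boundaries are my choice; the answer only specifies eight segments):
#include <cstdint>

// Build a 32-bit signature: eight segments of 12-13 bits each,
// with 4 bits per segment holding that segment's popcount (0..13).
std::uint32_t signature(const std::uint64_t row[2]) {
    std::uint32_t sig = 0;
    for (int seg = 0; seg < 8; ++seg) {
        int begin = seg * 100 / 8;        // 0, 12, 25, 37, 50, 62, 75, 87
        int end   = (seg + 1) * 100 / 8;  // ..., 100
        int count = 0;
        for (int bit = begin; bit < end; ++bit)
            count += (row[bit / 64] >> (bit % 64)) & 1;
        sig |= std::uint32_t(count) << (4 * seg);
    }
    return sig;
}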