Efficient operator+ - c++

I have to compute large sums of 3D vectors. Comparing a vector class with overloaded operator+ and operator* against summing the separate components shows a performance difference of about a factor of three. I now assume the difference must be due to the construction of objects in the overloaded operators.
How can one avoid the construction and improve performance?
I'm especially puzzled because the following is, as far as I know, basically the standard way to do it, and I would expect the compiler to optimize it. In real life, the sums are not going to be done within a loop but in quite large expressions (several tens of MBs in total per executable) summing up different vectors, which is why operator+ is used below.
class Vector
{
double x,y,z;
...
Vector&
Vector::operator+=(const Vector &v)
{
x += v.x;
y += v.y;
z += v.z;
return *this;
}
Vector
Vector::operator+(const Vector &v)
{
return Vector(*this) += v; // bad: construction and copy(?)
}
...
}
// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];
// assume xx, yy, zz and vec are properly initialized
Vector sum(0,0,0);
for(int i = 0; i < N; ++i)
{
sum = sum + vec[i];
}
// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
sumxx = sumxx + xx[i];
sumyy = sumyy + yy[i];
sumzz = sumzz + zz[i];
}
Any help is greatly appreciated.
EDIT:
Thank you all for your great input, I have the performance now at the same level.
@Dima's and especially @Xeo's answers did the trick. I wish I could mark more than one answer "accepted". I'll test some of the other suggestions too.

This article makes a really good argument about how to optimize operators such as +, -, *, /.
Implement the operator+ as a free function like this in terms of operator+=:
Vector operator+(Vector lhs, Vector const& rhs){
return lhs += rhs;
}
Notice that the lhs Vector is a copy and not a reference. This allows the compiler to make optimizations such as copy elision.
The general rule that article conveys: If you need a copy, do it in the parameters, so the compiler can optimize. The article doesn't use this example, but the operator= for the copy-and-swap idiom.

Why not replace
sum = sum + vec[i];
with
sum += vec[i];
... that should eliminate two calls to the copy constructor and one call to the assignment operator for each iteration.
But as always, profile and know where the expense is coming from instead of guessing.

You might be interested in expression templates.
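To give an idea of what that means, here is a minimal sketch of expression templates (assuming C++11). The names Vec3, VecExpr and Sum are made up for illustration and do not come from any library. The idea is that operator+ only builds a lightweight node describing the sum, and the arithmetic happens once, component by component, when the result is assigned to a Vec3.

```cpp
#include <cassert>
#include <cstddef>

// CRTP base: lets operator+ accept only our own expression types.
template <typename E>
struct VecExpr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Hypothetical 3D vector; construction from an expression evaluates it
// in one pass without creating intermediate Vec3 temporaries.
struct Vec3 : VecExpr<Vec3> {
    double c[3];

    Vec3(double x, double y, double z) : c{x, y, z} {}

    template <typename E>
    Vec3(const VecExpr<E>& e) : c{e.self()[0], e.self()[1], e.self()[2]} {}

    double operator[](std::size_t i) const {
        assert(i < 3);
        return c[i];
    }
};

// Node representing "l + r". It stores references only, so it must be
// consumed within the same full expression (do not keep it in `auto`).
template <typename L, typename R>
struct Sum : VecExpr<Sum<L, R>> {
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <typename L, typename R>
Sum<L, R> operator+(const VecExpr<L>& l, const VecExpr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}
```

With this, something like `Vec3 r = a + b + d;` builds a nested Sum and evaluates each component once. Whether it actually beats a plain inlined operator+ on a given compiler is something to measure rather than assume.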

I implemented most of the optimizations being proposed here and compared it with the performance of a function call like
Vector::isSumOf( Vector v1, Vector v2)
{
x = v1.x + v2.x;
...
}
Repeatedly executing the same loop with a few billion vector summations for every method, in alternating order, did not result in the promised gains.
The member function posted by bbtrb took 50% more time than the isSumOf() function call.
The free, non-member operator+ (Xeo) needed up to double the time (100% more) of the isSumOf() function.
(gcc 4.6.3 -O3)
I am aware that this was not representative testing, but since I could not reproduce any performance gains by using operators at all, I suggest avoiding them if possible.

Usually, operator + looks like:
return Vector (x + v.x, y + v.y, z + v.z);
with a suitably defined constructor. This allows the compiler to do return value optimisation.
But if you're compiling for IA32, then SIMD would be worth considering, along with changes to the algorithms to take advantage of the SIMD nature. Other processors may have SIMD style instructions.

I think the difference in performance is caused by the compiler optimization here. Adding up elements of arrays in a loop can be vectorized by the compiler. Modern CPUs have instructions for adding multiple numbers in a single clock tick, such as SSE, SSE2, etc. This seems to be a likely explanation for the factor of 3 difference that you are seeing.
In other words, adding corresponding elements of two arrays in a loop may generally be faster than adding corresponding members of a class. If you represent the vector as an array inside your class, rather than x, y, and z, you may get the same speedup for your overloaded operators.

Are the implementations to your Vector operator functions directly in the header file or are they in a separate cpp file? In the header file they would typically be inlined in an optimized build. But if they are compiled in a different translation unit, then they often won't be (depending on your build settings). If the functions aren't inlined, then the compiler won't be able to do the type of optimization you are looking for.
In cases like these, have a look at the disassembly. Even if you don't know much about assembly code it's usually pretty easy to figure out what's different in simple cases like these.

Actually if you look at any real matrix code the operator+ and the operator+= don't do that.
Because of the copying involved they introduce a pseudo object into the expression and only do the real work when the assignment is executed. Using lazy evaluation like this also allows NULL operations to be removed during expression evaluation:
class Matrix;
class MatrixOp
{
public: virtual void DoOperation(Matrix& resultInHere) = 0;
};
class Matrix
{
public:
void operator=(MatrixOp* op)
{
// No copying has been done.
// You have built an operation tree.
// Now you are going to evaluate the expression and put the
// result into *this
op->DoOperation(*this);
}
MatrixOp* operator+(Matrix& rhs) { return new MatrixOpPlus(*this,rhs);}
MatrixOp* operator+(MatrixOp* rhs){ return new MatrixOpPlus(*this,rhs);}
// etc
};
Of course this is a lot more complex than I have portrayed here in this simplified example. But if you use a library that has been designed for matrix operations then it will have already been done for you.

Your Vector implementation:
Implement the operator+() like this:
Vector
Vector::operator+(const Vector &v)
{
return Vector(x + v.x, y + v.y, z + v.z);
}
and define the operator inline in your class definition (this avoids pushing and popping the return address and method arguments on the stack for each method call, if the compiler finds it useful).
Then add this constructor:
Vector::Vector(const double &x, const double &y, const double &z)
: x(x), y(y), z(z)
{
}
which lets you construct a new vector very efficiently (like you would do in my operator+() suggestion)!
In the code using your Vector:
You did:
for(int i = 0; i < N; ++i)
{
sum = sum + vec[i];
}
Unroll this kind of loop! Doing only one operation per iteration (as it would be optimized to use the SSE2/3 extensions or something similar) in a very large loop is very inefficient. You should rather do something like this:
//Unrolled loop:
for(int i = 0; i <= N - 10; i += 10)
{
sum = sum + vec[i]
+ vec[i+1]
+ vec[i+2]
+ vec[i+3]
+ vec[i+4]
+ vec[i+5]
+ vec[i+6]
+ vec[i+7]
+ vec[i+8]
+ vec[i+9];
}
//Doing the "rest":
for(int i = (N / 10) * 10; i < N; ++i)
{
sum = sum + vec[i];
}
(Note that this code is untested and may contain an "off-by-one" error or so...)

Note that you are comparing different things, because the data is not laid out in memory in the same way. With the Vector array the coordinates are interleaved as "x1,y1,z1,x2,y2,z2,...", while with the double arrays you have "x1,x2,...,y1,y2,...,z1,z2,...". I suppose this could have an impact on compiler optimizations or on how the cache handles it.
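To make the two layouts concrete, here is a small sketch (assuming C++11). Vec3 and Vec3Arrays are hypothetical stand-ins for the Vector class and the three double arrays in the question. Both functions compute the same sums; only the memory layout they walk over differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's Vector class.
struct Vec3 { double x, y, z; };

// Array of structures: memory holds x0,y0,z0,x1,y1,z1,...
Vec3 sum_aos(const std::vector<Vec3>& v) {
    Vec3 s{0.0, 0.0, 0.0};
    for (const Vec3& e : v) {
        s.x += e.x;
        s.y += e.y;
        s.z += e.z;
    }
    return s;
}

// Structure of arrays: memory holds x0,x1,... then y0,y1,... then z0,z1,...
struct Vec3Arrays { std::vector<double> x, y, z; };

Vec3 sum_soa(const Vec3Arrays& a) {
    assert(a.x.size() == a.y.size() && a.y.size() == a.z.size());
    Vec3 s{0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < a.x.size(); ++i) {
        s.x += a.x[i];
        s.y += a.y[i];
        s.z += a.z[i];
    }
    return s;
}
```

Timing both versions on the same data is a fairer comparison than timing an operator+ loop against three separate arrays, since it separates the layout effect from the operator overhead.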

Related

Returning class object by value or pass by reference, which will be faster here

Suppose I have a class matrix. To add two large matrices, I can either overload operator+ or define a function Add, like this:
matrix operator + (const matrix &A, const matrix &B) {
matrix C;
/* all required things */
for(int i .........){
C(i)=A(i)+B(i);
}
return C;
}
and I have call like,
matrix D = A+B;
Now if I define the Add function,
void Add(const matrix &A, const matrix &B, matrix &C) {
C.resize(); // according to dimensions of A, B
// for C.resize , copy constructor will be called.
/* all required things */
for(int i .........){
C(i)=A(i)+B(i);
}
}
And I have to call this function like,
matrix D;
Add(A,B,D); //D=A+B
Which of the above methods is faster and more efficient? Which should we use?
Without using any tools,
like a profiler (e.g. gprof) to see how much time is spent where,
nor any other tools like "valgrind + cachegrind" to see how many operations are performed in either of the two functions,
And also ignoring all compiler optimizations i.e. compiling with -O0,
And assuming whatever else there is in the two functions (what you represent as /* all required things */), is trivial,
Then all one can say, just by looking at both your functions, is that both have a complexity of O(n), since both spend most of their time in the for loops. Depending on the size of the matrices, and especially if they are really large, everything else in the code is pretty much insignificant when it comes down to speed.
So, what your question boils down to, in my opinion is,
In how much time it takes,
to call the constructor of C
plus returning this C,
versus,
How much time it takes,
to call the resize function for C,
plus calling the copy constructor of C.
This you can 'crudely but relatively quickly' measure using the std::clock() or chrono as shown here in multiple answers.
#include <chrono>
auto t_start = std::chrono::high_resolution_clock::now();
matrix D = A+B; // To compare replace on 2nd run with this ---> matrix D; Add(A,B,D);
auto t_end = std::chrono::high_resolution_clock::now();
double elapsedTimeMs = std::chrono::duration<double, std::milli>(t_end-t_start).count();
Although once again, in my honest opinion if your matrices are big, most of the time would go in the for loop.
p.s. Premature optimization is the root of all evil.

Core Dumped While Multiplying Iteratively

I am trying to do something very simple. I have a class for functions, and a class for polynomials derived from the function class. In the polynomial, I am overloading the *= operator. But, when I invoke this operator, the program dumps the core and crashes.
Polynomial& Polynomial::operator*= (double c)
{
for(int i = 0; i <= degree; i++)
a[i] = a[i] * c;
return *this;
}
The polynomial class holds the coefficients in array a. The index of a directly relates to the power of x for that particular coefficient. Function main hands us the constant c, which we then multiply each coefficient by.
The prototype for the function is part of an assignment, or I would change it. I'm assuming there's something I'm doing wrong with respect to the return type. Any help is appreciated.
I am willing to provide more code if requested.
The return type is fine. I'm guessing the problem is i <= degree instead of i < degree. Arrays in C++ are 0-based.
EDIT: or perhaps you want to keep that as <= for consistency with the polynomial, in which case you need to allocate degree+1 items for your array.
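As a sketch of the second option, here is a minimal Polynomial that allocates degree + 1 coefficients so that the i <= degree loop stays in bounds. The real class from the question is not shown, so the constructor, the coeff() accessor and the use of std::vector are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal Polynomial; a[i] is the coefficient of x^i.
class Polynomial {
public:
    // A polynomial of degree n has n + 1 coefficients (powers 0..n).
    explicit Polynomial(int degree_)
        : degree(degree_),
          a(static_cast<std::size_t>(degree_) + 1, 0.0) {
        assert(degree_ >= 0);
    }

    double& coeff(int power) {
        assert(power >= 0 && power <= degree);
        return a[static_cast<std::size_t>(power)];
    }

    // Same loop as in the question; valid because a holds degree + 1 items.
    Polynomial& operator*=(double c) {
        for (int i = 0; i <= degree; i++)
            a[static_cast<std::size_t>(i)] *= c;
        return *this;
    }

private:
    int degree;
    std::vector<double> a;
};
```

If the class uses a raw array with new[], the same rule applies: allocate degree + 1 doubles, not degree.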

Howto use matlab like operator in C++ with stl vector

In MATLAB we can use the colon operator as follows:
M = [1 2 3 4; 5 6 7 8; 9 10 11 12]
M(:,1) = M(:,2) + M(:,3)
to apply the same operation to all the rows of a matrix.
I'm wondering if we can apply the same kind of operation to set a range of values in a std::vector, as is done with MATLAB's colon (:) operator. Indeed, I'm using a vector to store the matrix values.
vector<int> M;
Thanks in advance.
There are C++ libraries that allow one to handle matrices pretty much as MATLAB does (also allowing for SIMD vectorization); you may want to consider Eigen, for instance.
If you don't want to rely on an external library, you may want to consider std::valarray, which was explicitly designed for numeric computations (with valarrays you can use std::slice to extract rows or columns, and std::gslice for submatrices, as you need).
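Here is a rough sketch of the std::valarray route (assuming C++11), for the column example from the question. The helper name add_columns and the row-major storage are assumptions made for this example; column numbers are 0-based, so MATLAB's M(:,1) is column 0 here.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Sets column `dst` to the sum of columns `a` and `b` of a row-major
// rows x cols matrix stored in a flat std::valarray.
void add_columns(std::valarray<int>& m, std::size_t rows, std::size_t cols,
                 std::size_t dst, std::size_t a, std::size_t b) {
    assert(m.size() == rows * cols);
    assert(dst < cols && a < cols && b < cols);
    // std::slice(start, count, stride): one element per row, `cols` apart.
    const std::valarray<int> col_a = m[std::slice(a, rows, cols)];
    const std::valarray<int> col_b = m[std::slice(b, rows, cols)];
    const std::valarray<int> sum = col_a + col_b;
    m[std::slice(dst, rows, cols)] = sum;
}
```

For the 3 x 4 matrix holding 1..12, calling add_columns(M, 3, 4, 0, 1, 2) plays the role of M(:,1) = M(:,2) + M(:,3).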
You can define a "free" operator that takes std::vector<int> as parameters:
std::vector<int> operator +(const std::vector<int> &a, const std::vector<int> &b)
{
std::vector<int> result(a); // Copy the 'a' operand.
// The usual matrix addition is defined for two matrices of the same dimensions.
if (a.size() == b.size())
{
// The sum of two matrices a and b is computed by adding corresponding elements.
for (std::vector<int>::size_type i = 0; i < b.size(); ++i)
// Add the values of the 'b' operand.
result[i] += b[i];
}
// If the sizes differ, an unmodified copy of 'a' is returned.
return result;
}
int main(int argc, char **argv)
{
std::vector<int> a;
std::vector<int> b;
// The copy constructor takes care of the assignment.
std::vector<int> c(a + b);
return 0;
}
The implementation of operator+ is quite naive; it is just an idea. Beware! I've placed a check before the add operation: if the check fails, a copy of the a operand is returned, and I think this will not be your desired behavior.
I've placed the operator in the same file as main, but you can place it wherever you want as long as it is visible where the operation is performed.
Of course, you can define whatever operators you want in order to chain operations and achieve more complex ones.
My maths concepts are quite old, but I hope it helps.

How to use std::accumulate to neatly sum values in a vector pointed by separately defined indices (replacing loops)

I was wondering if there's a neater (or better yet, more efficient), method of summing values of a vector/(asymmetric) matrix (a matrix having structure like symmetry, could of course be exploited in looping, but not that pertinent to my question) pointed by a collection of indices. Basically this code could be used to calculate, say, a cost of a route through a 2D matrix. I'm looking for a way to utilize CPU, not GPU.
Here's some relevant code; the one I'm more interested in is the first case. I was thinking it's possible to use std::accumulate with a lambda capturing the indices vector, but then I got wondering if there's already a neater way, perhaps with some other operator. Not a "real problem", as looping is quite clear for my tastes too, but I'm on the hunt for the super-neat or more efficient one-liner...
template<typename out_type>
out_type sum(std::vector<float> const& matrix, std::vector<int> const& indices)
{
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; ++i)
{
const int index = indices.size() * indices[i] + indices[i + 1];
cost += matrix[index];
}
const int index = indices.size() * indices[indices.size() - 1] + indices[0];
cost += matrix[index];
return cost;
}
template<typename out_type>
out_type sum(std::vector<std::vector<float>> const& matrix, std::vector<int> const& indices)
{
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
cost += matrix[indices[i]][indices[i + 1]];
}
cost += matrix[indices[indices.size() - 1]][indices[0]];
return cost;
}
Oh, and PPL/TBB are fair game too.
Edit
As an afterthought and as commented to John, would there be a place to employ std::common_type in the calculation as the input and output types may differ? This is a bit of hand-waving and more like learning techniques and libraries. A form of code kata, if you will.
Edit 2
Now, there's one option to make the loops faster, explained in blog writing How to process a STL vector using SSE code by a blogger theowl84. The code uses __m128 directly, but I wonder if there's something in DirectXMath library too.
Edit 3
Now, after writing some concrete code, I found std::accumulate wouldn't get me far. Or at least I couldn't find a way to do the [indices[i + 1] part in matrix[indices[i]][indices[i + 1]]; in a neat way, as std::accumulate itself gives access to only the current value and the sum. In that light, it looks like novelocrat's approach would be the most fruitful one.
DeadMG proposed using parallel_reduce with associativity caveats, further commented on by novelocrat. I didn't check whether I could use parallel_reduce, as the interface looked somewhat cumbersome for a quick try. Other than that, even though my code executes serially, it would suffer from the same floating-point sum issues as the parallel reduction version, though I think the parallel version could be (much) more unpredictable than the serial one.
This is somewhat tangential, but those who stumble here and have read this far may be (very) interested in the article Wandering Precision on The NAG blog, which details some intricacies introduced even by hardware instruction re-ordering! Then there are some ruminations about this very issue in a distributed setting in the #AltDevBlogADay article Synchronous RTS Engines and a Tale of Desyncs. Also, ACCU (the general mailing list is excellent, by the way, and it's free to join) features several articles (e.g. this) on floating point accuracy. As a tangent to the tangent, I found Fernando Cacciola's Robustness issues in geometric computing to be a good article to read, originally from the ACCU mailing list.
And then the std::common_type. I couldn't find a use for it. If I had two different types as parameters, then the return value could/should be decided by std::common_type. Perhaps more pertinent is std::is_convertible with static_assert, to make sure the desired result type is convertible from the argument types (with a clean error message). Other than that, I can only think of a check that the accuracy of the return value/intermediate calculation value is sufficient to represent the result of the summation without overflows and things like that, but I haven't come across a standard facility for that.
That's about it, I think, ladies and gentlemen. I enjoyed myself, and I hope those reading this got something out of it too.
You could produce an iterator that takes matrix and indices and yields the appropriate values.
class route_iterator
{
vector<vector<float>> const& matrix;
vector<int> const& indices;
int i;
public:
route_iterator(vector<vector<float>> const& matrix_, vector<int> const& indices_,
int begin = 0)
: matrix(matrix_), indices(indices_), i(begin)
{ }
float operator*() {
return matrix[indices[i]][indices[(i + 1) % indices.size()]];
}
route_iterator& operator++() {
++i;
return *this;
}
};
Then your accumulate runs from route_iterator(matrix, indices) to route_iterator(matrix, indices, indices.size()).
Admittedly, though, this sequentializes without a smart compiler turning it into something parallel. What you really want are parallel map and fold (accumulate) operations.
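For std::accumulate to accept such an iterator, it also needs the usual iterator typedefs and comparison operators. Here is a fuller sketch of the same idea (assuming C++11); the helper route_cost and the use of pointers instead of reference members (so the iterator stays assignable) are my additions, not part of the class above.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Input iterator over the edge costs matrix[indices[i]][indices[i + 1]],
// wrapping around to indices[0] after the last index.
class route_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = float;
    using difference_type = std::ptrdiff_t;
    using pointer = const float*;
    using reference = float;

    route_iterator(const std::vector<std::vector<float>>& matrix,
                   const std::vector<int>& indices, std::size_t pos = 0)
        : matrix_(&matrix), indices_(&indices), pos_(pos) {}

    float operator*() const {
        const std::vector<int>& idx = *indices_;
        return (*matrix_)[idx[pos_]][idx[(pos_ + 1) % idx.size()]];
    }
    route_iterator& operator++() { ++pos_; return *this; }
    route_iterator operator++(int) {
        route_iterator old = *this;
        ++pos_;
        return old;
    }
    // Only the position is compared; both iterators are assumed to
    // refer to the same matrix and indices.
    bool operator==(const route_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const route_iterator& o) const { return pos_ != o.pos_; }

private:
    const std::vector<std::vector<float>>* matrix_;
    const std::vector<int>* indices_;
    std::size_t pos_;
};

// Hypothetical helper: total cost of the closed route.
float route_cost(const std::vector<std::vector<float>>& matrix,
                 const std::vector<int>& indices) {
    assert(matrix.size() >= indices.size() || indices.empty() || !matrix.empty());
    return std::accumulate(route_iterator(matrix, indices),
                           route_iterator(matrix, indices, indices.size()),
                           0.0f);
}
```

An empty indices vector gives a cost of 0, because the begin and end iterators compare equal and nothing is dereferenced.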
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
cost += matrix[indices[i]][indices[i + 1]];
}
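This is basically std::accumulate. PPL provides (and so does TBB, if I recall) parallel_reduce. This requires associativity but not commutativity; + over the reals and integers is associative, while floating-point addition is only approximately so, which means a parallel result can differ slightly from the serial one.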
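For a serial version that needs no custom iterator, std::inner_product can walk the indices pairwise, which gives access to both indices[i] and indices[i + 1]. This is only a sketch (assuming C++11), and the function name route_cost_pairs is made up for the example: the "multiply" step is replaced by a matrix lookup and the "add" step stays a plain sum.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Cost of the closed route indices[0] -> indices[1] -> ... -> indices[0].
float route_cost_pairs(const std::vector<std::vector<float>>& matrix,
                       const std::vector<int>& indices) {
    if (indices.empty()) return 0.0f;
    // Pair each index with its successor; look up the edge cost
    // instead of multiplying, then add it to the running total.
    const float open_cost = std::inner_product(
        indices.begin(), indices.end() - 1,  // "from" positions: i
        indices.begin() + 1,                 // "to" positions: i + 1
        0.0f,
        [](float acc, float edge) { return acc + edge; },
        [&matrix](int from, int to) { return matrix[from][to]; });
    // Close the loop: last index back to the first.
    return open_cost + matrix[indices.back()][indices.front()];
}
```

The summation order is the same as in the hand-written loop, so the floating-point result should match it.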

Overloading operator[] to start at 1 and performance overhead

I am doing some C++ computational mechanics (don't worry, no physics knowledge required here) and there is something that really bothers me.
Suppose I want to represent a 3D math Vector (nothing to do with std::vector):
class Vector {
public:
Vector(double x=0., double y=0., double z=0.) {
coordinates[0] = x;
coordinates[1] = y;
coordinates[2] = z;
}
private:
double coordinates[3];
};
So far so good. Now I can overload operator[] to extract coordinates:
double& Vector::operator[](int i) {
return coordinates[i] ;
}
So I can type:
Vector V;
… //complex computation with V
double x1 = V[0];
V[1] = coord2;
The problem is, indexing from 0 is NOT natural here. I mean, when sorting arrays, I don't mind, but the conventional notation in every paper, book or whatever always subscripts coordinates beginning with 1.
It may seem a quibble, but in formulas it always takes a double-take to understand what we are talking about. Of course, this is much worse with matrices.
One obvious solution is just a slightly different overloading :
double& Vector::operator[](int i) {
return coordinates[i-1] ;
}
so I can type
double x1 = V[1];
V[2] = coord2;
It seems perfect except for one thing: this i-1 subtraction, which seems a good candidate for a small overhead. Very small, you would say, but I am doing computational mechanics, so this is typically something we couldn't afford.
So now (finally) my question: do you think a compiler can optimize this, or is there a way to make it optimize ? (templates, macro, pointer or reference kludge...)
Logically, in
double xi = V[i];
the integer between the bracket being a literal most of the time (except in 3-iteration for loops), inlining operator[] should make it possible, right ?
(sorry for this looong question)
EDIT:
Thanks for all your comments and answers
I kind of disagree with people telling me that we are used to 0-indexed vectors.
From an object-oriented perspective, I see no reason for a math Vector to be 0-indexed just because it is implemented with a 0-indexed array. We're not supposed to care about the underlying implementation. Now, suppose I don't care about performance and use a map to implement the Vector class. Then I would find it natural to map '1' to the '1st' coordinate.
That said, I tried out 1-indexed vectors and matrices, and after writing some code, I find they do not interact nicely whenever an array is around. I thought Vector and containers (std::array, std::vector...) would not interact often (meaning, transferring data between one another), but it seems I was wrong.
Now I have a solution that I think is less controversial (please give me your opinion):
Every time I use a Vector in some physical context, I think of using an enum :
enum Coord {
x = 0,
y = 1,
z = 2
};
Vector V;
V[x] = 1;
The only disadvantage I see is that these x, y and z can be redefined without even a warning...
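One way around that disadvantage, assuming C++11 is available, would be a scoped enum: the names then live inside Coord and cannot collide with other x, y or z identifiers. This is only a sketch; the operator[] overload taking a Coord is my assumption about how the Vector class would be adapted.

```cpp
#include <cassert>
#include <cstddef>

// Scoped enum: must be written Coord::x, so a local variable or
// parameter named x cannot silently replace it.
enum class Coord : std::size_t { x = 0, y = 1, z = 2 };

class Vector {
public:
    Vector(double x = 0., double y = 0., double z = 0.)
        : coordinates{x, y, z} {}

    // Only a Coord is accepted, so V[3] or V[-1] will not compile.
    double& operator[](Coord c) {
        const std::size_t i = static_cast<std::size_t>(c);
        assert(i < 3);
        return coordinates[i];
    }
    double operator[](Coord c) const {
        const std::size_t i = static_cast<std::size_t>(c);
        assert(i < 3);
        return coordinates[i];
    }

private:
    double coordinates[3];
};
```

The price is more typing (V[Coord::x] instead of V[x]), and loops over the coordinates need either a cast or a separate integer overload.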
This one should be measured or verified by looking at the disassembly, but my guess is: The getter function is tiny and its arguments are constant. There is a high chance the compiler will inline the function and constant-fold the subtraction. In that case the runtime cost would be zero.
Why not to try this:
class Vector {
public:
Vector(double x=0., double y=0., double z=0.) {
coordinates[1] = x;
coordinates[2] = y;
coordinates[3] = z;
}
private:
double coordinates[4];
};
If you are not instantiating your objects by the millions, then the memory waste might be affordable.
Have you actually profiled it or examined the generated code? That's how this question is answered.
If the operator[] implementation is visible then this is likely to be optimized to have zero overhead.
I recommend you define this in the header (.h) for your class. If you define it in the .cpp then the compiler can't optimize as much. Also, your index should not be an "int" which can have negative values... make it a size_t:
class Vector {
// ...
public:
double& operator[](const size_t i) {
return coordinates[i-1] ;
}
};
You cannot say anything objective about performance without benchmarking. On x86, this subtraction can be compiled using relative addressing, which is very cheap. If operator[] is inlined, then the overhead is zero—you can encourage this with inline or with compiler-specific instructions such as GCC’s __attribute__((always_inline)).
If you must guarantee it, and the offset is a compile-time constant, then using a template is the way to go:
template<size_t I>
double& Vector::get() {
return coordinates[I - 1];
}
double x = v.get<1>();
For all practical purposes, this is guaranteed to have zero overhead thanks to constant-folding. You could also use named accessors:
double Vector::x() const { return coordinates[0]; }
double Vector::y() const { return coordinates[1]; }
double Vector::z() const { return coordinates[2]; }
double& Vector::x() { return coordinates[0]; }
double& Vector::y() { return coordinates[1]; }
double& Vector::z() { return coordinates[2]; }
And for loops, iterators:
const double* Vector::begin() const { return coordinates; }
const double* Vector::end() const { return coordinates + 3; }
double* Vector::begin() { return coordinates; }
double* Vector::end() { return coordinates + 3; }
// (x, y, z) -> (x + 1, y + 1, z + 1)
for (auto& i : v) ++i;
Like many of the others here, however, I disagree with the premise of your question. You really should simply use 0-based indexing, as it is more natural in the realm of C++. The language is already very complex, and you need not complicate things further for those who will maintain your code in the future.
Seriously, benchmark this all three ways (ie, compare the subtraction and the double[4] methods to just using zero-based indices in the caller).
It's entirely possible you'll get a huge win from forcing 16-byte alignment on some cache architectures, and equally possible the subtraction is effectively free on some compiler/instruction set/code path combinations.
The only way to tell is to benchmark realistic code.