Overloading operator[] to start at 1 and performance overhead - c++

I am doing some C++ computational mechanics (don't worry, no physics knowledge required here) and there is something that really bothers me.
Suppose I want to represent a 3D math Vector (nothing to do with std::vector):
class Vector {
public:
    Vector(double x=0., double y=0., double z=0.) {
        coordinates[0] = x;
        coordinates[1] = y;
        coordinates[2] = z;
    }
private:
    double coordinates[3];
};
So far so good. Now I can overload operator[] to extract coordinates:
double& Vector::operator[](int i) {
    return coordinates[i];
}
So I can type:
Vector V;
… //complex computation with V
double x1 = V[0];
V[1] = coord2;
The problem is, indexing from 0 is NOT natural here. I mean, when sorting arrays, I don't mind, but the fact is that the conventional notation in every paper, book or whatever always subscripts coordinates beginning with 1.
It may seem a quibble, but the fact is that in formulas, it always takes a double-take to understand what we are talking about. Of course, this is much worse with matrices.
One obvious solution is just a slightly different overload:
double& Vector::operator[](int i) {
    return coordinates[i-1];
}
so I can type
double x1 = V[1];
V[2] = coord2;
It seems perfect except for one thing: this i-1 subtraction, which seems a good candidate for a small overhead. Very small, you would say, but I am doing computational mechanics, so this is typically something we couldn't afford.
So now (finally) my question: do you think a compiler can optimize this, or is there a way to make it optimize? (templates, macros, pointer or reference kludges...)
Logically, in
double xi = V[i];
the integer between the brackets being a literal most of the time (except in 3-iteration for loops), inlining operator[] should make it possible, right?
(sorry for this looong question)
EDIT:
Thanks for all your comments and answers
I kind of disagree with people telling me that we are used to 0-indexed vectors.
From an object-oriented perspective, I see no reason for a math Vector to be 0-indexed just because it is implemented with a 0-indexed array. We're not supposed to care about the underlying implementation. Now, suppose I don't care about performance and use a map to implement the Vector class. Then I would find it natural to map '1' to the '1st' coordinate.
That said, I tried out 1-indexed vectors and matrices, and after some code writing, I found they do not interact nicely whenever I use an array. I thought Vector and containers (std::array, std::vector...) would not interact often (meaning, transferring data between one another), but it seems I was wrong.
Now I have a solution that I think is less controversial (please give me your opinion):
Every time I use a Vector in some physical context, I think of using an enum:
enum Coord {
    x = 0,
    y = 1,
    z = 2
};

Vector V;
V[x] = 1;
The only disadvantage I see is that these x, y and z can be redefined without even a warning...
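One way to address that last worry, for what it's worth: a C++11 scoped enum cannot be silently shadowed and does not leak its names into the enclosing scope. A sketch (the Axis name and the extra operator[] overload are my own invention, not part of the original code):

```cpp
#include <cstddef>

// Scoped enum (C++11): Axis::x etc. cannot collide with other names,
// and cannot be redefined without a diagnostic.
enum class Axis : std::size_t { x = 0, y = 1, z = 2 };

class Vector {
public:
    Vector(double x = 0., double y = 0., double z = 0.) {
        coordinates[0] = x;
        coordinates[1] = y;
        coordinates[2] = z;
    }
    // Extra overload taking the scoped enum, so call sites can write
    // V[Axis::x] without an explicit cast back to an integer.
    double& operator[](Axis a) {
        return coordinates[static_cast<std::size_t>(a)];
    }
private:
    double coordinates[3];
};
```

Usage is then V[Axis::x] = 1.;, at the cost of the slightly longer spelling.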

This should be measured, or verified by looking at the disassembly, but my guess is: the getter function is tiny and its argument is usually a constant. There is a high chance the compiler will inline the function and constant-fold the subtraction. In that case the runtime cost would be zero.

Why not try this:
class Vector {
public:
    Vector(double x=0., double y=0., double z=0.) {
        coordinates[1] = x;
        coordinates[2] = y;
        coordinates[3] = z;
    }
private:
    double coordinates[4];
};
If you are not instantiating your object in quantities of millions, then the memory waste might be affordable.

Have you actually profiled it or examined the generated code? That's how this question is answered.
If the operator[] implementation is visible then this is likely to be optimized to have zero overhead.

I recommend you define this in the header (.h) for your class. If you define it in the .cpp, then the compiler can't optimize as much. Also, your index should not be an "int", which can have negative values... make it a size_t:
class Vector {
    // ...
public:
    double& operator[](const size_t i) {
        return coordinates[i-1];
    }
};

You cannot say anything objective about performance without benchmarking. On x86, this subtraction can be compiled using relative addressing, which is very cheap. If operator[] is inlined, then the overhead is zero—you can encourage this with inline or with compiler-specific instructions such as GCC’s __attribute__((always_inline)).
If you must guarantee it, and the offset is a compile-time constant, then using a template is the way to go:
template<size_t I>
double& Vector::get() {
    return coordinates[I - 1];
}

double x = v.get<1>();
For all practical purposes, this is guaranteed to have zero overhead thanks to constant-folding. You could also use named accessors:
double Vector::x() const { return coordinates[0]; }
double Vector::y() const { return coordinates[1]; }
double Vector::z() const { return coordinates[2]; }
double& Vector::x() { return coordinates[0]; }
double& Vector::y() { return coordinates[1]; }
double& Vector::z() { return coordinates[2]; }
And for loops, iterators:
const double* Vector::begin() const { return coordinates; }
const double* Vector::end() const { return coordinates + 3; }
double* Vector::begin() { return coordinates; }
double* Vector::end() { return coordinates + 3; }
// (x, y, z) -> (x + 1, y + 1, z + 1)
for (auto& i : v) ++i;
Like many of the others here, however, I disagree with the premise of your question. You really should simply use 0-based indexing, as it is more natural in the realm of C++. The language is already very complex, and you need not complicate things further for those who will maintain your code in the future.

Seriously, benchmark this all three ways (i.e., compare the subtraction and the double[4] methods to just using zero-based indices in the caller).
It's entirely possible you'll get a huge win from forcing 16-byte alignment on some cache architectures, and equally possible the subtraction is effectively free on some compiler/instruction set/code path combinations.
The only way to tell is to benchmark realistic code.
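To make that advice concrete, here is a minimal sketch of such a harness for one of the three variants (the names and the iteration count are invented; the double[4] and plain zero-based variants would be timed the same way). The sum is printed so the optimizer cannot delete the loop:

```cpp
#include <chrono>
#include <cstdio>

// One variant under test: the i-1 overload from the question.
struct Vec1Based {
    double c[3];
    double& operator[](int i) { return c[i - 1]; }  // 1-based access
};

double run(long n) {
    Vec1Based v{{1.0, 2.0, 3.0}};
    double sum = 0.0;
    for (long i = 0; i < n; ++i)
        sum += v[1] + v[2] + v[3];   // literal indices: the -1 can fold away
    return sum;
}

void benchmark(long n) {
    auto t0 = std::chrono::steady_clock::now();
    double sum = run(n);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    // Printing the result keeps the loop observable to the optimizer.
    std::printf("sum=%.1f time=%lld ms\n", sum, (long long)ms);
}
```

Compile each variant at -O2 (or your production flags) and compare the reported times on realistic input sizes.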

Related

Performance implications of C++ unions

In Agner Fog's "Optimizing software in C++" it is stated that union forces a variable to be stored in memory even in cases where it otherwise could have been stored in a register, which might have performance implications. (e.g. page 148)
I often see code that looks like this:
struct Vector {
    union {
        struct {
            float x, y, z, w;
        };
        float v[4];
    };
};
This can be quite convenient, but now I'm wondering if there might be potential performance hit.
I wrote a small benchmark that compares Vector implementations with and without a union, and there were cases where the Vector without the union apparently performed better, although I don't know how trustworthy my benchmark is. (I compared three implementations: the union; x, y, z, w members; and v[4]. For example, v[4] seemed to be slower when passed by value, although the structs all have the same size.)
My question now is, whether this is something that people consider when writing actual production code? Do you know of cases where it was decided against unions specifically for this reason?
It appears the goal is to provide friendly names for elements of a vector type, and union is not the best way to do that. Comments have pointed out the undefined behavior already, and even if it works it's a form of aliasing which limits optimization opportunities.
Instead, avoid the whole mess and just add accessors that name the elements.
struct quaternion
{
    float vec[4];

    float &x() { return vec[0]; }
    float &y() { return vec[1]; }
    float &z() { return vec[2]; }
    float &w() { return vec[3]; }

    const float &x() const { return vec[0]; }
    const float &y() const { return vec[1]; }
    const float &z() const { return vec[2]; }
    const float &w() const { return vec[3]; }
};
In fact, much as Eigen does for its quaternion implementation:
https://eigen.tuxfamily.org/dox/Quaternion_8h_source.html
My question now is, whether this is something that people consider when writing actual production code?
No. That's premature optimization (the union construct itself is, too). Once the code is written in a somewhat clean and reliable way, it can be profiled and true bottlenecks addressed. There is no need to spend 5 minutes reasoning over some union to guess whether it will affect performance somewhere in the future. It either will or will not, and only profiling can tell.

Three-dimensional array as a vector of arrays

I have 3-dim double array with two of its dimensions known at compile time.
So to make it efficient, I'd like to write something like
std::vector<double[7][19]> v;
v.resize(3);
v[2][6][18] = 2 * 7*19 + 6 * 19 + 18;
It's perfect except it does not compile because of "v.resize(3);"
Please don't recommend me using nested vectors like
std::vector<std::vector<std::vector<double>>> v;
because I don't want to set sizes for known dimensions each time I extend v by its first dimension.
What is the cleanest solution here?
Why not a std::vector of std::array of std::array of double?
std::vector<std::array<std::array<double, 19>, 7>> v;
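For illustration, here is the snippet from the question adapted to that type; only the outer dimension is set at run time (the Grid alias and make_grid name are my own, introduced just for this sketch):

```cpp
#include <array>
#include <vector>

// The two inner dimensions are compile-time constants; only the outer
// dimension is resized at run time, which is exactly what was asked.
using Grid = std::vector<std::array<std::array<double, 19>, 7>>;

Grid make_grid() {
    Grid v;
    v.resize(3);                              // compiles, unlike double[7][19]
    v[2][6][18] = 2 * 7 * 19 + 6 * 19 + 18;   // = 398
    return v;
}
```

Elements are stored contiguously within each outer slot, so the layout is as cache-friendly as a raw 3D array.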
This is a good example of when it's both reasonable and useful to inherit from std::vector, providing:
you can be sure they won't be deleted using a pointer to their vector base class (which would be Undefined Behaviour as the vector destructor's not virtual)
you're prepared to write a forwarding constructor or two if you want to use them
ideally, you're using this in a relatively small application rather than making it part of a library API with lots of distributed client users - the more hassle client impact could be, the more pedantic one should be about full encapsulation
template <typename T, size_t Y, size_t Z>
struct Vec_3D : std::vector<std::array<std::array<T, Z>, Y>>
{
    T& operator()(size_t x, size_t y, size_t z)
    {
        return (*this)[x][y][z];
    }
    const T& operator()(size_t x, size_t y, size_t z) const
    {
        return (*this)[x][y][z];
    }
};
So little effort, then you've got a nicer and less error prone v(a, b, c) notation available.
Concerning the derivation, note that:
your derived type makes no attempt to enforce different invariants on the object
it doesn't add any data members or other bases
(The lack of data members / extra bases means even slicing and accidental deletion via a vector* are likely to work in practice, even though they should be avoided it's kind of nice to know you're likely playing with smoke rather than fire.)
Yes I know Alexandrescu, Sutter, Meyers etc. recommend not to do this - I've read their reasons very carefully several times and if you want to champion them even for this usage, please bring relevant technical specifics to the table....

Sort a vector of structs

I have a vector of structs and I need help with how to sort them according to one of the values, and if those 2 values are the same, then sort it according to another parameter.
This is similar to other questions, but it has more to it.
What I am trying to implement is the scan line based polygon fill algorithm.
I build the active edge list, but then I need to sort it based on the x value in each struct object. If the x values are the same, then they need to be sorted based on the inverse of the slopes for each struct object.
Here is the definition of the struct with the override operator < for normal sorting:
struct Bucket
{
    // Fields of a bucket list
    int ymax, x, dx, dy, sum;

    // Override the < operator, used for sorting based on the x value
    bool operator < (const Bucket& var) const
    {
        // Check if the x values are the same; if so,
        // sort based on the inverse of the slope (dx/dy)
        /*if(x == var.x)
            return (dx/dy) < (var.dx/var.dy);
        else*/
        return (x < var.x);
    }
};
I commented out the if then else statement because it does compile, but causes a floating point error and the program crashes.
The exact error is: "Floating point exception (core dumped)"
I also tried casting each division to (int) but that did not work either.
My question: Is there a way to do the sort similar to the way I have it, or should I write my own sort method.
If I should make my own sort method, please provide a link or something to a simple method which can help.
Thanks
You should do the division in floating point, because with integers, when you have for example 5/6, the result is 0, and division by 0 is not possible, as we know. That's why the program crashes.
So change the members of the structure to doubles. Then you will need to take care of some precision issues, but at least the program won't crash, assuming you do not allow a 0 value for dy.
You can use tuple which overrides different operators for lexicographic comparisons (http://en.cppreference.com/w/cpp/utility/tuple/operator_cmp)
typedef std::tuple<int, int, int, int, int> Bucket;
But it's a bit annoying to change your struct to a tuple. You can use tie that will make the tuple for you.
bool operator < (const Bucket& var) const
{
    return std::tie(x, dx/dy) < std::tie(var.x, var.dx/var.dy);
}
However, this solution won't compile, because std::tie works with lvalue references and dx/dy is a temporary.
bool operator < (const Bucket& var) const
{
    int slope = dx/dy;
    int var_slope = var.dx/var.dy;
    return std::tie(x, slope) < std::tie(var.x, var_slope);
}
It's not the most efficient solution, but readability is quite good.
Of course, you still have the division by 0 in this example.
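For completeness, the division can be avoided entirely by cross-multiplying. This is a sketch of my own, not from the answers above; it assumes dy is positive for the active edges (the comparison flips if dy can be negative):

```cpp
struct Bucket {
    int ymax, x, dx, dy, sum;

    // Division-free tie-break: for positive dy values,
    //   dx/dy < var.dx/var.dy   <=>   dx*var.dy < var.dx*dy.
    // No division means dy == 0 cannot crash, and there is no
    // integer-truncation error either.
    bool operator<(const Bucket& var) const {
        if (x == var.x)
            return dx * var.dy < var.dx * dy;
        return x < var.x;
    }
};
```

std::sort then works directly on a vector of Buckets with no custom sort routine needed.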

Efficient operator+

I have to compute large sums of 3d vectors, and a comparison of using a vector class with overloaded operator+ and operator* versus summing up separate components shows a performance difference of about a factor of three. I now assume the difference must be due to construction of objects in the overloaded operators.
How can one avoid the construction and improve performance?
I'm especially puzzled because the following is, AFAIK, basically the standard way to do it, and I would expect the compiler to optimize this. In real life, the sums are not going to be done within a loop but in quite large expressions (several tens of MB in total per executable) summing up different vectors; this is why operator+ is used below.
class Vector
{
    double x, y, z;
    ...
    Vector& operator+=(const Vector &v)
    {
        x += v.x;
        y += v.y;
        z += v.z;
        return *this;
    }
    Vector operator+(const Vector &v)
    {
        return Vector(*this) += v; // bad: construction and copy(?)
    }
    ...
};
// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];
// assume xx, yy, zz and vec are properly initialized
Vector sum(0,0,0);
for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}
// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
    sumxx = sumxx + xx[i];
    sumyy = sumyy + yy[i];
    sumzz = sumzz + zz[i];
}
Any help is greatly appreciated.
EDIT:
Thank you all for your great input; I have the performance at the same level now.
@Dima's and especially @Xeo's answers did the trick. I wish I could mark more than one answer as accepted. I'll test some of the other suggestions too.
This article has a really good discussion of how to optimize operators such as +, -, *, /.
Implement the operator+ as a free function like this in terms of operator+=:
Vector operator+(Vector lhs, Vector const& rhs){
    return lhs += rhs;
}
Notice how the lhs Vector is a copy and not a reference. This allows the compiler to make optimizations such as copy elision.
The general rule the article conveys: if you need a copy, make it in the parameters, so the compiler can optimize it. The article doesn't use this example; it uses operator= for the copy-and-swap idiom.
Why not replace
sum = sum + vec[i];
with
sum += vec[i];
... that should eliminate two calls to the copy constructor and one call to the assignment operator for each iteration.
But as always, profile and know where the expense is coming instead of guessing.
You might be interested in expression templates.
I implemented most of the optimizations being proposed here and compared them with the performance of a function call like
void Vector::isSumOf(Vector v1, Vector v2)
{
    x = v1.x + v2.x;
    ...
}
Repeatedly executing the same loop with a few billion vector summations for every method, in alternating order, did not result in the promised gains.
In the case of the member function posted by bbtrb, this method took 50% more time than the isSumOf() function call.
The free, non-member operator+ (Xeo) method needed up to double the time (100% more) of the isSumOf() function.
(gcc 4.6.3 -O3)
I am aware that this was not representative testing, but since I could not reproduce any performance gains by using operators at all, I suggest avoiding them if possible.
Usually, operator + looks like:
return Vector (x + v.x, y + v.y, z + v.z);
with a suitably defined constructor. This allows the compiler to do return value optimisation.
But if you're compiling for IA32, then SIMD would be worth considering, along with changes to the algorithms to take advantage of the SIMD nature. Other processors may have SIMD style instructions.
I think the difference in performance is caused by the compiler optimization here. Adding up elements of arrays in a loop can be vectorized by the compiler. Modern CPUs have instructions for adding multiple numbers in a single clock tick, such as SSE, SSE2, etc. This seems to be a likely explanation for the factor of 3 difference that you are seeing.
In other words, adding corresponding elements of two arrays in a loop may generally be faster than adding corresponding members of a class. If you represent the vector as an array inside your class, rather than x, y, and z, you may get the same speedup for your overloaded operators.
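A sketch of what that array-based layout could look like (my own illustration of the suggestion, not tested against the questioner's benchmark):

```cpp
// The three components live in one contiguous double[3], so the
// component-wise loop below is a plain loop over adjacent doubles,
// which gives the compiler a chance to vectorize it.
struct Vector {
    double c[3];

    Vector& operator+=(const Vector& v) {
        for (int i = 0; i < 3; ++i)
            c[i] += v.c[i];
        return *this;
    }
};
```

The named accessors x(), y(), z() from the other answers can sit on top of this layout unchanged.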
Are the implementations to your Vector operator functions directly in the header file or are they in a separate cpp file? In the header file they would typically be inlined in an optimized build. But if they are compiled in a different translation unit, then they often won't be (depending on your build settings). If the functions aren't inlined, then the compiler won't be able to do the type of optimization you are looking for.
In cases like these, have a look at the disassembly. Even if you don't know much about assembly code it's usually pretty easy to figure out what's different in simple cases like these.
Actually, if you look at any real matrix code, the operator+ and operator+= don't do that.
Because of the copying involved, they introduce a pseudo object into the expression and only do the real work when the assignment is executed. Using lazy evaluation like this also allows null operations to be removed during expression evaluation:
class Matrix;

class MatrixOp
{
public:
    virtual void DoOperation(Matrix& resultInHere) = 0;
};

class Matrix
{
public:
    void operator=(MatrixOp* op)
    {
        // No copying has been done.
        // You have built an operation tree.
        // Now you are going to evaluate the expression and put the
        // result into *this.
        op->DoOperation(*this);
    }
    MatrixOp* operator+(Matrix& rhs)   { return new MatrixOpPlus(*this, rhs); }
    MatrixOp* operator+(MatrixOp* rhs) { return new MatrixOpPlus(*this, rhs); }
    // etc
};
Of course this is a lot more complex than I have portrayed here in this simplified example. But if you use a library that has been designed for matrix operations then it will have already been done for you.
Your Vector implementation:
Implement the operator+() like this:
Vector Vector::operator+(const Vector &v)
{
    return Vector(x + v.x, y + v.y, z + v.z);
}
and declare the operator inline in your class definition (this avoids the stack pushes and pops of the return address and method arguments for each method call, if the compiler finds inlining useful).
Then add this constructor:
Vector::Vector(const double &x, const double &y, const double &z)
    : x(x), y(y), z(z)
{
}
which lets you construct a new vector very efficiently (like you would do in my operator+() suggestion)!
In the code using your Vector:
You did:
for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}
Unroll this kind of loop! Doing only one operation per iteration in a very large loop is very inefficient; unrolled, it can be optimized to use the SSE2/3 extensions or something similar. You should rather do something like this:
//Unrolled loop:
for(int i = 0; i + 10 <= N; i += 10)
{
    sum = sum + vec[i]
              + vec[i+1]
              + vec[i+2]
              + vec[i+3]
              + vec[i+4]
              + vec[i+5]
              + vec[i+6]
              + vec[i+7]
              + vec[i+8]
              + vec[i+9];
}
//Doing the "rest":
for(int i = (N / 10) * 10; i < N; ++i)
{
    sum = sum + vec[i];
}
(Note that this code is untested and may contain an "off-by-one" error or so...)
Note that you are comparing different things, because the data is not laid out the same way in memory. When using the Vector array, the coordinates are interleaved "x1,y1,z1,x2,y2,z2,...", while with the double arrays you have "x1,x2,...,y1,y2,...,z1,z2,...". I suppose this could have an impact on compiler optimizations or on how the cache handles it.
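If that layout difference turns out to matter, a structure-of-arrays wrapper reproduces the memory layout of the fast loop behind a single interface. A sketch (VectorSoA and sum are invented names, not from the question):

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays: x, y and z live in separate contiguous arrays,
// matching the "x1,x2,...,y1,y2,...,z1,z2,..." layout of the fast loop.
struct VectorSoA {
    std::vector<double> x, y, z;

    void sum(double& sx, double& sy, double& sz) const {
        sx = sy = sz = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            sx += x[i];   // each statement walks one contiguous array,
            sy += y[i];   // which is friendly to SIMD and to the cache
            sz += z[i];
        }
    }
};
```

The trade-off is that a single logical vector is now scattered across three arrays, so element-wise access to one vector touches three cache lines.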

Initializing a C++ vector to random values... fast

Hey, I'd like to make this as fast as possible because it gets called A LOT in a program I'm writing. Is there any faster way to initialize a C++ vector to random values than:
double range; // set to the range of a particular function I want to evaluate
std::vector<double> x(30, 0.0);
for (int i = 0; i < x.size(); i++) {
    x.at(i) = (rand()/(double)RAND_MAX) * range;
}
EDIT: Fixed x's initializer.
Right now, this should be really fast since the loop won't execute.
Personally, I'd probably use something like this:
struct gen_rand {
    double range;
public:
    gen_rand(double r = 1.0) : range(r) {}
    double operator()() {
        return (rand()/(double)RAND_MAX) * range;
    }
};
std::vector<double> x(num_items);
std::generate_n(x.begin(), num_items, gen_rand());
Edit: It's purely a micro-optimization that might make no difference at all, but you might consider rearranging the computation to get something like:
struct gen_rand {
    double factor;
public:
    gen_rand(double r = 1.0) : factor(r / RAND_MAX) {}
    double operator()() {
        return rand() * factor;
    }
};
Of course, there's a really good chance the compiler will already do this (or something equivalent) but it won't hurt to try it anyway (though it's really only likely to help with optimization turned off).
Edit2: "sbi" (as is usually the case) is right: you might gain a bit by initially reserving space, then using an insert iterator to put the data into place:
std::vector<double> x;
x.reserve(num_items);
std::generate_n(std::back_inserter(x), num_items, gen_rand());
As before, we're into such microscopic optimization, I'm not at all sure I'd really expect to see a difference at all. In particular, since this is all done with templates, there's a pretty good chance most (if not all) the code will be generated inline. In that case, the optimizer is likely to notice that the initial data all gets overwritten, and skip initializing it.
In the end, however, nearly the only part that's really likely to make a significant difference is getting rid of the .at(i). The others might, but with optimizations turned on, I wouldn't really expect them to.
I have been using Jerry Coffin's functor method for some time, but with the arrival of C++11, we have loads of cool new random number functionality. To fill an array with random float values we can now do something like the following . . .
#include <algorithm>  // std::generate_n
#include <functional> // std::bind
#include <random>
#include <vector>

const size_t elements = 300;
std::vector<float> y(elements);
std::uniform_real_distribution<float> distribution(0.0f, 2.0f); // values between 0 and 2
std::mt19937 engine; // Mersenne twister MT19937
auto generator = std::bind(distribution, engine);
std::generate_n(y.begin(), elements, generator);
See the relevant section of Wikipedia for more engines and distributions
Yes: whereas x.at(i) does bounds checking, x[i] does not. Also, your code was incorrect because you had failed to specify the size of x in advance. You need to use std::vector<double> x(n), where n is the number of elements you want to use; otherwise, your loop there will never execute.
Alternatively, you may want to make a custom iterator that generates random values and fill the vector using it; the std::vector constructor will initialize its elements anyway, so if you have a custom iterator class that generates random values, you may be able to eliminate a pass over the items.
In terms of implementing an iterator of your own, here is my untested code:
class random_iterator
{
public:
    typedef std::input_iterator_tag iterator_category;
    typedef double value_type;
    typedef int difference_type;
    typedef double* pointer;
    typedef double& reference;

    random_iterator() : _range(1.0), _count(0) {}
    random_iterator(double range, int count) :
        _range(range), _count(count) {}
    random_iterator(const random_iterator& o) :
        _range(o._range), _count(o._count) {}
    ~random_iterator() {}

    double operator*() const { return ((rand()/(double)RAND_MAX) * _range); }
    int operator-(const random_iterator& o) const { return o._count - _count; }
    random_iterator& operator++() { _count--; return *this; }
    random_iterator operator++(int) { random_iterator cpy(*this); _count--; return cpy; }
    bool operator==(const random_iterator& o) const { return _count == o._count; }
    bool operator!=(const random_iterator& o) const { return _count != o._count; }

private:
    double _range;
    int _count;
};
With the code above, it should be possible to use:
std::vector<double> x(random_iterator(range,number),random_iterator());
That said, the generate_n code from the other solution is simpler, and frankly, I would just explicitly fill the vector without resorting to anything fancy like this... but it is kind of cool to think about.
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <vector>

struct functor {
    functor(double v) : val(v) {}
    double operator()() const {
        return (rand()/(double)RAND_MAX) * val;
    }
private:
    double val;
};

int main(int argc, const char** argv) {
    const int size = 10;
    const double range = 3.0;
    std::vector<double> dvec;
    std::generate_n(std::back_inserter(dvec), size, functor(range));
    // print all
    std::copy(dvec.begin(), dvec.end(), std::ostream_iterator<double>(std::cout, "\n"));
    return 0;
}
Too late :(
You may consider using a pseudo-random number generator that gives output as a sequence. Since most PRNGs just provide a sequence anyways, that will be a lot more efficient than simply calling rand() over and over again.
But then, I think I really need to know more about your situation.
Why does this piece of code execute so much? Can you restructure your code to avoid re-generating random data so frequently?
How big are your vectors?
How "good" does your random number generator need to be? High-quality distributions tend to be more expensive to calculate.
If your vectors are large, are you reusing their buffer space, or are you throwing it away and reallocating it elsewhere? Creating new vectors willy-nilly is a great way to destroy your cache.
#Jerry Coffin's answer looks very good. Two other thoughts, though:
Inlining - All of your vector access will be very fast, but if the call to rand() is out-of-line, the function call overhead might dominate. If that's the case, you may need to roll your own pseudorandom number generator.
SIMD - If you're going to roll your own PRNG, you might as well make it compute 2 doubles (or 4 floats) at once. This will reduce the number of the int-to-float conversions as well as the multiplications. I've never tried it, but apparently there's a SIMD version of the Mersenne Twister that's quite good. A simple linear congruential generator might be good enough too (and that's probably what rand() is using already).
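As a concrete illustration of the roll-your-own idea, here is a minimal linear congruential generator sketch (the FastRand name is invented; the multiplier and increment are Knuth's MMIX LCG constants). Its statistical quality is modest, but each call is just a multiply, an add and a scale, with no call into the C library:

```cpp
#include <cstdint>

// Minimal LCG producing doubles in [0, range).
struct FastRand {
    std::uint64_t state;
    double range;

    FastRand(std::uint64_t seed, double r) : state(seed), range(r) {}

    double operator()() {
        // Knuth's MMIX multiplier and increment.
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        // Take the top 53 bits to build a double in [0, 1), then scale.
        return (state >> 11) * (1.0 / 9007199254740992.0) * range;
    }
};
```

It drops into the std::generate_n pattern from the other answers: std::generate_n(x.begin(), n, FastRand(seed, range));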
int main() {
    int size = 10;
    srand(time(NULL));

    std::vector<int> vec(size);
    std::generate(vec.begin(), vec.end(), rand);

    std::vector<int> vec_2(size);
    std::generate(vec_2.begin(), vec_2.end(), [](){ return rand() % 50; });
}
You need to include <vector>, <algorithm>, <ctime> and <cstdlib>.
The way I think about these is a rubber-meets-the-road approach.
In other words, there are certain minimal things that have to happen, no getting around it, such as:
the rand() function has to be called N times.
the result of rand() has to be converted to double and then multiplied by something.
the resulting numbers have to get stored in consecutive elements of an array.
The object is, at a minimum, to get those things done.
Other concerns, like whether or not to use an std::vector and iterators are fine as long as they don't add any extra cycles.
The easiest way to see if they add significant extra cycles is to single-step the code at the assembly language level.