I am a novice programmer and have only briefly covered the anatomy of a function call (setting up the stack, etc.). I can write a function in two different ways and I'm wondering which (if either) is more efficient. This is for a finite element program, so this function could be called several thousand times. It uses the linear algebra library Armadillo.
First way:
void Q4::stiffness(mat &stiff)
{
    stiff.zeros(); // sets all elements of the matrix to zero
    // a bunch of linear algebra calculations
    // ...
    stiff *= h;
}
int main()
{
    mat elementStiffness(Q4__DOF, Q4__DOF);
    mat globalStiffness(totalDOF, totalDOF);
    for (int i = 0; i < reallyHugeNumber; i++)
    {
        elements[i].stiffness(elementStiffness, PSTRESS); // pass the matrix itself, not its address
        assemble(&globalStiffness, &elementStiffness);
    }
    return 0;
}
Second way:
mat Q4::stiffness()
{
    mat stiff(Q4__DOF, Q4__DOF); // initializes element stiffness matrix
    // a bunch of linear algebra calculations
    // ...
    stiff *= h;
    return stiff; // returning the named local keeps NRVO possible
}
int main()
{
    mat elementStiffness(Q4__DOF, Q4__DOF);
    mat globalStiffness(totalDOF, totalDOF);
    for (int i = 0; i < reallyHugeNumber; i++)
    {
        elementStiffness = elements[i].stiffness(PSTRESS);
        assemble(&globalStiffness, &elementStiffness);
    }
    return 0;
}
I think what I'm asking is: using the second way, is mat stiff pushed onto the stack and then copied into elementStiffness? I imagine that pushing the matrix onto the stack and then copying it is much more expensive than passing a matrix by reference and setting its elements to zero.
Passing a variable by reference and doing your calculations on that variable is a lot cheaper. When C++ returns a variable, it can end up copying it twice.
First inside the function, and then via the copy constructor or assignment operator, depending on whether the value initializes a new variable or is assigned to an existing one. If you have a user-defined type with a long list of internal state variables, this copying is going to take a big chunk of the processing time.
EDIT#1: I forgot about C++11 and std::move. Many compilers can optimize functions like this: instead of copying an lvalue, they move the returned object as an rvalue, which just transfers ownership of the memory rather than duplicating its contents.
On the surface, I think the second way will be much more expensive, as it both constructs a new mat and copies it back to the caller on every call. Of course, that depends a bit on how often the mat construction takes place in the first way.
That said, I think the best thing to do is to set up an experiment and test to make sure (agreeing with the suggestion to research).
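For example, a minimal timing sketch (assuming Armadillo and C++11; the function bodies here are placeholders, not the real element calculations):
#include <armadillo>
#include <chrono>
#include <iostream>

using namespace arma;

// Style 1: reuse the caller's storage through a reference.
void stiffness_by_ref(mat &stiff)
{
    stiff.zeros(); // sets all elements to zero
    stiff += 1.0;  // dummy stand-in for the linear algebra
}

// Style 2: construct and return a fresh matrix on every call.
mat stiffness_by_value()
{
    mat stiff(8, 8, fill::zeros);
    stiff += 1.0; // dummy stand-in for the linear algebra
    return stiff;
}

int main()
{
    const int iterations = 100000;
    mat elementStiffness(8, 8);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        stiffness_by_ref(elementStiffness);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        elementStiffness = stiffness_by_value();
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "by reference: " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "by value:     " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    return 0;
}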
Related
I am trying to use an opencv function that accepts std::vector<cv::Point3f> among other parameters. In my program, I have an std::vector<cv::Point3f> worldPoints and another std::vector<int> mask, both of larger dimension than what I want to send.
What I want to do is pass to the opencv function only the entries that have a respective non-zero mask, as efficiently as possible.
std::vector<cv::Point3f> worldPointsSubset;
for (size_t i = 0; i < mask.size(); i++) {
    if (mask[i] != 0) {
        worldPointsSubset.push_back(worldPoints[i]);
    }
}
// Then use worldPointsSubset in function
Is there any way around this, possibly involving no copying of data?
EDIT 1: The function I am referring to is solvePnPRansac()
The function that you call requires a vector of Point3f, so if a masked vector is all you have, then you have to copy the data first. There is no way around this if the function doesn't accept a vector along with its mask.
To see if this copy is an issue, first measure the drop in performance and check whether this copy is actually a bottleneck. If it is, the first thing to do is count the number of points you need and reserve that capacity in worldPointsSubset.
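A minimal sketch of that counting-and-reserving approach (a drop-in replacement for the loop in the question; it additionally needs <algorithm> for std::count_if):
#include <algorithm>

// Count the surviving points first, so the copy makes exactly one allocation.
size_t kept = std::count_if(mask.begin(), mask.end(),
                            [](int m) { return m != 0; });

std::vector<cv::Point3f> worldPointsSubset;
worldPointsSubset.reserve(kept);
for (size_t i = 0; i < mask.size(); i++) {
    if (mask[i] != 0) {
        worldPointsSubset.push_back(worldPoints[i]);
    }
}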
There is no way to convert the data from a std::vector<int> to a std::vector<cv::Point3f> without a copy: even if you see the same values, the size of the underlying data may differ.
But you can change the type of data you work with (std::vector<int> to std::vector<cv::Point3f>): work directly with cv::Point3f and, when needed, pass the vector to solvePnPRansac().
I have a function which requires me to pass a fairly large matrix (which I created using Eigen) - and ranges from dimensions 200x200 -> 1000x1000. The function is more complex than this, but the bare bones of it are:
#include <Eigen/Dense>

using Eigen::MatrixXi;

int main()
{
    MatrixXi mIndices = MatrixXi::Zero(1000, 1000);
    MatrixXi* pMatrix = &mIndices;
    MatrixXi mTest;
    for (int i = 0; i < 10000; i++)
    {
        mTest = pMatrix[0]; // dereferences the pointer and copies the whole matrix
        // Then do stuff to the copy
    }
}
Is the reason it takes much longer to run with a larger matrix that it takes longer to find available space in RAM for the array when I assign it to mTest? When I switch to a sparse matrix, this seems to be quite a lot quicker.
If I need to pass around large matrices, and I want to minimise the incremental effect of matrix size on runtime, what is best practice here? At the moment, the same program runs slower in C++ than it does in MATLAB, and obviously I would like to speed it up!
Best,
Ben
In the code you show, you are copying a 1,000,000-element matrix 10,000 times. The assignment in the loop creates a copy.
Generally if you're passing an Eigen matrix to another function, it can be beneficial to accept the argument by reference.
It's not really clear from your code what you're trying to achieve, however.
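As a sketch of that by-reference style (sumDiagonal is a hypothetical stand-in for whatever your real function does with the matrix):
#include <Eigen/Dense>

using Eigen::MatrixXi;

// Taking the matrix by const reference: no 1,000,000-element copy per call.
long long sumDiagonal(const MatrixXi& m)
{
    long long sum = 0;
    for (Eigen::Index i = 0; i < m.rows() && i < m.cols(); i++)
        sum += m(i, i);
    return sum;
}

int main()
{
    MatrixXi mIndices = MatrixXi::Zero(1000, 1000);
    for (int i = 0; i < 10000; i++)
    {
        long long s = sumDiagonal(mIndices); // reads mIndices in place
        (void)s; // placeholder: do stuff with s
    }
}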
I'm in the mood for some premature optimization and was wondering about the following.
If one has a for-loop, and inside that loop there is a call to a function that returns a container, say a vector, whose value is captured into a loop-local variable using move semantics, for instance:
std::vector<any_type> function(int i)
{
    std::vector<any_type> output(3);
    output[0] = i;
    output[1] = i * 2;
    output[2] = i - 3;
    return output;
}
int main()
{
    for (int i = 0; i < 10; ++i)
    {
        // stuff
        auto value = function(i);
        // do stuff with value ...
        // ... but in such a way that it can be discarded in the next iteration
    }
}
How do compilers handle this memory-wise when move semantics are applied (and the function is not inlined)? I would imagine that the most efficient thing to do is to allocate a single piece of memory for all the values, both inside the function and outside in the for-loop, and have it overwritten in each iteration.
I am mainly interested in this, because in my real-life application the vectors I'm creating are a lot larger than in the example given here. I am concerned that if I use functions like this, the allocation and destruction process will take up a lot of useless time, because I already know that I'm going to use that fixed amount of memory a lot of times. So, what I'm actually asking is whether there's some way that compilers would optimize to something of this form:
void function(int i, std::vector<any_type> &output)
{
    // fill output
}

int main()
{
    std::vector<any_type> dummy; // allocate memory only once
    for (int i = 0; i < 10; ++i)
    {
        // stuff
        function(i, dummy);
        // do stuff with dummy
    }
}
In particular I'm interested in the GCC implementation, but would also like to know what, say, the Intel compiler does.
Here, the most predictable optimization is RVO. When a function returns an object that is used to initialize a new variable, the compiler can elide the additional copy and move, and construct directly in the destination (this means a program can contain two versions of the function, depending on the use case).
Here, you will still pay for allocating and destroying the vector's buffer at each loop iteration. If that is unacceptable, you will have to rely on another solution: for example std::array, since your function seems to use a fixed size, or moving the vector out of the loop and reusing it. A sketch of the std::array alternative follows.
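A minimal sketch of that std::array alternative (assuming int for any_type; with the size fixed at three, the result lives on the stack and there is no heap allocation per call):
#include <array>

std::array<int, 3> function(int i)
{
    return { i, i * 2, i - 3 };
}

int main()
{
    for (int i = 0; i < 10; ++i)
    {
        auto value = function(i); // three ints copied, nothing allocated
        // do stuff with value
    }
}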
"I would imagine that the most efficient thing to do is to allocate a single piece of memory for all the values, both inside the function and outside in the for-loop, that will get overwritten in each iteration."
I don't think that any of the current compilers can do that. (I would be stunned to see that.) If you want to get insights, watch Chandler Carruth's talk.
If you need this kind of optimization, you need to do it yourself: allocate the vector outside the loop and pass it by non-const reference to function() as an argument. Of course, don't forget to call clear() when you are done, or call clear() first inside function().
All this has nothing to do with move semantics; nothing has changed with C++11 in this respect.
If your loop is a busy loop, then allocating a container in each iteration can cost you a lot. It's easier to find yourself in such a situation than you would probably expect. Andrei Alexandrescu presents an example in his talk Writing Quick Code in C++, Quickly. The surprising thing is that doing unnecessary heap allocations in a tight loop like the one in his example can be slower than the actual file IO. I was surprised to see that. By the way, the container was std::string.
Dear all, I have implemented some functions and would like to ask something basic, as I do not have sound fundamental knowledge of C++. I hope you would all be kind enough to tell me what the good way is, so that I can learn from you. (Please note, this is not homework and I do not have any experts around me to ask.)
What I did is: I read the input x, y, z point data (around a 3 GB data set) from a file, then compute one single value for each point and store it inside a vector (result). That vector is then used in the next loop. After that, the vector is not needed anymore, and I want to reclaim its memory, as it holds a huge data set. I think I can do this in two ways.
(1) By just declaring a vector and later clearing it (see code-1). (2) By allocating dynamic memory and later de-allocating it (see code-2). I heard this de-allocation is inefficient, as de-allocation itself costs memory, but maybe I misunderstood.
Q1)
I would like to know which would be the better way in terms of memory and efficiency.
Q2)
Also, I would like to know whether returning output from a function by reference is a good approach. (Please look at code-3.)
code-1
int main(){
    //read input data (my_data)
    vector<double> result;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
        // do some stuff and calculate a "double" value (say value)
        // using each point coordinate
        result.push_back(value);
        // do some other stuff
    }
    //loop over result and use each value for some other stuff
    for (size_t i=0; i<result.size(); i++){
        //do some stuff
    }
    //result will not be used anymore, so erase its data
    result.clear();
}
code-2
int main(){
    //read input data
    vector<double>* result = new vector<double>;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
        // do some stuff and calculate a "double" value (say value)
        // using each point coordinate
        result->push_back(value);
        // do some other stuff
    }
    //loop over result and use each value for some other stuff
    for (size_t i=0; i<result->size(); i++){
        //do some stuff
    }
    //de-allocate memory
    delete result;
    result = 0;
}
code-3
vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
{
    vector<Position3D> *points_at_grid_cutting = new vector<Position3D>;
    vector<Position3D>::iterator point;
    for (point=begin(); point!=end(); point++) {
        //do some stuff
    }
    return (*points_at_grid_cutting);
}
For such huge data sets I would avoid using std containers at all and make use of memory mapped files.
If you prefer to stay with std::vector, note that clear() alone does not release the allocated memory; use the swap idiom, std::vector<double>().swap(result), to actually free it.
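As a rough illustration of the memory-mapped-file suggestion (a POSIX-only sketch; the file name points.bin and the raw binary Point layout are assumptions, not something from the question):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iostream>

struct Point { double x, y, z; }; // stand-in for the real point layout

int main()
{
    int fd = open("points.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);

    // Map the whole file: the OS pages data in on demand instead of
    // copying 3 GB into a std::vector up front.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) return 1;

    const Point* points = static_cast<const Point*>(base);
    size_t n = st.st_size / sizeof(Point);

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += points[i].x; // placeholder per-point computation

    munmap(base, st.st_size);
    close(fd);
    std::cout << sum << '\n';
}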
clear() will not free the memory used by the vector. It reduces the size but not the capacity, so the vector still holds enough memory for all those doubles.
The best way to make the memory available again is like your code-1, but let the vector go out of scope:
int main() {
    {
        vector<double> result;
        // populate result
        // use results for something
    }
    // do something else - the memory for the vector has been freed
}
Failing that, the idiomatic way to clear a vector and free the memory is:
vector<double>().swap(result);
This creates an empty temporary vector, then it exchanges the contents of that with result (so result is empty and has a small capacity, while the temporary has all the data and the large capacity). Finally, it destroys the temporary, taking the large buffer with it.
Regarding code-3: it's not good style to return a dynamically-allocated object by reference, since it doesn't give the caller much of a reminder that they are responsible for freeing it. Often the best thing to do is return a local variable by value:
vector<Position3D> ReturnLabel(VoxelGrid grid, int segment) const
{
    vector<Position3D> points_at_grid_cutting;
    // do whatever to populate the vector
    return points_at_grid_cutting;
}
The reason is that, provided the caller uses a call to this function as the initializer for their own vector, something called "named return value optimization" (NRVO) kicks in and ensures that, although you're returning by value, no copy of the value is made.
A compiler that doesn't implement NRVO is a bad compiler, and will probably have all sorts of other surprising performance failures, but there are some cases where NRVO doesn't apply - most importantly when the value is assigned to a variable by the caller instead of used in initialization. There are three fixes for this:
1) C++11 introduces move semantics, which basically sort it out by ensuring that assignment from a temporary is cheap.
2) In C++03, the caller can play a trick called "swaptimization". Instead of:
vector<Position3D> foo;
// some other use of foo
foo = ReturnLabel();
write:
vector<Position3D> foo;
// some other use of foo
ReturnLabel().swap(foo);
3) You write a function with a more complicated signature, such as taking a vector by non-const reference and filling the values into it, or taking an OutputIterator as a template parameter. The latter also gives the caller more flexibility: they need not use a vector to store the results, they could use some other container, or even process them one at a time without storing the whole lot at once. A sketch of the OutputIterator version follows.
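A minimal sketch of that OutputIterator version (Position3D is stubbed out here, and the selection test is left as a placeholder):
#include <iterator>
#include <vector>

struct Position3D { double x, y, z; }; // stand-in for the real type

// The caller decides where the selected points go: a vector, another
// container, or one-at-a-time processing through a custom iterator.
template <typename OutputIt>
void ReturnLabel(const std::vector<Position3D>& points, OutputIt out)
{
    for (const Position3D& p : points)
    {
        // test whether p lies at the grid cutting, then emit it
        *out++ = p;
    }
}

int main()
{
    std::vector<Position3D> points(10);
    std::vector<Position3D> selected;
    ReturnLabel(points, std::back_inserter(selected)); // appends into selected
}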
Your code suggests that the value computed in the first loop is only used context-insensitively in the second loop. In other words, once you have computed the double value in the first loop, you could act on it immediately, without any need to store all values at once.
If that's the case, you should implement it that way. No worries about large allocations, storage or anything. Better cache performance. Happiness.
vector<double> result;
for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
    // do some stuff and calculate a "double" value (say value)
    // using each point coordinate
    result.push_back(value);
If the "result" vector will end up having thousands of values, this will result in many reallocations. It would be best if you initialize it with a large enough capacity to store, or use the reserve function :
vector<double) result (someSuitableNumber,0.0);
This will reduce the number of reallocation, and possible optimize your code further.
Also, instead of writing: vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
I would write it like this:
void vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment, vector<Position3D> &myVec_out) const // myVec_out is populated inside the function
Your idea of returning a reference is correct, since you want to avoid copying.
Destructors in C++ must not fail; therefore deallocation does not allocate memory, because memory can't be allocated with the no-throw guarantee.
Apart from that: instead of looping multiple times, it is probably better to do the operations in an integrated manner, i.e. instead of loading the whole data set and then reducing the whole data set, read the points in one by one and apply the reduction directly. That is, instead of
load_my_data()
for_each (p : my_data)
    result.push_back(p)
for_each (p : result)
    reduction.push_back(reduce(p))
Just do
file f ("file")
while (f)
Point p = read_point (f)
reduction.push_back (reduce (p))
If you don't need to store those reductions, simply output them sequentially:
file f ("file")
while (f)
    Point p = read_point(f)
    cout << reduce(p)
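In real C++, that streaming version might look like this (a sketch: the Point layout, the whitespace-separated text format, and the reduction are all assumptions):
#include <fstream>
#include <iostream>

struct Point { double x, y, z; };

double reduce(const Point &p)
{
    return p.x + p.y + p.z; // placeholder reduction
}

int main()
{
    std::ifstream f("file");
    Point p;
    // Read one point at a time; the full data set is never held in memory.
    while (f >> p.x >> p.y >> p.z)
        std::cout << reduce(p) << '\n';
}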
code-1 will work fine and is almost the same as code-2, with no major advantages or disadvantages.
code-3: somebody else should answer that, but I believe the difference between a pointer and a reference in this case would be marginal. I do prefer pointers, though.
That being said, I think you might be approaching the optimization from the wrong angle. Do you really need all points to compute the output of a point in your first loop? Or can you rewrite your algorithm to read only one point, compute the value as you would in your first loop, and then use it immediately the way you want to? Maybe not with single points, but with batches of points. That could potentially cut back your memory requirements quite a bit, with only a small increase in processing time.
I'm wondering if this code:
int main(){
    int p;
    for(int i = 0; i < 10; i++){
        p = ...;
    }
    return 0;
}
is exactly the same as this one
int main(){
    for(int i = 0; i < 10; i++){
        int p = ...;
    }
    return 0;
}
in terms of efficiency?
I mean, will the p variable be recreated 10 times in the second example?
It is the same in terms of efficiency.
It's not the same in terms of readability. The second is better in this aspect, isn't it?
It's a semantic difference which the code keeps hidden, because it makes no difference for int, but it does make a difference to the human reader. Do you want to carry the value of whatever calculation you do in ... outside of the loop? You don't, so you should write code that reflects your intention.
A human reader will have to search through the function and look for other uses of p to confirm that what you did was just premature "optimization" with no deeper purpose.
Assuming it makes a difference for the type you use, you can help the human reader by commenting your code:
/* p is only used inside the for-loop, to keep it from reallocating */
std::vector<int> p;
p.reserve(10);

for(int i = 0; i < 10; i++){
    p.clear();
    /* ... */
}
In this case, it's the same. Use the smallest scope possible for the most readable code.
If int were a class with a significant constructor and destructor, then the first (declaring it outside the loop) could be a significant saving - but inside the loop you usually need to recreate the state anyway, so oftentimes it ends up being no saving at all.
One instance where it might make a difference is containers. A string or vector uses internal storage that grows to fit the size of the data it is storing. You may not want to reconstruct this container on each pass through the loop; instead, just clear its contents, and it may not need as many reallocations inside the loop. This can (in some cases) result in a significant performance improvement.
The bottom line is: write it clearly, and if profiling shows it matters, move it out :)
They are equal in terms of efficiency - you should trust your compiler to get rid of the immeasurably small difference. The second is better design.
Edit: This isn't necessarily true for custom types, especially those that deal with memory. If you were writing a loop for any T, I'd use the first form just in case. But if you know it's a built-in type, like int, pointer, char, float, or bool, I'd go for the second.
In the second example, p is visible only inside the for loop; you cannot use it further in your code.
In terms of efficiency, they are equal.