Vector move constructor slower than copy constructor - c++

I'm working on my first C++ project, which is a CSV parser (full source code here). It's at the point where it's working, and now I want to do basic refactoring / improve performance.
Currently the way the parser works is by returning each row as a std::vector<std::string>, and I figured that instead of allocating a new vector and a new string every time I'd just have an internal vector and internal string with reserved memory that I'd clear again and again.
That worked, and I started looking at other places where I might be doing memory allocation, and I saw this function which copies the internal vector, and then clears it:
auto add_row() -> std::vector<std::string> {
    auto row(m_bufvec);
    m_bufvec.clear();
    return row;
}
I figured that if I instead changed this line
auto row(m_bufvec);
to
auto row(std::move(m_bufvec));
It'd result in some sort of speed boost because according to http://en.cppreference.com/w/cpp/container/vector/vector it would take constant time instead of linear. To my surprise, it made the parser significantly slower (according to my really rough benchmark of running time ./main.o over this file).
I'm completely new to optimization, benchmarking, and everything else that comes with tuning C++ code. Perhaps this optimization is useless even if it worked, but regardless, I'm curious as to why std::move causes a slowdown. Am I missing something?

When you copy bufvec, its capacity is unchanged, but when you move it, its capacity is taken away. Thus, later when you fill bufvec again, a logarithmic number of allocations (and, in total, a linear number of element copies/moves) are done to grow its capacity back, and those allocations can easily be your performance bottleneck.
The move version makes that function faster. But it makes other code slower. Micro optimizations do not reliably make programs faster.
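A minimal sketch of the difference (the state of a moved-from vector is formally unspecified, but mainstream implementations leave its capacity at zero):
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> buf;
    buf.reserve(100);

    auto copied(buf);                         // copy: buf keeps its capacity
    std::cout << buf.capacity() << '\n';      // still 100

    auto moved(std::move(buf));               // move: buf's storage is stolen
    std::cout << buf.capacity() << '\n';      // typically 0; refilling buf reallocates
}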
Edit by OP:
The solution proposed by Cheers and hth. - Alf in the comments, calling m_bufvec.reserve(row.size()) after the move, fixes the problem and confirms that the above reasoning was correct. Moreover it is (albeit only slightly) more efficient, because
you avoid copying the items [in bufvec]. If the items are simple integer values, that doesn't matter so much. If the items are e.g. strings, with dynamic allocation, then it really does matter.

Indeed the first version is expected to be faster. The reason is:
auto row(m_bufvec);
invokes the copy constructor, which allocates all the memory row needs in a single allocation, while m_bufvec keeps its own allocated memory. As a result, per-element allocations are avoided, which matters because every re-allocation also has to relocate the existing elements.
In the second version, auto row(std::move(m_bufvec));, m_bufvec's memory becomes owned by row; this operation is faster than the copy constructor. But since m_bufvec has lost its allocated memory, when you later fill it element by element it will do many re-allocations and (expensive) relocations. The number of re-allocations is usually logarithmic in the final size of the vector.
EDIT
The above explains the "unexpected" results in the main question. Finally, it turns out that the "ideal" for this operation is to move then reserve immediately:
auto row(std::move(m_bufvec));
m_bufvec.reserve(row.size());
return row;
This achieves the three goals:
no element-by-element allocation
no useless initialization for bufvec
no useless copying of elements from m_bufvec into row.
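Putting it together, a minimal sketch of add_row with this approach:
auto add_row() -> std::vector<std::string> {
    auto row(std::move(m_bufvec));    // steal the buffer: no element copies
    m_bufvec.reserve(row.size());     // restore the capacity with a single allocation
    return row;
}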

Related

Vector re-declaration versus insertions in looping operations - C++

I have the option to either create and destroy a vector on every call to func() and push elements in each iteration, as shown in Example A, or fix the initialization and only overwrite old values in each iteration, as shown in Example B.
Example A:
void func ()
{
    std::vector<double> my_vec(5, 0.0);
    for ( int i = 0; i < my_vec.size(); i++) {
        my_vec.push_back(i);
        // do something
    }
}

while (condition) {
    func();
}
Example B:
void func (std::vector<double>& my_vec)
{
    for ( int i = 0; i < my_vec.size(); i++) {
        my_vec[i] = i;
        // do something
    }
}

while (condition) {
    std::vector<double> my_vec(5, 0.0);
    func(my_vec);
}
Which of the two would be less computationally expensive? The size of the array won't be more than 10.
I still suspect that the question that was asked is not the question that was intended, but it occurred to me that the main point of my answer would likely not change. If the question gets updated, I can always edit this answer to match (or delete it, if it turns out to be inapplicable).
De-prioritize optimizations
There are various factors that should affect how you write your code. Among the desirable goals are space optimization, time optimization, data encapsulation, logic encapsulation, readability, robustness, and correct functionality. Ideally, all of these goals would be achievable in every piece of code, but that is not particularly realistic. Much more likely is a situation where one or more of these goals must be sacrificed in favor of the others. When that happens, optimizations should typically yield to everything else.
That is not to say that optimizations should be ignored. There are plenty of optimizations that rarely obstruct the higher-priority goals. These range from the small, such as passing by const reference instead of by value, to the large, such as choosing the logarithmic algorithm instead of the exponential one. However, the optimizations that do interfere with the other goals should be postponed until after your code is reasonably complete and functioning correctly. At that point, a profiler should be used to determine where the bottlenecks actually are. Those bottlenecks are the only places where other goals should yield to optimizations, and only if the profiler confirms that the optimizations achieved their goals.
For the question being asked, this means that the main concern should not be computational expense, but encapsulation. Why should the caller of func() need to allocate space for func() to work with? It should not, unless a profiler identified this as a performance bottleneck. And if a profiler did that, it would be much easier (and more reliable!) to ask the profiler if the change helps than to ask Stack Overflow.
Why?
I can think of two major reasons to de-prioritize optimizations. First, the "sniff test" is unreliable. While there might be a few people who can identify bottlenecks by looking at code, there are many, many more who merely think they can. Second, that's why we have optimizing compilers. It is not unheard of for someone to come up with this super-clever optimization trick only to discover that the compiler was already doing it. Keep your code clean and let the compiler handle the routine optimizations. Only step in when the task demonstrably exceeds the compiler's capabilities.
See also: premature-optimization
Choosing an optimization
OK, suppose the profiler did identify construction of this small, 10-element array as a bottleneck. The next step is to test an alternative, right? Almost. First you need an alternative, and I would consider a review of the theoretical benefits of various alternatives to be useful. Just keep in mind that this is theoretical and that the profiler gets the final say. So I'll go into the pros and cons of the alternatives from the question, as well as some other alternatives that might bear consideration. Let's start from the worst options, working our way to the better ones.
Example A
In Example A, a vector is created with 5 elements, then elements are pushed onto the vector until i meets or exceeds the vector's size. Seeing how i and the vector's size are both increased by one each iteration (and i starts smaller than the size), this loop will run until the vector grows large enough to crash the program. That means probably billions of iterations (despite the question's claim that the size will not exceed 10).
Easily the most computationally expensive option. Don't do this.
Example B
In example B, a vector is created for each iteration of the outer while loop, which is then accessed by reference from within func(). The performance cons here include passing a parameter to func() and having func() access the vector indirectly through a reference. There are no performance pros as this does everything the baseline (see below) would do, plus some extra steps.
Even though a compiler might be able to compensate for the cons, I see no reason to try this approach.
Baseline
The baseline I'm using is a fix to Example A's infinite loop. Specifically, replace "my_vec.push_back(i);" with Example B's "my_vec[i] = i;". This simple approach is along the lines of what I would expect for the initial assessment by the profiler. If you cannot beat simple, stick with it.
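For concreteness, the baseline spelled out (the loop body beyond the assignment remains a placeholder):
void func ()
{
    std::vector<double> my_vec(5, 0.0);
    for ( int i = 0; i < my_vec.size(); i++) {
        my_vec[i] = i;
        // do something
    }
}

while (condition) {
    func();
}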
Example B*
The text of the question presents an inaccurate assessment of Example B. Interestingly, the assessment describes an approach that has the potential to improve on the baseline. To get code that matches the textual description, move Example B's "std::vector<double> my_vec(5, 0.0);" to the line immediately before the while statement. This has the effect of constructing the vector only once, rather than constructing it with each iteration.
The cons of this approach are the same as those of Example B as originally coded. However, we now pick up a gain in that the vector's constructor is called only once. If construction is more expensive than the indirection costs, the result should be a net improvement once the while loop iterates often enough. (Beware these conditions: that's a significant "if" and there is no a priori guess as to how many iterations is "enough".) It would be reasonable to try this and see what the profiler says.
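For reference, a sketch of Example B* as described, with construction hoisted out of the loop:
void func (std::vector<double>& my_vec)
{
    for ( int i = 0; i < my_vec.size(); i++) {
        my_vec[i] = i;
        // do something
    }
}

std::vector<double> my_vec(5, 0.0);   // constructed once, before the while loop
while (condition) {
    func(my_vec);
}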
Get some static
A variant on Example B* that helps preserve encapsulation is to use the baseline (the fixed Example A), but precede the declaration of the vector with the keyword static. This brings in the benefit of constructing the vector only once, but without the overhead associated with making the vector a parameter. In fact, the benefit could be greater than in Example B* since construction happens only once per program execution, rather than each time the while loop is started. The more times the while loop is started, the greater this benefit.
The main con here is that the vector will occupy memory throughout the program's execution. Unlike Example B*, it will not release its memory when the block containing the while loop ends. Using this approach in too many places would lead to memory bloat. So while it is reasonable to profile this approach, you might want to consider other options. (Of course if the profiler calls this out as the bottleneck, dwarfing all others, the cost is small enough to pay.)
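A sketch of that static variant, built on the baseline:
void func ()
{
    static std::vector<double> my_vec(5, 0.0);   // constructed once per program run
    for ( int i = 0; i < my_vec.size(); i++) {
        my_vec[i] = i;
        // do something
    }
}

while (condition) {
    func();
}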
Fix the size
My personal choice for what optimization to try here would be to start from the baseline and switch the vector to std::array<double, 10>. My main motivation is that the needed size won't be more than 10. Also relevant is that the construction of a double is trivial. Construction of the array should be on par with declaring 10 variables of type double, which I would expect to be negligible. So no need for fancy optimization tricks. Just let the compiler do its thing.
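A sketch of that variant (zero-initialized, same placeholder body as before):
#include <array>
#include <cstddef>

void func ()
{
    std::array<double, 10> my_vec{};   // stack storage, no heap allocation
    for (std::size_t i = 0; i < my_vec.size(); i++) {
        my_vec[i] = i;
        // do something
    }
}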
The expected possible benefit of this approach is that a vector allocates space on the heap for its storage, which has an overhead cost. The local array would not have this cost. However, this is only a possible benefit. A vector implementation might already take advantage of this performance consideration for small vectors. (Maybe it does not use the heap until the capacity needs to exceed some magic number, perhaps more than 10.) I would refer you back to earlier when I mentioned "super-clever" and "compiler was already doing it".
I'd run this through the profiler. If there's no benefit, there is likely no benefit from the other approaches. Give them a try, sure, since they're easy enough, but it would probably be a better use of your time to look at other aspects to optimize.

Faster alternative to push_back(size is known)

I have a float vector. As I process certain data, I push it back. I always know what the size will be while declaring the vector.
For the largest case, it is 172,490,752 floats. This takes about eleven seconds just to push_back everything.
Is there a faster alternative, like a different data structure or something?
If you know the final size, then reserve() that size after you declare the vector. That way it only has to allocate memory once.
Also, you may experiment with using emplace_back() although I doubt it will make any difference for a vector of float. But try it and benchmark it (with an optimized build of course - you are using an optimized build - right?).
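A minimal sketch of that approach (compute_value is just a placeholder for whatever produces each float):
std::vector<float> values;
values.reserve(172490752);                 // single up-front allocation

for (std::size_t i = 0; i < 172490752; ++i)
    values.push_back(compute_value(i));    // no reallocation happens during the loop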
The usual way of speeding up a vector when you know the size beforehand is to call reserve on it before using push_back. This eliminates the overhead of reallocating memory and copying the data every time the previous capacity is filled.
Sometimes for very demanding applications this won't be enough. Even though push_back won't reallocate, it still needs to check the capacity every time. There's no way to know how bad this is without benchmarking, since modern processors are amazingly efficient when a branch is always/never taken.
You could try resize instead of reserve and use array indexing, but resize value-initializes (zeroes) every new element; this is a waste if you know you're going to set a new value into every element anyway.
An alternative would be to use std::unique_ptr<float[]> and allocate the storage yourself.
Another option is ::boost::container::stable_vector. Notice that allocating a contiguous block of roughly 172 million * 4 bytes (about 690 MB) might easily fail and requires quite a lot of page juggling. A stable_vector is essentially a list of smaller vectors or arrays of reasonable size. You may also want to populate it in parallel.
You could use a custom allocator which avoids default initialisation of all elements, as discussed in this answer, in conjunction with ordinary element access:
const size_t N = 172490752;
std::vector<float, uninitialised_allocator<float> > vec(N);
for(size_t i = 0; i != N; ++i)
    vec[i] = the_value_for(i);
This avoids (i) default initializing all elements, (ii) checking for capacity at every push, and (iii) reallocation, but at the same time preserves all the convenience of using std::vector (rather than std::unique_ptr<float[]>). However, the allocator template parameter is unusual, so you will need to use generic code rather than std::vector-specific code.
I have two answers for you:
As previous answers have pointed out, using reserve to allocate the storage beforehand can be quite helpful, but:
push_back (or emplace_back) itself has a performance penalty, because on every call it has to check whether the vector needs to be reallocated. If you already know the number of elements you will insert, you can avoid this penalty by setting the elements directly using the access operator [].
So the most efficient way I would recommend is:
Initialize the vector with the 'fill'-constructor:
std::vector<float> values(172490752, 0.0f);
Set the entries directly using the access operator:
values[i] = some_float;
++i;
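Putting both steps together (a minimal sketch; some_float is still just a placeholder for the value you compute per element):
std::vector<float> values(172490752, 0.0f);    // one allocation, all elements zeroed

for (std::size_t i = 0; i < values.size(); ++i)
    values[i] = some_float;                    // no capacity check, no reallocation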
The reason push_back is slow is that it will need to copy all the data several times as the vector grows, and even when it doesn’t need to copy data it needs to check. Vectors grow quickly enough that this doesn’t happen often, but it still does happen. A rough rule of thumb is that every element will need to be copied on average once or twice; the earlier elements will need to be copied a lot more, but almost half the elements won’t need to be copied at all.
You can avoid the copying, but not the checks, by calling reserve on the vector when you create it, ensuring it has enough space. You can avoid both the copying and the checks by creating it with the right size from the beginning, by giving the number of elements to the vector constructor, and then inserting using indexing as Tobias suggested; unfortunately, this also goes through the vector an extra time initializing everything.
If you know the number of floats at compile time and not just runtime, you could use an std::array, which avoids all these problems. If you only know the number at runtime, I would second Mark’s suggestion to go with std::unique_ptr<float[]>. You would create it with
size_t size = /* Number of floats */;
auto floats = std::unique_ptr<float[]>{new float[size]};
You don’t need to do anything special to delete this; when it goes out of scope it will free the memory. In most respects you can use it like a vector, but it won’t automatically resize.
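For completeness, filling it looks just like filling a plain array (the_value_for is a placeholder for your per-element computation):
for (size_t i = 0; i < size; ++i)
    floats[i] = the_value_for(i);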

Is returning elements slower than sending them by reference and modifying there?

Suppose I have a function that produces a big structure (in this case, a huge std::vector), and a loop that calls it repeatedly:
std::vector<int> render(int w, int h, int time){
    std::vector<int> result;
    /* heavyweight drawing procedures */
    return result;
}

while(loop){
    std::vector<int> image = render(800,600,time);
    /* send image to graphics card */
    /* ... */
}
My question is: in cases like that, is GCC/Clang smart enough to avoid allocating memory for that huge 800x600x4 array on every iteration? In other words, does this code perform similarly to:
void render(int w, int h, int time, std::vector<int>& image){ /*...*/ }

std::vector<int> image;
while(loop){
    render(800,600,time,image);
    /* ... */
}
Why the question: I'm making a compiler from a language to C++ and I have to decide which way to go, that is, whether to compile it like the first example or like the last one. The first one would be trivial; the last one would need some tricky coding, but could be worth it if it is considerably faster.
Returning all but the most trivial of objects by value will be slower 99% of the time. The amount of work to construct a copy of the entire std::vector<int>, if the length of the vector is unbounded, will be substantial. Also, this is a good way to potentially overflow your stack if, say, your vector ends up with 1,000,000 elements in it. In your first example, the image vector will also be copy-constructed and destructed each time through your loop. You can always compile your code with the -pg option to turn on gprof data and check your results.
The biggest problem is not the allocation of memory, it's the copying of the entire vector that happens at return. So the second option is much better. In your second example you are also re-using the same vector, which will not allocate memory for each iteration (unless you do image.swap(smth) at some point).
The compiler can help with copy elision, but this is not the major issue here. You could also eliminate that copy explicitly by inlining that function (you may read about rvalue references and move semantics for additional info).
The actual problem might not be solved by the compiler. Even though just one vector instance exists at a time, there would always be the overhead of properly allocating and deallocating the heap memory of that temporary vector on construction and destruction. How this performs would then solely depend upon the underlying allocator implementation (std::allocator, new, malloc(), ...) of the standard library. The allocator could be smart and preserve that memory for quick reuse, but maybe it is not (apart from the fact that you could replace the vector's allocator with a custom, smart one). Furthermore, it also depends on the allocated memory size, available physical memory and the OS. Large blocks (relative to total memory) would be returned to the OS earlier. Linux can overcommit (give out more memory than is actually available), but since the vector implementation, and your renderer respectively, initializes (uses) all of that memory by default, overcommit is of no use here.
--> go for 2.

Variable sized char array with minimizing calls to new?

I need a char array that will dynamically change in size. I do not know how big it can get, so preallocating is not an option. One time it might never get bigger than 20 bytes; the next time it may get up to 5 KB...
I want the allocation to behave like a std::vector's.
I thought of using a std::vector<char>, but all those push_backs seem like they waste time:
strVec.clear();
for(size_t i = 0; i < varLen; ++i)
{
    strVec.push_back(0);
}
Is this the best I can do or is there a way to add a bunch of items to a vector at once? Or maybe a better way to do this.
Thanks
std::vector doesn't allocate memory every time you call push_back, but only when the size becomes bigger than the capacity.
First, don't optimize until you've profiled your code and determined that there is a bottleneck. Consider the costs to readability, accessibility, and maintainability by doing something clever. Make sure any plan you take won't preclude you from working with Unicode in future. Still here? Alright.
As others have mentioned, vectors reserve more memory than they use initially, and push_back usually is very cheap.
There are cases when using push_back reallocates memory more than is necessary, however. For example, one million calls to myvector.push_back() might trigger 10 or 20 reallocations of myvector. On the other hand, inserting a whole range at the end of the vector will cause at most 1 reallocation of myvector*. I generally prefer the insertion idiom to the reserve / push_back idiom for both speed and readability reasons.
myvector.insert(myvector.end(), inputBegin, inputEnd)
If you do not know the size of your string in advance and cannot tolerate the hiccups caused by reallocations, perhaps because of hard real-time constraints, then maybe you should use a linked list. A linked list will have consistent performance at the price of much worse average performance.
If all of this isn't enough for your purposes, consider other data structures such as a rope or post back with more specifics about your case.
From Scott Meyers' Effective STL, IIRC
You can use the resize member function to add a bunch. However, I would not expect that push_back would be slow, especially if the vector's internal capacity is already non-trivial.
Is this the best I can do or is there a way to add a bunch of items to a vector at once? Or maybe a better way to do this.
push_back isn't very slow, it just compares the size to the current capacity and reallocates if necessary. The comparison may work out to essentially zero time because of branch prediction and superscalar execution on the CPU. The reallocation is performed O(log N) times, so the vector uses up to twice as much memory as needed but time spent on reallocation seldom adds up to anything.
To insert several items at once, use insert. There are a few overloads, the only trick is that you need to explicitly pass end.
my_vec.insert( my_vec.end(), num_to_add, initial_value );
my_vec.insert( my_vec.end(), first, last ); // iterators or pointers
For the second form, you could put the values in an array first and then copy the array to the end of the vector. But this might add as much complexity as it removes. That's how it goes with micro-optimization. Only attempt to optimize if you know there's a measurable gain to be had.

Creation of a template class creates major bottleneck

I am trying to write a scientific graph library, it works but I have some performance problems. When creating a graph I use a template class for the nodes and do something like
for(unsigned int i = 0; i < l_NodeCount; ++i)
    m_NodeList.push_back(Node<T>(m_NodeCounter++));
Even though in the constructor of the node class almost nothing happens (a few variables are assigned), this part is a major bottleneck of my program (when I use over a million nodes); especially in debug mode it becomes too inefficient to run at all.
Is there a better way to simultaneously create all those template class instances without having to call the constructor each time, or do I have to rewrite it without templates?
If the constructor does almost nothing, as you say, the bottleneck is most likely the allocation of new memory. The vector grows dynamically, and each time its memory is exhausted, it will reserve new memory and copy all data there. When adding a large number of objects, this can happen very frequently and become very expensive. This can be avoided by calling
m_NodeList.reserve(l_NodeCount);
With this call, the vector will allocate enough memory to hold l_NodeCount objects, and you will not have any expensive reallocations when bulk-adding the elements.
There are two things that happen in your code:
as you add elements to the vector, it occasionally has to resize the internal array, which involves copying all existing elements to the new array
the constructor is called for each element
The constructor call is unavoidable. You create a million elements, you have a million constructor calls. What you can change is what the constructor does.
Adding elements is obviously unavoidable too, but the copying/resizing can be avoided. Call reserve on the vector initially, to reserve enough space for all your nodes.
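A minimal sketch of that change (assuming Node<T> is constructible from the counter value, as in the question; emplace_back additionally constructs each node in place instead of copying a temporary):
m_NodeList.reserve(l_NodeCount);   // one allocation, no re-allocations while filling

for(unsigned int i = 0; i < l_NodeCount; ++i)
    m_NodeList.emplace_back(m_NodeCounter++);   // constructs Node<T> directly inside the vector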
Depending on your compiler, optimization settings and other flags, the compiler may do a lot of unnecessary bounds checking and debug checks as well.
You can disable this for the compiler (_SECURE_SCL=0 on VS2005/2008, _ITERATOR_DEBUG_LEVEL=0 in VS2010. I believe it's off by default in GCC, and don't know about other compilers).
Alternatively, you can rewrite the loop to minimize the amount of debug checking that needs to be done. Using the standard library algorithms instead of a raw loop allows the library to skip most of the checks (typically, a bounds check will then be performed on the begin and the end iterator, and not on the intervening iterations, whereas on a plain loop, it'll be done every time an iterator is dereferenced)
I would say your bottleneck is not the template class, which has nothing to do with run time and is dealt with during compilation, but adding elements to the vector container (you have the tag "vector" in your question). You are performing a LOT of allocations using push_back. Try allocating the required total memory right away and then filling in the elements.
You can avoid the templates by keeping a list of void* pointers to the objects and casting them later.
But if you wish to have 1,000,000 instances of the node class, you will have to call the node constructor 1,000,000 times.