I have an array,
int a[size];
I want to set all the array elements to 1
Because some of the elements in the array are already set to 1, would it be better to check each element using a conditional statement, like
for (int index = 0; index < size; index++)
{
if (a[index] != 1)
a[index] = 1;
}
or to set all the elements no matter what? What would be the difference?
Your code has two paths in the loop, depending on each value:
Read from array, comparison, and branch
Read from array, comparison, and write
That's not worth it. Just write.
If you want, you can do the same by calling
std::fill(a, a + size, 1);
If the array is of type char instead of int, it will likely call memset. And platform-specific implementations of fill can offer the compiler optimization hints.
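As a sketch of both spellings (the explicit memset call only shows what the char case typically compiles down to; the size and array names are made up):
#include <algorithm>  // std::fill
#include <cstring>    // std::memset

void fill_examples() {
    const int size = 100;       // hypothetical size for the sketch
    int  a[size];
    char b[size];

    std::fill(a, a + size, 1);  // plain loop over ints, often vectorized
    std::fill(b, b + size, 1);  // for char, typically lowered to memset
    std::memset(b, 1, size);    // the equivalent call spelled out by hand
}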
Just set all the elements to 1. Code for simplicity and readability first. If you find that the code runs too slow, profile it to see where improvements need to be made (although I highly doubt performance problems can come from setting elements of an array of integers to a certain value).
I'm guessing you are just looking for understanding and not battling a real performance issue... this just wouldn't show up under measurement and here's why:
Normally, whenever a processor with a memory cache (i.e. most of today's desktop CPUs) has to write a value to memory, the cache line that contains the address must first be read from (relatively slow) RAM. The value is then modified by a CPU write to the cache, and the entire cache line is eventually written back to main RAM.
When you are performing operations over a range of contiguous addresses, like your array, the CPU will be able to perform several operations very quickly over one cache line before it is written back. It then moves on to the next cache line, which was previously fetched in anticipation.
Most likely performing the test before writing the value will not be measurably different than just writing for several reasons:
Branch prediction makes this process extremely efficient.
The compiler will have done some really powerful optimizations.
The memory transfer to cache RAM will be the real rate determining step.
So just write your code for clarity. Measure the difference if you are still curious.
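If you are curious enough to measure, a rough sketch along these lines is enough to compare the two loops (the size is arbitrary, and a serious benchmark would need warm-up runs and repetitions):
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t size = 50000000;           // arbitrary, large enough to time
    std::vector<int> a(size, 0);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; ++i)       // check, then write
        if (a[i] != 1) a[i] = 1;
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; ++i)       // always write
        a[i] = 1;
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "check-then-write: " << ms(t1 - t0).count() << " ms\n"
              << "always write:     " << ms(t2 - t1).count() << " ms\n";
}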
Use an std::vector instead.
#include <vector>
...
std::vector<int> a(10, 1);
// access elements just as you would with a C array
std::cout << "Second element is: " << a[1] << std::endl;
Now you have an array of 10 integers all set to 1. If you already have an initialised vector, i.e. a vector filled with values other than one, you can use fill, like this:
#include <algorithm>
...
std::fill(a.begin(), a.end(), 1);
I wouldn't expect there to be a noticeable difference unless size is a very large value. However, if you want the optimal variant, just setting all values to 1 is the more performant option: the conditional will take more time than a simple assignment, even when the assignment turns out not to be needed.
With C++11, you can use the range-based for to set all values:
int a[size];
for(auto &v: a) {
v = 1;
}
The &v iterates by reference, so the loop variable is assignable.
This format is a nice alternative to std::fill, and really comes into its own if the assignment is a more complicated expression.
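For example (the formula is invented purely to show a non-trivial right-hand side):
#include <cmath>

int main() {
    double samples[256];
    int n = 0;
    // each element gets a value computed from its position, something
    // std::fill cannot express but a range-based for handles naturally
    for (auto& v : samples) {
        v = std::sin(0.1 * n) + 0.5 * n;
        ++n;
    }
}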
I have a float vector. As I process certain data, I push it back. I always know what the size will be while declaring the vector.
For the largest case, it is 172,490,752 floats. This takes about eleven seconds just to push_back everything.
Is there a faster alternative, like a different data structure or something?
If you know the final size, then reserve() that size after you declare the vector. That way it only has to allocate memory once.
Also, you may experiment with using emplace_back() although I doubt it will make any difference for a vector of float. But try it and benchmark it (with an optimized build of course - you are using an optimized build - right?).
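A minimal sketch of that suggestion (the element count matches the question; the value computed is a stand-in):
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 172490752;
    std::vector<float> data;
    data.reserve(n);                                    // one allocation up front
    for (std::size_t i = 0; i < n; ++i)
        data.push_back(static_cast<float>(i) * 0.5f);   // stand-in for the real data
}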
The usual way of speeding up a vector when you know the size beforehand is to call reserve on it before using push_back. This eliminates the overhead of reallocating memory and copying the data every time the previous capacity is filled.
Sometimes for very demanding applications this won't be enough. Even though push_back won't reallocate, it still needs to check the capacity every time. There's no way to know how bad this is without benchmarking, since modern processors are amazingly efficient when a branch is always/never taken.
You could try resize instead of reserve and use array indexing, but the resize forces a default initialization of every element; this is a waste if you know you're going to set a new value into every element anyway.
An alternative would be to use std::unique_ptr<float[]> and allocate the storage yourself.
Consider ::boost::container::stable_vector. Notice that allocating a contiguous block of roughly 172 million * 4 bytes (about 690 MB) might easily fail and requires quite a lot of page juggling. A stable vector is essentially a list of smaller vectors or arrays of reasonable size. You may also want to populate it in parallel.
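A minimal sketch of that suggestion, assuming Boost is available (stable_vector mirrors the std::vector interface, so reserve and push_back work the same way):
#include <cstddef>
#include <boost/container/stable_vector.hpp>

int main() {
    const std::size_t n = 172490752;
    boost::container::stable_vector<float> data;
    data.reserve(n);                      // storage is split into many small blocks,
    for (std::size_t i = 0; i < n; ++i)   // so no single ~690 MB contiguous allocation
        data.push_back(0.0f);
}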
You could use a custom allocator which avoids default initialisation of all elements, as discussed in this answer, in conjunction with ordinary element access:
const size_t N = 172490752;
std::vector<float, uninitialised_allocator<float> > vec(N);
for(size_t i=0; i!=N; ++i)
vec[i] = the_value_for(i);
This avoids (i) default initializing all elements, (ii) checking for capacity at every push, and (iii) reallocation, but at the same time preserves all the convenience of using std::vector (rather than std::unique_ptr<float[]>). However, the allocator template parameter is unusual, so you will need to use generic code rather than std::vector-specific code.
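For reference, a minimal sketch of what such an allocator can look like, modelled on the usual allocator-adaptor approach (treat it as an illustration, not production code):
#include <memory>
#include <new>
#include <utility>

// Allocator adaptor whose zero-argument construct() performs default-initialization
// (i.e. leaves trivial types such as float uninitialized) instead of
// value-initialization; everything else is forwarded to the base allocator.
template <typename T, typename A = std::allocator<T>>
struct uninitialised_allocator : A {
    using A::A;  // inherit the base allocator's constructors

    template <typename U>
    struct rebind {
        using other = uninitialised_allocator<
            U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };

    template <typename U>
    void construct(U* p) {
        ::new (static_cast<void*>(p)) U;   // default-init, no zeroing
    }

    template <typename U, typename... Args>
    void construct(U* p, Args&&... args) {
        std::allocator_traits<A>::construct(static_cast<A&>(*this), p,
                                            std::forward<Args>(args)...);
    }
};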
I have two answers for you:
As previous answers have pointed out, using reserve to allocate the storage beforehand can be quite helpful, but:
push_back (and emplace_back) itself has a performance penalty: during every call, it has to check whether the vector has to be reallocated. If you already know the number of elements you will insert, you can avoid this penalty by setting the elements directly using the access operator [].
So the most efficient way I would recommend is:
Initialize the vector with the 'fill'-constructor:
std::vector<float> values(172490752, 0.0f);
Set the entries directly using the access operator:
values[i] = some_float;
++i;
The reason push_back is slow is that it will need to copy all the data several times as the vector grows, and even when it doesn’t need to copy data it needs to check. Vectors grow quickly enough that this doesn’t happen often, but it still does happen. A rough rule of thumb is that every element will need to be copied on average once or twice; the earlier elements will need to be copied a lot more, but almost half the elements won’t need to be copied at all.
You can avoid the copying, but not the checks, by calling reserve on the vector when you create it, ensuring it has enough space. You can avoid both the copying and the checks by creating it with the right size from the beginning, by giving the number of elements to the vector constructor, and then inserting using indexing as Tobias suggested; unfortunately, this also goes through the vector an extra time initializing everything.
If you know the number of floats at compile time and not just runtime, you could use an std::array, which avoids all these problems. If you only know the number at runtime, I would second Mark’s suggestion to go with std::unique_ptr<float[]>. You would create it with
size_t size = /* Number of floats */;
auto floats = std::unique_ptr<float[]>{new float[size]};
You don’t need to do anything special to delete this; when it goes out of scope it will free the memory. In most respects you can use it like a vector, but it won’t automatically resize.
It's known that std::vector holds its data on the heap, so the instance of the vector itself and the first element have different addresses. On the other hand, std::array is a lightweight wrapper around a raw array, and its address is equal to the first element's address.
Let's assume that the sizes of the collections are big enough to hold one cache line of int32. On my machine, with 384 kB of L1 cache, that's 98304 numbers.
If I iterate over the std::vector, it turns out that I always access the address of the vector itself first and then the element's address. And probably these addresses are not in the same cache line, so every element access is a cache miss.
But if I iterate over the std::array, the addresses are in the same cache line, so it should be faster.
I tested with VS2013 with full optimization and std::array is approx 20% faster.
Am I right in my assumptions?
Update: in order to not create the second similar topic. In this code I have an array and some local variable:
void test(array<int, 10>& arr)
{
int m{ 42 };
for (int i{ 0 }; i < arr.size(); ++i)
{
arr[i] = i * m;
}
}
In the loop I'm accessing both an array and a stack variable which are placed far from each other in memory. Does that mean that every iteration I'll access different memory and miss the cache?
Many of the things you've said are correct, but I do not believe that you're seeing cache misses at the rate that you believe you are. Rather, I think you're seeing other effects of compiler optimizations.
You are right that when you look up an element in a std::vector, there are two memory reads: first, a read of the pointer to the elements; second, a read of the element itself. However, if you do multiple sequential reads on the std::vector, then chances are that the very first read will have a cache miss on the elements, but all successive reads will either be in cache or be unavoidable.

Memory caches are optimized for locality of reference, so whenever a single address is pulled into cache, a large number of adjacent memory addresses are pulled in as well. As a result, if you iterate over the elements of a std::vector, most of the time you won't have any cache misses at all, and the performance should look quite similar to that of a regular array.

It's also worth remembering that the cache stores multiple different memory locations, not just one, so the fact that you're reading both something on the stack (the std::vector's internal pointer) and something on the heap (the elements), or two different elements on the stack, won't immediately cause a cache miss.
Something to keep in mind is that cache misses are extremely expensive compared to cache hits - often 10x slower - so if you were indeed seeing a cache miss on each element of the std::vector you wouldn't see a gap of only 20% in performance. You'd see something a lot closer to a 2x or greater performance gap.
So why, then, are you seeing a difference in performance? One big factor that you haven't yet accounted for is that if you use a std::array<int, 10>, then the compiler can tell at compile-time that the array has exactly ten elements in it and can unroll or otherwise optimize the loop you have to eliminate unnecessary checks. In fact, the compiler could in principle replace the loop with 10 sequential blocks of code that all write to a specific array element, which might be a lot faster than repeatedly branching backwards in the loop. On the other hand, with equivalent code that uses std::vector, the compiler can't always know in advance how many times the loop will run, so chances are it can't generate code that's as good as the code it generated for the array.
Then there's the fact that the code you've written here is so small that any attempt to time it is going to have a ton of noise. It would be difficult to assess how fast this is reliably, since something as simple as just putting it into a for loop would mess up the cache behavior compared to a "cold" run of the method.
Overall, I wouldn't attribute this to cache misses, since I doubt there's any appreciably different number of them. Rather, I think this is compiler optimization on arrays whose sizes are known statically compared with optimization on std::vectors whose sizes can only be known dynamically.
I think it has nothing to do with cache miss.
You can think of std::array as a wrapper around a raw array, i.e. int arr[10], and of vector as a wrapper around a dynamic array, i.e. new int[10]. They should have the same performance. However, when you access the vector, you operate on the dynamic array through a pointer. Normally the compiler can optimize code with arrays better than code with pointers, and that might be the reason for your test result: std::array is faster.
You can run a test replacing std::array with int arr[10]. Although std::array is just a wrapper around int arr[10], you might get even better performance (in some cases the compiler can do better optimization with a raw array). You can also run another test replacing the vector with new int[10]; they should have equal performance.
For your second question: the local variable, i.e. m, will be kept in a register (if optimized properly), and there will be no access to the memory location of m during the for loop. So it won't be a cache-miss problem either.
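If you want to run that comparison, a sketch of the three variants could look like this (the loop body mirrors the one in the question; the function names are made up):
#include <array>
#include <cstddef>
#include <vector>

void test_std_array(std::array<int, 10>& arr) {
    int m = 42;
    for (std::size_t i = 0; i < arr.size(); ++i)
        arr[i] = static_cast<int>(i) * m;
}

void test_raw_array(int (&arr)[10]) {
    int m = 42;
    for (std::size_t i = 0; i < 10; ++i)
        arr[i] = static_cast<int>(i) * m;
}

void test_vector(std::vector<int>& arr) {   // or an int* obtained from new int[10]
    int m = 42;
    for (std::size_t i = 0; i < arr.size(); ++i)
        arr[i] = static_cast<int>(i) * m;
}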
I have a piece of code that is really dirty.
I want to optimize it a little bit. Does it make any difference which of the following structures I use, or are they identical from a performance point of view in C++?
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
....
or
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
if
if ... else ...
if
if ... else ...
....
for end
Thanks in Advance!
Both are O(n). As we do not know the guts of the various for loops it is impossible to say.
BTW - Mark it as pseudo code and not C++
The 1st one may spend less time incrementing/testing i and conditionally branching (assuming the compiler's optimiser doesn't reduce it to the equivalent of the second one anyway), but with loop unrolling the time taken for the i loop may be insignificant compared to the time spent within the loop anyway.
Countering that, it's easily possible that the choice of separate versus combined loops will affect the ratio of cache hits, and that could significantly impact either version: it really depends on the code. For example, if each of the three if/else statements accessed different arrays at index i, then they'll be competing for CPU cache and could slow each other down. On the other hand, if they accessed the same array at index i, doing different steps in some calculation, then it's probably better to do all three steps while those memory pages are still in cache.
There are potential impacts other than caches - from impact to register allocation, speed of I/O devices (e.g. if each loop operates on lines/records from a distinct file on different physical drives, it's very probably faster to process some of each file in a loop, rather than sequentially process each file), etc..
If you care, benchmark your actual application with representative data.
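To make the trade-off concrete, here is a sketch of the two shapes with invented array names and work; which one wins depends entirely on the access pattern:
#include <cstddef>
#include <vector>

void combined(std::vector<float>& a, std::vector<float>& b, std::vector<float>& c) {
    // one pass: a[i], b[i] and c[i] compete for cache within each iteration,
    // but each index is only visited once
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] += 1.0f;
        b[i] *= 2.0f;
        c[i] -= a[i];
    }
}

void split(std::vector<float>& a, std::vector<float>& b, std::vector<float>& c) {
    // three passes: each loop streams through a single array,
    // at the cost of re-running the index bookkeeping three times
    for (std::size_t i = 1; i < a.size(); ++i) a[i] += 1.0f;
    for (std::size_t i = 1; i < b.size(); ++i) b[i] *= 2.0f;
    for (std::size_t i = 1; i < c.size(); ++i) c[i] -= a[i];
}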
Just from the structure of the loop it is not possible to say which approach will be faster.
Algorithmically, both have the same complexity, O(n). However, they might have different performance numbers depending upon the kind of operation you are performing on the elements and the size of the container.
The size of the container may have an impact on locality and hence on performance. Generally speaking, you would like to chew the data as much as you can once you get it into the cache, so I would prefer the second approach. To get a clear picture, you should actually measure the performance of your approach.
The second is only slightly more efficient than the first. You save:
Initialization of loop index
Calling size()
Comparing the loop index with the size()
Incrementing the loop index
These are very minor optimizations. Do it if it doesn't impact readability.
I would expect the second approach to be at least marginally more optimal in most cases as it can leverage the locality of reference with respect to access to elements of the entity collection/set. Note that in the first approach, each for loop would need to start accessing elements from the beginning; depending on the size of the cache, the size of the list and the extent to which compiler can infer and optimize, this may lead to cache misses when a new for loop attempts to read an element even though that element would have been read already by a preceding loop.
I have to do an extensive calculation on a big vector of integers. The vector size is not changed during the calculation. The size of the vector is frequently accessed by the code. What is faster in general: using the vector::size() function or using a helper constant vectorSize storing the size of the vector?
I know that compilers are usually able to inline the size() function when the proper compiler flags are set; however, making a function inline is something that a compiler may do but cannot be forced to do.
Interesting question.
So, what's going to happen? Well, if you debug with gdb you'll see something like 3 member variables (names are not accurate):
_M_begin: pointer to the first element of the dynamic array
_M_end: pointer one past the last element of the dynamic array
_M_capacity: pointer one past the last element that could be stored in the dynamic array
The implementation of vector<T,Alloc>::size() is thus usually reduced to:
return _M_end - _M_begin; // Note: _Mylast - _Myfirst in VC 2008
Now, there are 2 things to consider when regarding the actual optimizations possible:
will this function be inlined? Probably: I am no compiler writer, but it's a good bet, since the overhead of a function call would dwarf the actual time here, and since it's templated we have all the code available in the translation unit
will the result be cached (ie sort of having an unnamed local variable): it could well be, but you won't know unless you disassemble the generated code
In other words:
If you store the size yourself, there is a good chance it will be as fast as the compiler could get it.
If you do not, it will depend on whether the compiler can establish that nothing else is modifying the vector; if not, it cannot cache the variable, and will need to perform memory reads (L1) every time.
It's a micro-optimization. In general, it will be unnoticeable, either because the performance does not matter or because the compiler will perform it regardless. In a critical loop where the compiler does not apply the optimization, it can be a significant improvement.
As I understand the 1998 C++ specification, vector<T>::size() takes constant time, not linear time. So, this question likely boils down to whether it's faster to read a local variable than calling a function that does very little work.
I'd therefore claim that storing your vector's size() in a local variable will speed up your program by a small amount, since you'll only call that function (and therefore the small constant amount of time it takes to execute) once instead of many times.
Performance of vector::size(): is it as fast as reading a variable?
Probably not.
Does it matter?
Probably not.
Unless the work you're doing per iteration is tiny (like one or two integer operations) the overhead will be insignificant.
In every implementation I've seen, vector::size() performs a subtraction of end() and begin(), i.e. it's not as fast as reading a variable.
When implementing a vector, the implementer has to choose which shall be fastest, end() or size(), i.e. whether to store the number of initialized elements or the pointer/iterator to the element after the last initialized one.
In other words; iterate by using iterators.
If you are worried of the size() performance, write your index based for loop like this;
for (size_t i = 0, i_end = container.size(); i < i_end; ++i){
// do something performance critical
}
I always save vector.size() in a local variable (if the size doesn't change inside the loop!).
Saving it in a local variable instead of calling it on each iteration can be faster.
At least, that's what I experienced.
I can't give you any real numbers, as I tested this a very long time ago. However from what I can recall, it made a noticeable difference (however potentially only in debug mode), especially when nesting loops.
And to all the people complaining about micro-optimization:
It's a single additional line of code that introduces no downsides.
You could write yourself a functor for your loop body and call it via std::for_each. It does the iteration for you, and then your question becomes moot. However, you're introducing a function call (that may or may not get inlined) for every loop iteration, so you'd best profile it if you're not getting the performance you expect.
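A minimal sketch of that approach (the doubling is just a stand-in for the real loop body):
#include <algorithm>
#include <vector>

struct LoopBody {
    void operator()(int& x) const { x *= 2; }   // stand-in for the real work
};

void process(std::vector<int>& v) {
    std::for_each(v.begin(), v.end(), LoopBody());  // no call to size() anywhere
}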
Always get a profile of your application before looking at this sort of micro optimization. Remember that even if it performs a subtraction, the compiler could still easily optimize it in many ways that would negate any performance loss.
In scripting languages like PHP having a for loop like this would be a very bad idea:
string s("ABCDEFG");
int i;
for( i = 0; i < s.length(); i ++ )
{
cout << s[ i ];
}
This is an example, i'm not building a program like this. (For the guys that feel like they have to tell me why this piece of code <insert bad thing about it here>)
If this C++ example were translated to a similar PHP script, the length of the string would be calculated on every loop cycle. That would cause an enormous performance loss in realistic scripts.
I thought the same would apply to C++ programs but when I take a look at tutorials, several open-source libraries and other pieces of code I see that the limiter for the loop isn't precalculated.
Should I precalculate the length of the string s?
Why isn't the limiter always precalculated? (seen this in tutorials and examples)
Is there some sort of optimization done by the compiler?
It's all relative.
PHP is interpreted, but if s.length drops into a compiled part of the PHP interpreter, it will not be slow. But even if it is slow, what about the time spent in s[i], and what about the time spent in cout <<?
It's really easy to focus on loop overhead while getting swamped with other stuff.
Like if you wrote this in C++, and cout were writing to the console, do you know what would dominate? cout would, far and away, because that innocent-looking << operator invokes a huge pile of library code and system routines.
You should learn to justify simpler code. Try to convince yourself that sooner or later you can replace the string::length implementation with a more optimized one. (Even though your project will most likely miss all its deadlines, and optimizing string::length will be the least of your problems.) This kind of thinking will help you focus on things that really matter, even though it's not always easy...
It depends on how the string is implemented.
With null-terminated strings you have to calculate the size on every iteration.
std::string is a container and its size should be returned in O(1) time, but again, it depends on the implementation.
The optimizer may indeed be able to optimize the call to length away if it's able to determine that its value won't change. Nevertheless, you're on the safe side if you precalculate it (in many cases, however, that optimization won't be possible because it's not clear to the compiler whether the condition variable could possibly be changed during the loop).
In many cases, it just doesn't matter because the loop in question is not performance-relevant. Using the classic for(int i=0; i < somewhat(); ++i) is both less work to type and easier to read than for(int i=0, end=somewhat(); i < end; ++i).
Note that the C++ compiler will usually inline small functions, such as length() (which typically just returns a precalculated length stored in the string object). Interpreted scripting languages usually need a dictionary lookup for a function call, so for C++ the relative overhead of the redundant check once per loop iteration is probably much smaller.
You're correct, s.length() will normally be evaluated on every loop iteration. You're better off writing:
size_t len = s.length();
for (size_t i = 0; i < len; ++i) {
...
}
Instead of the above. That said, for a loop with only a few iterations, it doesn't really matter how often the call to length() will be made.
I don't know about PHP, but I can tell you what C++ does.
Consider:
std::string s("Rajendra");
for (unsigned int i = 0; i < s.length(); i++)
{
std::cout << s[i] << std::endl;
}
If you look up the definition of length() (right-click on length() and click on "Go To Definition"), or, if you are using Visual Assist X, place the cursor on length() and press Alt+G, you will find the following:
size_type __CLR_OR_THIS_CALL length() const
{ // return length of sequence
return (_Mysize);
}
Here _Mysize is of integer type, which clearly reveals that the length of the string is precalculated and only the stored value is returned on each call to length().
However, IMPO (in my personal opinion), this coding style is bad and best avoided. I would prefer the following:
std::string s("Rajendra");
int len = s.length();
for (unsigned int i = 0; i < len; i++)
{
std::cout << s[i] << std::endl;
}
This way, you save the overhead of calling the length() function once per character of the string, which avoids repeatedly pushing and popping a stack frame. This can be expensive when your string is large.
Hope that helps.
Probably.
For readability.
Sometimes. It depends on how good the compiler is at detecting that the length will not change inside the loop.
Short answer: because there are situations where you want it called each time.
someone else's explanation: http://bytes.com/topic/c/answers/212351-loop-condition-evaluation
Well - as this is a very common scenario, most compilers will precalculate the value. Especially when looping through arrays and very common types - string might be one of them.
In addition, introducing an additional variable might destroy some other loop optimizations - it really depends on the compiler you use and might change from version to version.
Thus in some scenarios, the "optimization" could backfire.
If the code is not a real "hot spot" where every tick of performance matters, you should write it as you did: no "manual" precalculation.
Readability is also very important when writing code! Optimizations should be done very carefully and only after intensive profiling!
std::string::length() returns a fixed value that is stored in the container. It is already precalculated.
In this particular case, std::string.length() is usually (but not necessarily) a constant-time operation and usually pretty efficient.
For the general case, the loop termination condition can be any expression, not just a comparison of the loop index variable. Indeed, C/C++ does not recognize any particular index variable, just an initializer expression, a loop-test expression, and an increment expression that is executed on every pass. The C/C++ for loop is basically syntactic sugar for a while loop.
The compiler may be able to save the result of the call and optimize away all of the extra function calls, but it may not. However, the cost of the function call will be quite low since all it has to do is return an int. On top of that, there's a good chance that it will be inlined, removing the cost of the function call altogether.
However, if you really care, you should profile your code and see whether precalculating the value makes it any faster. But it wouldn't hurt to just choose to precalculate it anyway. It won't cost you anything. Odds are, in most cases though, it doesn't matter. There are some containers - like list - where size() might not be an O(1) operation, and then precalculating would be a really good idea, but for most it probably doesn't matter much - especially if the contents of your loop are enough to overwhelm the cost of such an efficient function call. For std::string, it should be O(1), and will probably be optimized away, but you'd have to test to be sure - and of course things like the optimization level that you compile at could change the results.
It's safer to precalculate but often not necessary.
std::string::length() returns a precalculated value. Other STL containers recalculate their size every time you call the method size(); e.g. std::list::size() recalculates the size, whereas std::vector::size() returns a precalculated value.
It depends on how the internal storage of the container is implemented.
std::vector is an array with capacity 2^n and std::list is a linked list.
You could precalculate the length of the string only if you KNOW the string won't ever change inside the loop.
I don't know why it is done this way in tutorials. A few guesses:
1) To get you in the habit so you don't get hosed when you are changing the value of the string inside the loop.
2) Because it is easier to understand.
Yes, the optimizer will try to improve this if it can determine that the string won't change.
Just for information, on my computer, g++ 4.4.2, with -O3, with the given piece of code, the function std::string::length() const is called 8 times.
I agree it's precalculated and the function is very likely to be inlined. Still very interesting to know when you are using your own functions/class.
If your program performs a lot of operations on strings all the time, like copying strings (with memcpy), then it makes sense to cache the length of the string.
A good example of this I have seen in opensource code is redis.
In sds.h (Simple Dynamic Strings), look at the sdshdr structure:
struct sdshdr {
long len;
long free;
char buf[];
};
As you can see, it caches the length of the character array buf (in the len field) and also the free space available after the null character (in the free field).
So the total memory allocated to buf is
len + free
This saves a realloc in case the buf needs to be modified and it fits in the space already available.
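For illustration, a rough C++ rendering of the same idea (this is not the actual redis code, just a sketch of why caching len and free pays off):
#include <cstdlib>   // std::realloc
#include <cstring>   // std::memcpy

// Keeping the used length and the spare capacity next to the buffer lets
// append() skip strlen() entirely and only reallocate when the spare space
// runs out.
struct CachedString {
    std::size_t len  = 0;        // bytes in use (what sdshdr::len caches)
    std::size_t free = 0;        // spare bytes after the used part (sdshdr::free)
    char*       buf  = nullptr;  // character data, NUL-terminated

    void append(const char* src, std::size_t n) {
        if (free < n) {                               // grow only when needed
            std::size_t new_cap = (len + n) * 2;      // simplistic growth policy
            buf  = static_cast<char*>(std::realloc(buf, new_cap + 1));
            free = new_cap - len;
        }
        std::memcpy(buf + len, src, n);
        len  += n;
        free -= n;
        buf[len] = '\0';                              // keep it usable as a C string
    }
};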