Why is std::vector slower than an array? [duplicate] - c++

This question already has answers here:
Performance: memset
(2 answers)
Why might std::vector be faster than a raw dynamically allocated array?
(2 answers)
Why is iterating though `std::vector` faster than iterating though `std::array`?
(2 answers)
Idiomatic way of performance evaluation?
(1 answer)
Closed 3 years ago.
When I run the following program (with optimization on), the for loop with the std::vector takes about 0.04 seconds while the for loop with the array takes 0.0001 seconds.
#include <iostream>
#include <vector>
#include <chrono>
int main()
{
int len = 800000;
int* Data = new int[len];
int arr[3] = { 255, 0, 0 };
std::vector<int> vec = { 255, 0, 0 };
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < len; i++) {
Data[i] = vec[0];
}
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "The vector took " << elapsed.count() << "seconds\n";
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < len; i++) {
Data[i] = arr[0];
}
finish = std::chrono::high_resolution_clock::now();
elapsed = finish - start;
std::cout << "The array took " << elapsed.count() << "seconds \n";
char s;
std::cin >> s;
delete[] Data;
}
The code is a simplified version of a performance issue I was having while writing a raycaster. The len variable corresponds to how many times the loop in the original program needs to run (400 pixels * 400 pixels * 50 maximum render distance). For complicated reasons (perhaps that I don't fully understand how to use arrays) I have to use a vector rather than an array in the actual raycaster. However, as this program demonstrates, that would only give me 20 frames per second as opposed to the envied 10,000 frames per second that using an array would supposedly give me (obviously, this is just a simplified performance test). But regardless of how accurate those numbers are, I still want to boost my frame rate as much as possible. So, why is the vector performing so much slower than the array? Is there a way to speed it up? Thanks for your help. And if there's anything else I'm doing weirdly that might be affecting performance, please let me know. I didn't even know about optimization until researching an answer for this question, so if there are any more things like that which might boost the performance, please let me know (and I'd prefer if you explained where those settings are in the properties manager rather than command line since I don't yet know how to use the command line)

Let us observe how GCC optimizes this test program:
#include <vector>
int main()
{
int len = 800000;
int* Data = new int[len];
int arr[3] = { 255, 0, 0 };
std::vector<int> vec = { 255, 0, 0 };
for (int i = 0; i < len; i++) {
Data[i] = vec[0];
}
for (int i = 0; i < len; i++) {
Data[i] = arr[0];
}
delete[] Data;
}
The compiler rightly notices that the vector is constant, and eliminates it. Exactly same code is generated for both loops. Therefore it should be irrelevant whether the first loop uses array or vector.
.L2:
movups XMMWORD PTR [rcx], xmm0
add rcx, 16
cmp rsi, rcx
jne .L2
What makes difference in your test program is the order of loops. The comments point out that when a third loop is added to the beginning, both loops take the same time.
I would expect that with a modern compiler accessing a vector would be approximately as fast as accessing an array, when optimization is enabled and debug is disabled. If there is an observable difference in your actual program, the problem lies somewhere else.

It is about caches. I dont know how it works detailed but Data[] is getting known better by cpu while it is used. If you reverse the order of calculation you can see 'vector is faster'.
But actually, you are testing neither vector nor array. Let's assume that vec[0] resides at 0x01 memory location, arr[0] resides at 0xf1. Only difference is reading a word from different single memory adresses. So you are testing how fast can I assign a value to elements of dynamically allocated array.
Note: std::chrono::high_resolution_clock might not be sufficient to measure ticks. It is better to use steady_clock as cppreference says.

Related

Doesn't see any significant improvement while using parallel block in OpenMP C++

I am receiving an array of Eigen::MatrixXf and Eigen::Matrix4f in realtime. Both of these arrays are having an equal number of elements. All I am trying to do is just multiply elements of both the arrays together and storing the result in another array at the same index.
Please see the code snippet below-
#define COUNT 4
while (all_ok())
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
// at each iteration, new data is filled
// in 'trans' and 'in_data' variables
#pragma omp parallel num_threads(COUNT)
{
#pragma omp for
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_clouds[i];
}
}
Please note that COUNT is a constant. The size of trans and in_data is (4 x 4) and (4 x n) respectively, where n is approximately 500,000. In order to parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of for loop.
Any suggestions? Any alternatives to perform the same operation, please?
Edit: My idea is to define 4 (=COUNT) threads wherein each of them is taking care of multiplication. In this way, we don't need to create threads every time, I guess!
Works for me using the following self-contained example, that is, I get a x4 speed up when enabling openmp:
#include <iostream>
#include <bench/BenchTimer.h>
using namespace Eigen;
const int COUNT = 4;
EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
#pragma omp parallel for num_threads(COUNT)
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_data[i];
}
int main()
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
int n = 500000;
for (int i = 0; i < COUNT; i++)
{
trans[i].setRandom();
in_data[i].setRandom(4,n);
out_data[i].setRandom(4,n);
}
int tries = 3;
int rep = 1;
BenchTimer t;
BENCH(t, tries, rep, foo(trans, in_data, out_data));
std::cout << " " << t.best(Eigen::REAL_TIMER) << " (" << double(n)*4.*4.*4.*2.e-9/t.best() << " GFlops)\n";
return 0;
}
So 1) make sure you measure the wallclock time and not the CPU time, and 2) make sure that the products is the bottleneck and not filling in_data.
Finally, for maximal performance don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler's optimization ON.
For the record, on my computer the above example takes 0.25s without openmp, and 0.065s with.
You need to specify -fopenmp during compilation and linking. But you will quickly hit the limit, where RAM access is stopping further speeding up. You really should have a look at vector intrinsics. Dependent on you CPU you could accelerate your operations to the size of your register divided by the size of your variable (float = 4). So if your processor supports say AVX, you'd be dealing with 8 floats at a time. If you need some inspiration, you're welcome to steal code from my medical image reconstruction library here:
https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp
The code does the whole shebang for float/double real and complex.

running speed of permutation function using different methods results in unexpected results

I have implemented a isPermutation function which given two string will return true if the two are permutation of each other, otherwise it will return false.
One uses c++ sort algorithm twice, while the other uses an array of ints to keep track of string count.
I ran the code several times and every time the sorting method is faster. Is my array implementation wrong?
Here is the output:
1
0
1
Time: 0.088 ms
1
0
1
Time: 0.014 ms
And the code:
#include <iostream> // cout
#include <string> // string
#include <cstring> // memset
#include <algorithm> // sort
#include <ctime> // clock_t
using namespace std;
#define MAX_CHAR 255
void PrintTimeDiff(clock_t start, clock_t end) {
std::cout << "Time: " << (end - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
}
// using array to keep a count of used chars
bool isPermutation(string inputa, string inputb) {
int allChars[MAX_CHAR];
memset(allChars, 0, sizeof(int) * MAX_CHAR);
for(int i=0; i < inputa.size(); i++) {
allChars[(int)inputa[i]]++;
}
for (int i=0; i < inputb.size(); i++) {
allChars[(int)inputb[i]]--;
if(allChars[(int)inputb[i]] < 0) {
return false;
}
}
return true;
}
// using sorting anc comparing
bool isPermutation_sort(string inputa, string inputb) {
std::sort(inputa.begin(), inputa.end());
std::sort(inputb.begin(), inputb.end());
if(inputa == inputb) return true;
return false;
}
int main(int argc, char* argv[]) {
clock_t start = clock();
cout << isPermutation("god", "dog") << endl;
cout << isPermutation("thisisaratherlongerinput","thisisarathershorterinput") << endl;
cout << isPermutation("armen", "ramen") << endl;
PrintTimeDiff(start, clock());
start = clock();
cout << isPermutation_sort("god", "dog") << endl;
cout << isPermutation_sort("thisisaratherlongerinput","thisisarathershorterinput") << endl;
cout << isPermutation_sort("armen", "ramen") << endl;
PrintTimeDiff(start, clock());
return 0;
}
To benchmark this you have to eliminate all the noise you can.
The easiest way to do this is to wrap it in a loop that repeats the call to each 1000 times or so, then only spit out the value every 10 iterations. This way they each have a similar caching profile. Throw away values that are bogus due (eg blowouts due to context switches by the OS).
I got your method marginally faster by doing this. An excerpt.
method 1 array Time: 0.768 us
method 2 sort Time: 0.840333 us
method 1 array Time: 0.621333 us
method 2 sort Time: 0.774 us
method 1 array Time: 0.769 us
method 2 sort Time: 0.856333 us
method 1 array Time: 0.766 us
method 2 sort Time: 0.850333 us
method 1 array Time: 0.802667 us
method 2 sort Time: 0.89 us
method 1 array Time: 0.778 us
method 2 sort Time: 0.841333 us
I used rdtsc which works better for me on this system. 3000 cycles per microsecond is close enough for this, but please do make it more accurate if you care about precision of the readings.
#if defined(__x86_64__)
static uint64_t rdtsc()
{
uint64_t hi, lo;
__asm__ __volatile__ (
"xor %%eax, %%eax\n"
"cpuid\n"
"rdtsc\n"
: "=a"(lo), "=d"(hi)
:: "ebx", "ecx");
return (hi << 32)|lo;
}
#else
#error wrong architecture - implement me
#endif
void PrintTimeDiff(uint64_t start, uint64_t end) {
std::cout << "Time: " << (end - start)/double(3000) << " us" << std::endl;
}
you cannot check performance differences between implementations putting in the mix calls to std::cout. isPermutation and isPermutation_sort are some order of magnitude faster than a call to std::cout (and, anyway, prefer \n over std::endl).
for testing you have to activate compiler optimizations. Doing so the compiler will apply the loop-invariant code motion optimization and you'll probably get the same results for both implementations.
A more effective way of testing is:
int main()
{
const std::vector<std::string> bag
{
"god", "dog", "thisisaratherlongerinput", "thisisarathershorterinput",
"armen", "ramen"
};
static std::mt19937 engine;
std::uniform_int_distribution<std::size_t> rand(0, bag.size() - 1);
const unsigned stop = 1000000;
unsigned counter = 0;
std::clock_t start = std::clock();
for (unsigned i(0); i < stop; ++i)
counter += isPermutation(bag[rand(engine)], bag[rand(engine)]);
std::cout << counter << '\n';
PrintTimeDiff(start, clock());
counter = 0;
start = std::clock();
for (unsigned i(0); i < stop; ++i)
counter += isPermutation_sort(bag[rand(engine)], bag[rand(engine)]);
std::cout << counter << '\n';
PrintTimeDiff(start, clock());
return 0;
}
I have 2.4s for isPermutations_sort vs 2s for isPermutation (somewhat similar to Hal's results). Same with g++ and clang++.
Printing the value of counter has the double benefit of:
triggering the as-if rule (the compiler cannot remove the for loops);
allowing a first check of your implementations (the two values cannot be too distant).
There're some things you have to change in your implementation of isPermutation:
pass arguments as const references
bool isPermutation(const std::string &inputa, const std::string &inputb)
just this change brings the time down to 0.8s (of course you cannot do the same with isPermutation_sort).
you can use std::array and std::fill instead of memset (this is C++ :-)
avoid premature pessimization and prefer preincrement. Only use postincrement if you're going to use the original value
do not mix signed and unsigned value in the for loops (inputa.size() and i). i should be declared as std::size_t
even better, use the range based for loop.
So something like:
bool isPermutation(const std::string &inputa, const std::string &inputb)
{
std::array<int, MAX_CHAR> allChars;
allChars.fill(0);
for (auto c : inputa)
++allChars[(unsigned char)c];
for (auto c : inputb)
{
--allChars[(unsigned char)c];
if (allChars[(unsigned char)c] < 0)
return false;
}
return true;
}
Anyway both isPermutation and isPermutation_sort should have this preliminary check:
if (inputa.length() != inputb.length())
return false;
Now we are at 0.55s for isPermutation vs 1.1s for isPermutation_sort.
Last but not least consider std::is_permutation:
for (unsigned i(0); i < stop; ++i)
{
const std::string &s1(bag[rand(engine)]), &s2(bag[rand(engine)]);
counter += std::is_permutation(s1.begin(), s1.end(), s2.begin());
}
(0.6s)
EDIT
As observed in BeyelerStudios' comment Mersenne-Twister is too much in this case.
You can change the engine to a simpler one.:
static std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> engine;
This further lowers the timings. Luckily the relative speeds remain the same.
Just to be sure I've also checked with a non random access scheme obtaining the same relative results.
Your idea amounts to using a Counting Sort on both strings, but with the comparison happening on the count array, rather than after writing out sorted strings.
It works well because a byte can only have one of 255 non-zero values. Zeroing 256B of memory, or even 4*256B, is pretty cheap, so it works well even for fairly short strings, where most of the count array isn't touched.
It should be fairly good for very long strings, at least in some cases. It's pretty heavily dependent on a good and a heavily pipelined L1 cache, because scattered increments to the count array produces scattered read-modify-writes. Repeated occurrences create a dependency chain of with a store-load round-trip in it. This is a big glass-jaw for this algorithm, on CPUs where many loads and stores can be in flight at once (with their latencies happening in parallel). Modern x86 CPUs should run it pretty well, since they can sustain a load + store every clock cycle.
The initial count of inputa compiles to a very tight loop:
.L15:
movsx rdx, BYTE PTR [rax]
add rax, 1
add DWORD PTR [rsp-120+rdx*4], 1
cmp rax, rcx
jne .L15
This brings us to the first major bug in your code: char can be signed or unsigned. In the x86-64 ABI, char is signed, so allChars[(int)inputa[i]]++; sign-extends it for use as an array index. (movsx instead of movzx). Your code will write outside the array bounds on non-ASCII characters that have their high bit set. So you should have written allChars[(unsigned char)inputa[i]]++;. Note that casting to (unsigned) doesn't give the result we want (see comments).
Note that clang makes much worse code (v3.7.1 and v3.8, both with -O3), with a function call to std::basic_string<...>::_M_leak_hard() inside the inner loop. (Leak as in leak a reference, I think.) #manlio's version doesn't have this problem, so I guess for (auto c : inputa) syntax helps clang figure out what's happening.
Also, using std::string when your callers have char[] forces them to construct a std::string. That's kind of silly, but it is helpful to be able to compare string lengths.
GNU libc's std::is_permutation uses a very different strategy:
First, it skips any common prefix that's identical without permutation in both strings.
Then, for each element in inputa:
count the occurrences of that element in inputb. Check that it matches the count in inputa.
There are a couple optimizations:
Only compare counts the first time an element is seen: find duplicates by searching from the beginning of inputa, and if the match position isn't the current position, we've already checked this element.
check that the match count in inputb is != 0 before counting matches in the rest of inputa.
This doesn't need any temporary storage, so it can work when the elements are large. (e.g. an array of int64_t, or an array of structs).
If there is a mismatch, this is likely to find it early, before doing as much work. There are probably a few cases of inputs where the counting version would take less time, but probably for most inputs the library algorithm is best.
std::is_permutation uses std::count, which should be implemented very well with SSE / AVX vectors. Unfortunately, it's auto-vectorized in a really stupid way by both gcc and clang. It unpacks bytes to 64bit integers before accumulating them in vector elements, to avoid overflow. So it spends most of its instructions shuffling data around, and is probably slower than a scalar implementation (which you'd get from compiling with -O2, or with -O3 -fno-tree-vectorize).
It could and should only do this every few iterations, so the inner loop of count can just be something like pcmpeqb / psubb, with a psadbw every 255 iterations. Or pcmpeqb / pmovmskb / popcnt / add, but that's slower.
Template specializations in the library could help a lot for std::count for 8, 16, and 32bit types whose equality can be checked with bitwise equality (integer ==).

C++ Array vs vector

when using C++ vector, time spent is 718 milliseconds,
while when I use Array, time is almost 0 milliseconds.
Why so much performance difference?
int _tmain(int argc, _TCHAR* argv[])
{
const int size = 10000;
clock_t start, end;
start = clock();
vector<int> v(size*size);
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
v[i*size+j] = 1;
}
}
end = clock();
cout<< (end - start)
<<" milliseconds."<<endl; // 718 milliseconds
int f = 0;
start = clock();
int arr[size*size];
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
arr[i*size+j] = 1;
}
}
end = clock();
cout<< ( end - start)
<<" milliseconds."<<endl; // 0 milliseconds
return 0;
}
Your array arr is allocated on the stack, i.e., the compiler has calculated the necessary space at compile time. At the beginning of the method, the compiler will insert an assembler statement like
sub esp, 10000*10000*sizeof(int)
which means the stack pointer (esp) is decreased by 10000 * 10000 * sizeof(int) bytes to make room for an array of 100002 integers. This operation is almost instant.
The vector is heap allocated and heap allocation is much more expensive. When the vector allocates the required memory, it has to ask the operating system for a contiguous chunk of memory and the operating system will have to perform significant work to find this chunk of memory.
As Andreas says in the comments, all your time is spent in this line:
vector<int> v(size*size);
Accessing the vector inside the loop is just as fast as for the array.
For an additional overview see e.g.
[What and where are the stack and heap?
[http://computer.howstuffworks.com/c28.htm][2]
[http://www.cprogramming.com/tutorial/virtual_memory_and_heaps.html][3]
Edit:
After all the comments about performance optimizations and compiler settings, I did some measurements this morning. I had to set size=3000 so I did my measurements with roughly a tenth of the original entries. All measurements performed on a 2.66 GHz Xeon:
With debug settings in Visual Studio 2008 (no optimization, runtime checks, and debug runtime) the vector test took 920 ms compared to 0 ms for the array test.
98,48 % of the total time was spent in vector::operator[], i.e., the time was indeed spent on the runtime checks.
With full optimization, the vector test needed 56 ms (with a tenth of the original number of entries) compared to 0 ms for the array.
The vector ctor required 61,72 % of the total application running time.
So I guess everybody is right depending on the compiler settings used. The OP's timing suggests an optimized build or an STL without runtime checks.
As always, the morale is: profile first, optimize second.
If you are compiling this with a Microsoft compiler, to make it a fair comparison you need to switch off iterator security checks and iterator debugging, by defining _SECURE_SCL=0 and _HAS_ITERATOR_DEBUGGING=0.
Secondly, the constructor you are using initialises each vector value with zero, and you are not memsetting the array to zero before filling it. So you are traversing the vector twice.
Try:
vector<int> v;
v.reserve(size*size);
To get a fair comparison I think something like the following should be suitable:
#include <sys/time.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include <numeric>
int main()
{
static size_t const size = 7e6;
timeval start, end;
int sum;
gettimeofday(&start, 0);
{
std::vector<int> v(size, 1);
sum = std::accumulate(v.begin(), v.end(), 0);
}
gettimeofday(&end, 0);
std::cout << "= vector =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
gettimeofday(&start, 0);
int * const arr = new int[size];
std::fill(arr, arr + size, 1);
sum = std::accumulate(arr, arr + size, 0);
delete [] arr;
gettimeofday(&end, 0);
std::cout << "= Simple array =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
}
In both cases, dynamic allocation and deallocation is performed, as well as accesses to elements.
On my Linux box:
$ g++ -O2 foo.cpp
$ ./a.out
= vector =
(0 s, 21085 us)
sum = 7000000
= Simple array =
(0 s, 21148 us)
sum = 7000000
Both the std::vector<> and array cases have comparable performance. The point is that std::vector<> can be just as fast as a simple array if your code is structured appropriately.
On a related note switching off optimization makes a huge difference in this case:
$ g++ foo.cpp
$ ./a.out
= vector =
(0 s, 120357 us)
sum = 7000000
= Simple array =
(0 s, 60569 us)
sum = 7000000
Many of the optimization assertions made by folks like Neil and jalf are entirely correct.
HTH!
EDIT: Corrected code to force vector destruction to be included in time measurement.
Change assignment to eg. arr[i*size+j] = i*j, or some other non-constant expression. I think compiler optimizes away whole loop, as assigned values are never used, or replaces array with some precalculated values, so that loop isn't even executed and you get 0 milliseconds.
Having changed 1 to i*j, i get the same timings for both vector and array, unless pass -O1 flag to gcc, then in both cases I get 0 milliseconds.
So, first of all, double-check whether your loops are actually executed.
You are probably using VC++, in which case by default standard library components perform many checks at run-time (e.g whether index is in range). These checks can be turned off by defining some macros as 0 (I think _SECURE_SCL).
Another thing is that I can't even run your code as is: the automatic array is way too large for the stack. When I make it global, then with MingW 3.5 the times I get are 627 ms for the vector and 26875 ms (!!) for the array, which indicates there are really big problems with an array of this size.
As to this particular operation (filling with value 1), you could use the vector's constructor:
std::vector<int> v(size * size, 1);
and the fill algorithm for the array:
std::fill(arr, arr + size * size, 1);
Two things. One, operator[] is much slower for vector. Two, vector in most implementations will behave weird at times when you add in one element at a time. I don't mean just that it allocates more memory but it does some genuinely bizarre things at times.
The first one is the main issue. For a mere million bytes, even reallocating the memory a dozen times should not take long (it won't do it on every added element).
In my experiments, preallocating doesn't change its slowness much. When the contents are actual objects it basically grinds to a halt if you try to do something simple like sort it.
Conclusion, don't use stl or mfc vectors for anything large or computation heavy. They are implemented poorly/slowly and cause lots of memory fragmentation.
When you declare the array, it lives in the stack (or in static memory zone), which it's very fast, but can't increase its size.
When you declare the vector, it assign dynamic memory, which it's not so fast, but is more flexible in the memory allocation, so you can change the size and not dimension it to the maximum size.
When profiling code, make sure you are comparing similar things.
vector<int> v(size*size);
initializes each element in the vector,
int arr[size*size];
doesn't. Try
int arr[size * size];
memset( arr, 0, size * size );
and measure again...

Which one will be faster

Just calculating sum of two arrays with slight modification in code
int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
OR
int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
}
for(int i=0; i<10000; i++)
{
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
Which code will be faster.
There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (cpu, memory bandwidth etc).
The size of the data in your array (int's in your example) would affect the result, as this would have an impact into the use of the processor cache. Often, you will find example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 will access memory in a more efficient way).
Here's some code with timing, built using VS2005:
#include <windows.h>
#include <iostream>
using namespace std;
int main ()
{
LARGE_INTEGER
start,
middle,
end;
const int
count = 1000000;
int
*a = new int [count],
*b = new int [count],
*c = new int [count],
*d = new int [count],
suma = 0,
sumb = 0,
sumc = 0,
sumd = 0;
QueryPerformanceCounter (&start);
for (int i = 0 ; i < count ; ++i)
{
suma += a [i];
sumb += b [i];
}
QueryPerformanceCounter (&middle);
for (int i = 0 ; i < count ; ++i)
{
sumc += c [i];
}
for (int i = 0 ; i < count ; ++i)
{
sumd += d [i];
}
QueryPerformanceCounter (&end);
cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
cout << "Done." << endl << suma << sumb << sumc << sumd;
return 0;
}
Running this, the latter version is usually faster.
I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:
00401110 mov edx,dword ptr [eax-0Ch]
00401113 add edx,dword ptr [eax-8]
00401116 add eax,14h
00401119 add edx,dword ptr [eax-18h]
0040111C add edx,dword ptr [eax-10h]
0040111F add edx,dword ptr [eax-14h]
00401122 add ebx,edx
00401124 sub ecx,1
00401127 jne main+110h (401110h)
Here's the register usage:
eax = used to index the array
ebx = the grand total
ecx = loop counter
edx = sum of the five integers accessed in one iteration of the loop
There are a few interesting things here:
The compiler has unrolled the loop five times.
The order of memory access is not contiguous.
It updates the array index in the middle of the loop.
It sums five integers then adds that to the grand total.
To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.
In theory, due to cache optimizations the second one should be faster.
Caches are optimized to bring and keep chunks of data so that for the first access you'll get a big chunk of the first array into cache. In the first code, it may happen that when you access the second array you might have to take out some of the data of the first array, therefore requiring more accesses.
In practice both approach will take more or less the same time, being the first a little better given the size of actual caches and the likehood of no data at all being taken out of the cache.
Note: This sounds a lot like homework. In real life for those sizes first option will be slightly faster, but this only applies to this concrete example, nested loops, bigger arrays or specially smaller cache sizes would have a significant impact in performance depending on the order.
The first one will be faster. The compiler will not need to repeat the loop twice. Although not much work, bu some cycles are lost on incrementing the cycle variable and performing the check condition.
For me (GCC -O3) measuring shows that the second version is faster by some 25%, which can be explained with more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course you'll need to repeat the operation thousands of times before the difference becomes significant.
I also tried std::accumulate from the numeric header which is the simple way to implement the second version and was in turn a tiny amount faster than the second version (probably due to more compiler-friendly looping mechanism?):
sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);
The first one will be faster because you loop from 1 to 10000 only one time.
C++ Standard says nothing about it, it is implementation dependent. It is looks like you are trying to do premature optimization. It is shouldn't bother you until it is not a bottleneck in your program. If it so, you should use some profiler to find out which one will be faster on certain platform.
Until that, I'd prefer first variant because it looks more readable (or better std::accumulate).
If the data type size is enough large not to cache both variables (as example 1), but single variable (example 2), then the code of first example will be slower than the code of second example.
Otherwise code of first example will be faster than the second one.
The first one will probably be faster. The memory access pattern will allow the (modern) CPU to manage the caches efficiently (prefetch), even while accessing two arrays.
Much faster if your CPU allows it and the arrays are aligned: use SSE3 instructions to process 4 int at a time.
If you meant a[i] instead of a[10000] (and for b, respectively) and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.
If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).
Food for thought for some answers that were voted up: how many additions are performed in each version of the code?

Counting down in for-loops

I believe (from some research reading) that counting down in for-loops is actually more efficient and faster in runtime. My full software code is C++
I currently have this:
for (i=0; i<domain; ++i) {
my 'i' is unsigned resgister int,
also 'domain' is unsigned int
in the for-loop i is used for going through an array, e.g.
array[i] = do stuff
converting this to count down messes up the expected/correct output of my routine.
I can imagine the answer being quite trivial, but I can't get my head round it.
UPDATE: 'do stuff' does not depend on previous or later iteration. The calculations within the for-loop are independant for that iteration of i. (I hope that makes sense).
UPDATE: To achieve a runtime speedup with my for-loop, do I count down and if so remove the unsigned part when delcaring my int, or what other method?
Please help.
There is only one correct method of looping backwards using an unsigned counter:
for( i = n; i-- > 0; )
{
// Use i as normal here
}
There's a trick here, for the last loop iteration you will have i = 1 at the top of the loop, i-- > 0 passes because 1 > 0, then i = 0 in the loop body. On the next iteration i-- > 0 fails because i == 0, so it doesn't matter that the postfix decrement rolled over the counter.
Very non obvious I know.
I'm guessing your backward for loop looks like this:
for (i = domain - 1; i >= 0; --i) {
In that case, because i is unsigned, it will always be greater than or equal to zero. When you decrement an unsigned variable that is equal to zero, it will wrap around to a very large number. The solution is either to make i signed, or change the condition in the for loop like this:
for (i = domain - 1; i >= 0 && i < domain; --i) {
Or count from domain to 1 rather than from domain - 1 to 0:
for (i = domain; i >= 1; --i) {
array[i - 1] = ...; // notice you have to subtract 1 from i inside the loop now
}
This is not an answer to your problem, because you don't seem to have a problem.
This kind of optimization is completely irrelevant and should be left to the compiler (if done at all).
Have you profiled your program to check that your for-loop is a bottleneck? If not, then you do not need to spend time worrying about this. Even more so, having "i" as a "register" int, as you write, makes no real sense from a performance standpoint.
Even without knowing your problem domain, I can guarantee you that both the reverse-looping technique and the "register" int counter will have negligible impact on your program's performance. Remember, "Premature optimization is the root of all evil".
That said, better spent optimization time would be on thinking about the overall program structure, data structures and algorithms used, resource utilization, etc.
Checking to see if a number is zero can be quicker or more efficient than a comparison. But this is the sort of micro-optimization you really shouldn't worry about - a few clock cycles will be greatly dwarfed by just about any other perf issue.
On x86:
dec eax
jnz Foo
Instead of:
inc eax
cmp eax, 15
jl Foo
It has nothing to do with counting up or down. What can be faster is counting toward zero. Michael's answer shows why — x86 gives you a comparison with zero as an implicit side effect of many instructions, so after you adjust your counter, you just branch based on the result instead of doing an explicit comparison. (Maybe other architectures do that, too; I don't know.)
Borland's Pascal compilers are notorious for performing that optimization. The compiler transforms this code:
for i := x to y do
foo(i);
into an internal representation more akin to this:
tmp := Succ(y - x);
i := x;
while tmp > 0 do begin
foo(i);
Inc(i);
Dec(tmp);
end;
(I say notorious not because the optimization affects the outcome of the loop, but because the debugger displays the counter variable incorrectly. When the programmer inspects i, the debugger may display the value of tmp instead, causing no end of confusion and panic for programmers who think their loops are running backward.)
The idea is that even with the extra Inc or Dec instruction, it's still a net win, in terms of running time, over doing an explicit comparison. Whether you can actually notice that difference is up for debate.
But note that the conversion is something the compiler would do automatically, based on whether it deemed the transformation worthwhile. The compiler is usually better at optimizing code than you are, so don't spend too much effort competing with it.
Anyway, you asked about C++, not Pascal. C++ "for" loops aren't quite as easy to apply that optimization to as Pascal "for" loops are because the bounds of Pascal's loops are always fully calculated before the loop runs, whereas C++ loops sometimes depend on the stopping condition and the loop contents. C++ compilers need to do some amount of static analysis to determine whether any given loop could fit the requirements for the kind of transformation Pascal loops qualify for unconditionally. If the C++ compiler does the analysis, then it could do a similar transformation.
There's nothing stopping you from writing your loops that way on your own:
for (unsigned i = 0, tmp = domain; tmp > 0; ++i, --tmp)
array[i] = do stuff
Doing that might make your code run faster. Like I said before, though, you probably won't notice. The bigger cost you pay by manually arranging your loops like that is that your code no longer follows established idioms. Your loop is a perfectly ordinary "for" loop, but it no longer looks like one — it has two variables, they're counting in opposite directions, and one of them isn't even used in the loop body — so anyone reading your code (including you, a week, a month, or a year from now when you've forgotten the "optimization" you were hoping to achieve) will need to spend extra effort proving to himself or herself that the loop is indeed an ordinary loop in disguise.
(Did you notice that my code above used unsigned variables with no danger of wrapping around at zero? Using two separate variables allows that.)
Three things to take away from all this:
Let the optimizer do its job; on the whole it's better at it than you are.
Make ordinary code look ordinary so that the special code doesn't have to compete to get attention from people reviewing, debugging, or maintaining it.
Don't do anything fancy in the name of performance until testing and profiling show it to be necessary.
If you have a decent compiler, it will optimize "counting up" just as effectively as "counting down". Just try a few benchmarks and you'll see.
So you "read" that couting down is more efficient? I find this very difficult to believe unless you show me some profiler results and the code. I can buy it under some circumstances, but in the general case, no. Seems to me like this is a classic case of premature optimization.
Your comment about "register int i" is also very telling. Nowadays, the compiler always knows better than you how to allocate registers. Don't bother using using the register keyword unless you have profiled your code.
When you're looping through data structures of any sort, cache misses have a far bigger impact than the direction you're going. Concern yourself with the bigger picture of memory layout and algorithm structure instead of trivial micro-optimisations.
You may try the following, which compiler will optimize very efficiently:
#define for_range(_type, _param, _A1, _B1) \
for (_type _param = _A1, _finish = _B1,\
_step = static_cast<_type>(2*(((int)_finish)>(int)_param)-1),\
_stop = static_cast<_type>(((int)_finish)+(int)_step); _param != _stop; \
_param = static_cast<_type>(((int)_param)+(int)_step))
Now you can use it:
for_range (unsigned, i, 10,0)
{
cout << "backwards i: " << i << endl;
}
for_range (char, c, 'z','a')
{
cout << c << endl;
}
enum Count { zero, one, two, three };
for_range (Count, c, three, zero)
{
cout << "backwards: " << c << endl;
}
You may iterate in any direction:
for_range (Count, c, zero, three)
{
cout << "forward: " << c << endl;
}
The loop
for_range (unsigned,i,b,a)
{
// body of the loop
}
will produce the following code:
mov esi,b
L1:
; body of the loop
dec esi
cmp esi,a-1
jne L1
Hard to say with information given but... reverse your array, and count down?
Jeremy Ruten rightly pointed out that using an unsigned loop counter is dangerous. It's also unnecessary, as far as I can tell.
Others have also pointed out the dangers of premature optimization. They're absolutely right.
With that said, here is a style I used when programming embedded systems many years ago, when every byte and every cycle did count for something. These forms were useful for me on the particular CPUs and compilers that I was using, but your mileage may vary.
// Start out pointing to the last elem in array
pointer_to_array_elem_type p = array + (domain - 1);
for (int i = domain - 1; --i >= 0 ; ) {
*p-- = (... whatever ...)
}
This form takes advantage of the condition flag that is set on some processors after arithmetical operations -- on some architectures, the decrement and testing for the branch condition can be combined into a single instruction. Note that using predecrement (--i) is the key here -- using postdecrement (i--) would not have worked as well.
Alternatively,
// Start out pointing *beyond* the last elem in array
pointer_to_array_elem_type p = array + domain;
for (pointer_to_array_type p = array + domain; p - domain > 0 ; ) {
*(--p) = (... whatever ...)
}
This second form takes advantage of pointer (address) arithmetic. I rarely see the form (pointer - int) these days (for good reason), but the language guarantees that when you subtract an int from a pointer, the pointer is decremented by (int * sizeof (*pointer)).
I'll emphasize again that whether these forms are a win for you depends on the CPU and compiler that you're using. They served me well on Motorola 6809 and 68000 architectures.
In some later arm cores, decrement and compare takes only a single instruction. This makes decrementing loops more efficient than incrementing ones.
I don't know why there isn't an increment-compare instruction also.
I'm surprised that this post was voted -1 when it's a true issue.
Everyone here is focusing on performance. There is actually a logical reason to iterate towards zero that can result in cleaner code.
Iterating over the last element first is convenient when you delete invalid elements by swapping with the end of the array. For bad elements not adjacent to the end we can swap into the end position, decrease the end bound of the array, and keep iterating. If you were to iterate toward the end then swapping with the end could result in swapping bad for bad. By iterating end to 0 we know that the element at the end of the array has already been proven valid for this iteration.
For further explanation...
If:
You delete bad elements by swapping with one end of the array and changing the array bounds to exclude the bad elements.
Then obviously:
You would swap with a good element i.e. one that has already been tested in this iteration.
So this implies:
If we iterate away from the variable bound then elements between the variable bound and the current iteration pointer have been proven good. Whether the iteration pointer gets ++ or -- doesn't matter. What matters is that we're iterating away from the variable bound so we know that the elements adjacent to it are good.
So finally:
Iterating towards 0 allows us to use only one variable to represent the array bounds. Whether this matters is a personal decision between you and your compiler.
What matters much more than whether you're increasing or decreasing your counter is whether or not you're going up memory or down memory. Most caches are optimized for going up memory, not down memory. Since memory access time is the bottleneck that most programs today face, this means that changing your program so that you go up memory can result in a performance boost even if this requires comparing your counter to a non-zero value. In some of my programs, I saw a significant improvement in performance by changing my code to go up memory instead of down it.
Skeptical? Here's the output that I got:
sum up = 705046256
sum down = 705046256
Ave. Up Memory = 4839 mus
Ave. Down Memory = 5552 mus
sum up = inf
sum down = inf
Ave. Up Memory = 18638 mus
Ave. Down Memory = 19053 mus
from running this program:
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
template<class Iterator, typename T>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, T a, T b) {
std::random_device rnd_device;
std::mt19937 generator(rnd_device());
std::uniform_int_distribution<T> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class Iterator>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, double a, double b) {
std::random_device rnd_device;
std::mt19937_64 generator(rnd_device());
std::uniform_real_distribution<double> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class RAI, class T>
inline void sum_abs_up(RAI first, RAI one_past_last, T &total) {
T sum = 0;
auto it = first;
do {
sum += *it;
it++;
} while (it != one_past_last);
total += sum;
}
template<class RAI, class T>
inline void sum_abs_down(RAI first, RAI one_past_last, T &total) {
T sum = 0;
auto it = one_past_last;
do {
it--;
sum += *it;
} while (it != first);
total += sum;
}
template<class T> std::chrono::nanoseconds TimeDown(
std::vector<T> &vec, const std::vector<T> &vec_original,
std::size_t num_repititions, T &running_sum) {
std::chrono::nanoseconds total{0};
for (std::size_t i = 0; i < num_repititions; i++) {
auto start_time = std::chrono::high_resolution_clock::now();
sum_abs_down(vec.begin(), vec.end(), running_sum);
total += std::chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
template<class T> std::chrono::nanoseconds TimeUp(
std::vector<T> &vec, const std::vector<T> &vec_original,
std::size_t num_repititions, T &running_sum) {
std::chrono::nanoseconds total{0};
for (std::size_t i = 0; i < num_repititions; i++) {
auto start_time = std::chrono::high_resolution_clock::now();
sum_abs_up(vec.begin(), vec.end(), running_sum);
total += std::chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
int main() {
std::size_t num_repititions = 1 << 10;
{
typedef int ValueType;
auto lower = std::numeric_limits<ValueType>::min();
auto upper = std::numeric_limits<ValueType>::max();
std::vector<ValueType> vec(1 << 24);
FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
const auto vec_original = vec;
ValueType sum_up = 0, sum_down = 0;
auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
std::cout << "sum up = " << sum_up << '\n';
std::cout << "sum down = " << sum_down << '\n';
std::cout << "Ave. Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
std::cout << "Ave. Down Memory = "<< time_down/(num_repititions * 1000) << " mus"
<< std::endl;
}
{
typedef double ValueType;
auto lower = std::numeric_limits<ValueType>::min();
auto upper = std::numeric_limits<ValueType>::max();
std::vector<ValueType> vec(1 << 24);
FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
const auto vec_original = vec;
ValueType sum_up = 0, sum_down = 0;
auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
std::cout << "sum up = " << sum_up << '\n';
std::cout << "sum down = " << sum_down << '\n';
std::cout << "Ave. Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
std::cout << "Ave. Down Memory = "<< time_down/(num_repititions * 1000) << " mus"
<< std::endl;
}
return 0;
}
Both sum_abs_up and sum_abs_down do the same thing and are timed they same way with the only difference being that sum_abs_up goes up memory while sum_abs_down goes down memory. I even pass vec by reference so that both functions access the same memory locations. Nevertheless, sum_abs_up is consistently faster than sum_abs_down. Give it a run yourself (I compiled it with g++ -O3).
FYI vec_original is there for experimentation, to make it easy for me to change sum_abs_up and sum_abs_down in a way that makes them alter vec while not allowing these changes to affect future timings.
It's important to note how tight the loop that I'm timing is. If a loop's body is large then it likely won't matter whether its iterator goes up or down memory since the time it takes to execute the loop's body will likely completely dominate. Also, it's important to mention that with some rare loops, going down memory is sometimes faster than going up it. But even with such loops it's rarely ever the case that going up was always slower than going down (unlike loops that go up memory, which are very often always faster than the equivalent down-memory loops; a small handful of times they were even 40+% faster).
The point is, as a rule of thumb, if you have the option, if the loop's body is small, and if there's little difference between having your loop go up memory instead of down it, then you should go up memory.