I'm not sure whether the following code causes redundant calculations, or whether it's compiler-specific:
for (int i = 0; i < strlen(ss); ++i)
{
// blabla
}
Will strlen() be calculated every time when i increases?
Yes, strlen() will be evaluated on each iteration. It's possible that, under ideal circumstances, the optimiser might be able to deduce that the value won't change, but I personally wouldn't rely on that.
I'd do something like
for (int i = 0, n = strlen(ss); i < n; ++i)
or possibly
for (int i = 0; ss[i]; ++i)
as long as the string isn't going to change length during the iteration. If it might, then you'll need to either call strlen() each time, or handle it through more complicated logic.
Yes, strlen() is evaluated on every iteration of the loop, recalculating the length of the string each time.
so use it like this:
char str[30] = "some string";
for ( int i = 0; str[i] != '\0'; i++)
{
//Something;
}
In the above code, str[i] only checks the single character at position i each time the loop condition runs, so it does less work per iteration and is more efficient.
In the code below, strlen counts the length of the whole string every time the loop condition is evaluated, which is less efficient and takes more time.
char str[30];
for ( int i = 0; i < strlen(str); i++)
{
//Something;
}
A good compiler may not calculate it every time, but I don't think you can be sure that every compiler does so.
In addition to that, the compiler has to know that strlen(ss) does not change. This is only true if ss is not changed in the for loop.
For example, if you call a read-only function on ss in the for loop but don't declare the ss parameter as const, the compiler cannot know that ss is unchanged in the loop and has to calculate strlen(ss) in every iteration.
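Concretely, a minimal sketch of that situation (the helper name inspect is hypothetical):
#include <string.h>

void inspect(char *s);   /* parameter is not const: for all the compiler knows, it may modify the string */

void scan(char *ss)
{
    for (size_t i = 0; i < strlen(ss); ++i)   /* strlen must be re-evaluated every iteration */
        inspect(ss);
}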
If ss is of type const char * and you're not casting away the constness within the loop, the compiler might call strlen only once, if optimizations are turned on. But this is certainly not behavior that can be counted upon.
You should save the strlen result in a variable and use this variable in the loop. If you don't want to create an additional variable, depending on what you're doing, you may be able to get away with reversing the loop to iterate backwards.
for( auto i = strlen(s); i > 0; --i ) {
// do whatever
// remember the value of s[strlen(s)] is the terminating null character
}
Formally yes, strlen() is expected to be called for every iteration.
Anyway, I do not want to rule out the possibility of the existence of some clever compiler optimisation that will optimise away any successive call to strlen() after the first one.
The predicate code in its entirety will be executed on every iteration of the for loop. In order to memoize the result of the strlen(ss) call, the compiler would need to know at least that:
The function strlen is side-effect free
The memory pointed to by ss doesn't change for the duration of the loop
The compiler doesn't know either of these things and hence can't safely memoize the result of the first call
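The standard workaround is therefore to hoist the call manually; a minimal sketch:
#include <string.h>

void process(const char *ss)
{
    const size_t n = strlen(ss);   /* computed once, before the loop */
    for (size_t i = 0; i < n; ++i) {
        /* body must not change the length of ss */
    }
}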
Yes, strlen will be calculated every time i increases.
If you don't change ss within the loop, this won't affect the logic; otherwise it will.
It is safer to use the following code:
int length = strlen(ss);
for ( int i = 0; i < length ; ++ i )
{
// blabla
}
Yes, strlen(ss) will recalculate the length at each iteration. If you are also growing ss in some way while increasing i, you could end up with an infinite loop.
Yes, the strlen() function is called every time the loop is evaluated.
If you want to improve efficiency, remember to save such results in local variables; it takes a little extra code, but it's very useful.
You can use code like this:
const char* str = "ss";
int l = strlen(str);
for ( int i = 0; i < l ; i++ )
{
// blablabla
}
Yes, strlen(ss) will be calculated every time the code runs.
Not common nowadays but 20 years ago on 16 bit platforms, I'd recommend this:
for ( char* p = str; *p; p++ ) { /* ... */ }
Even if your compiler isn't very smart about optimization, the above code can still result in good assembly code.
Yes. The test doesn't know that ss doesn't get changed inside the loop. If you know that it won't change then I would write:
int stringLength = strlen (ss);
for ( int i = 0; i < stringLength; ++ i )
{
// blabla
}
Arrgh, it will, even under ideal circumstances, dammit!
As of today (January 2018), and gcc 7.3 and clang 5.0, if you compile:
#include <string.h>
void bar(char c);
void foo(const char* __restrict__ ss)
{
for (int i = 0; i < strlen(ss); ++i)
{
bar(*ss);
}
}
So, we have:
ss is a constant pointer.
ss is marked __restrict__
The loop body cannot in any way touch the memory pointed to by ss (well, unless it violates the __restrict__).
and still, both compilers execute strlen() every single iteration of that loop. Amazing.
This also means the allusions/wishful thinking of @Praetorian and @JaredPar don't pan out.
YES, in simple words.
And there is a small "no" in the rare case where the compiler chooses, as an optimization step, to hoist the call because it can prove that no changes are made to ss at all. But to be safe you should treat it as YES. In some situations, such as multithreaded or event-driven programs, assuming NO may introduce bugs.
Play it safe, as caching the length is not going to increase the program's complexity much.
Yes.
strlen() is calculated every time i increases and is not optimized away.
The code below shows why the compiler cannot always optimize strlen():
for ( int i = 0; i < strlen(ss); ++i )
{
// Change ss string.
ss[i] = 'a'; // Compiler should not optimize strlen().
}
We can easily test it:
char nums[] = "0123456789";
size_t end;
int i;
for( i=0, end=strlen(nums); i<strlen(nums); i++ ) {
putchar( nums[i] );
nums[--end] = 0;
}
The loop condition is evaluated before every repetition of the loop body.
Also be careful about the type you use to handle the length of strings: it should be size_t, which is an unsigned type (defined in <stddef.h> and other standard headers). Comparing it with, or casting it to, int might cause serious vulnerability issues.
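A classic illustration of that pitfall (hypothetical example): with an empty string, strlen(s) - 1 wraps around to SIZE_MAX and the loop runs far past the end of the buffer.
#include <string.h>

void bad(const char *s)
{
    /* UNSAFE: if s is empty, strlen(s) - 1 underflows to SIZE_MAX */
    for (size_t i = 0; i <= strlen(s) - 1; ++i) {
        /* ... reads beyond the buffer ... */
    }
}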
Well, I noticed that some people say it is optimized by default by any "clever" modern compiler. Look at the results without optimization, though. I tried this minimal C code:
#include <stdio.h>
#include <string.h>
int main()
{
char *s="aaaa";
for (int i=0; i<strlen(s);i++)
printf ("a");
return 0;
}
My compiler: g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Command for generation of assembly code: g++ -S -masm=intel test.cpp
Gotten assembly code at the output:
...
.L3:
mov DWORD PTR [esp], 97
call putchar
add DWORD PTR [esp+40], 1
.L2:
; THIS LOOP IS HERE -- the inlined strlen (repnz scasb):
mov ebx, DWORD PTR [esp+40]
mov eax, DWORD PTR [esp+44]
mov DWORD PTR [esp+28], -1
mov edx, eax
mov eax, 0
mov ecx, DWORD PTR [esp+28]
mov edi, edx
repnz scasb
; as you can see, it is executed on every iteration
mov eax, ecx
not eax
sub eax, 1
cmp ebx, eax
setb al
test al, al
jne .L3
mov eax, 0
...
Elaborating on Prætorian's answer, I recommend the following:
for( auto i = strlen(s); i > 0; --i ) { foo(s[i-1]); }
auto because you don't want to care about which type strlen returns. A C++11 compiler (e.g. gcc -std=c++0x, not completely C++11 but auto types work) will do that for you.
i = strlen(s) because you want to compare to 0 (see below)
i > 0 because comparison to 0 is (slightly) faster than comparison to any other number.
The disadvantage is that you have to use i-1 in order to access the string characters.
When I run the following program (with optimization on), the for loop with the std::vector takes about 0.04 seconds while the for loop with the array takes 0.0001 seconds.
#include <iostream>
#include <vector>
#include <chrono>
int main()
{
int len = 800000;
int* Data = new int[len];
int arr[3] = { 255, 0, 0 };
std::vector<int> vec = { 255, 0, 0 };
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < len; i++) {
Data[i] = vec[0];
}
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "The vector took " << elapsed.count() << "seconds\n";
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < len; i++) {
Data[i] = arr[0];
}
finish = std::chrono::high_resolution_clock::now();
elapsed = finish - start;
std::cout << "The array took " << elapsed.count() << "seconds \n";
char s;
std::cin >> s;
delete[] Data;
}
The code is a simplified version of a performance issue I was having while writing a raycaster. The len variable corresponds to how many times the loop in the original program needs to run (400 pixels * 400 pixels * 50 maximum render distance). For complicated reasons (perhaps that I don't fully understand how to use arrays) I have to use a vector rather than an array in the actual raycaster.
However, as this program demonstrates, that would only give me 20 frames per second as opposed to the envied 10,000 frames per second that using an array would supposedly give me (obviously, this is just a simplified performance test). But regardless of how accurate those numbers are, I still want to boost my frame rate as much as possible. So, why is the vector performing so much slower than the array? Is there a way to speed it up? Thanks for your help.
And if there's anything else I'm doing weirdly that might be affecting performance, please let me know. I didn't even know about optimization until researching this question, so if there are any other settings like that which might boost performance, please let me know (I'd prefer an explanation of where those settings are in the properties manager rather than the command line, since I don't yet know how to use the command line).
Let us observe how GCC optimizes this test program:
#include <vector>
int main()
{
int len = 800000;
int* Data = new int[len];
int arr[3] = { 255, 0, 0 };
std::vector<int> vec = { 255, 0, 0 };
for (int i = 0; i < len; i++) {
Data[i] = vec[0];
}
for (int i = 0; i < len; i++) {
Data[i] = arr[0];
}
delete[] Data;
}
The compiler rightly notices that the vector is constant and eliminates it. Exactly the same code is generated for both loops. It should therefore be irrelevant whether the first loop uses the array or the vector.
.L2:
movups XMMWORD PTR [rcx], xmm0
add rcx, 16
cmp rsi, rcx
jne .L2
What makes the difference in your test program is the order of the loops. The comments point out that when a third loop is added to the beginning, both loops take the same time.
I would expect that with a modern compiler accessing a vector would be approximately as fast as accessing an array, when optimization is enabled and debug is disabled. If there is an observable difference in your actual program, the problem lies somewhere else.
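A minimal sketch of the warm-up fix mentioned above (assuming, without verifying, that the gap comes from first-touch overhead on the freshly allocated Data rather than from the vector itself):
// Warm-up pass: touch every element of Data once before either timed loop,
// so neither timed loop pays the one-time page-fault cost.
for (int i = 0; i < len; i++) {
    Data[i] = 0;
}
// ... then run the two timed loops exactly as before.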
It is about caches. I don't know exactly how it works in detail, but Data[] becomes better known to the CPU (cached) as it is used. If you reverse the order of the loops, you can see that "the vector is faster".
But actually, you are testing neither the vector nor the array. Let's assume that vec[0] resides at memory location 0x01 and arr[0] resides at 0xf1. The only difference is reading a word from two different single memory addresses. So you are really testing how fast you can assign a value to the elements of a dynamically allocated array.
Note: std::chrono::high_resolution_clock might not be appropriate for measuring intervals. It is better to use steady_clock, as cppreference says.
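For reference, a minimal steady_clock timing sketch:
#include <chrono>
#include <iostream>

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    // ... work to be measured ...
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = t1 - t0;
    std::cout << elapsed.count() << " seconds\n";
}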
I am writing a C++ algorithm that takes two strings and returns true if you can mutate from string a to string b by changing a single character to another.
The two strings must equal in size, and can only have one difference.
I also need to have access to the index that changed, and the character of strA that was altered.
I found a working algorithm, but it iterates through every single pair of words and is running way too slow on any large amount of input.
bool canChange(std::string const& strA, std::string const& strB, char& letter)
{
int dif = 0;
int position = 0;
int currentSize = (int)strA.size();
if(currentSize != (int)strB.size())
{
return false;
}
for(int i = 0; i < currentSize; ++i)
{
if(strA[i] != strB[i])
{
dif++;
position = i;
if(dif > 1)
{
return false;
}
}
}
if(dif == 1)
{
letter = strA[position];
return true;
}
else return false;
}
Any advice on optimization?
It's a bit hard to get away from examining all the characters in the strings, unless you can accept the occasional incorrect result.
I suggest using features of the standard library, and not trying to count the number of mismatches. For example:
#include <string>
#include <algorithm>
bool canChange(std::string const& strA, std::string const& strB, char& letter, std::size_t &index)
{
bool single_mismatch = false;
if (strA.size() == strB.size())
{
typedef std::string::const_iterator ci;
typedef std::pair<ci, ci> mismatch_result;
ci begA(strA.begin()), endA(strA.end());
mismatch_result result = std::mismatch(begA, endA, strB.begin());
if (result.first != endA) // found a mismatch
{
letter = *(result.first);
index = std::distance(begA, result.first);
// now look for a second mismatch
std::advance(result.first, 1);
std::advance(result.second, 1);
single_mismatch = (std::mismatch(result.first, endA, result.second).first == endA);
}
}
return single_mismatch;
}
This works for all versions of C++. It can be simplified a little in C++11.
If the above returns true, then a single mismatch was found.
If the return value is false, then either the strings are different sizes, or the number of mismatches is not equal to 1 (either the strings are equal, or have more than one mismatch).
letter and index are unchanged if the strings are of different lengths or are exactly equal, but otherwise identify the first mismatch (value of the character in strA, and index).
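A hypothetical usage example:
char letter;
std::size_t index;
if (canChange("cat", "cot", letter, index)) {
    // letter == 'a', index == 1
}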
If you want to optimize for mostly-identical strings, you could use x86 SSE/AVX vector instructions. Your basic idea looks fine: break as soon as you detect a second difference.
To find and count character differences, a sequence like PCMPEQB / PMOVMSKB / test-and-branch is probably good. (Use C/C++ intrinsic functions to get those vector instructions). When your vector loop detects non-zero differences in the current block, POPCNT the bitmask to see if you just found the first difference, or if you found two differences in the same block.
I threw together an untested and not-fully-fleshed out AVX2 version of what I'm describing. This code assumes string lengths are a multiple of 32. Stopping early and handling the last chunk with a cleanup epilogue is left as an exercise for the reader.
#include <immintrin.h>
#include <string>
// not tested, and doesn't avoid reading past the end of the string.
// TODO: epilogue to handle the last up-to-31 left-over bytes separately.
bool canChange_avx2_bmi(std::string const& strA, std::string const& strB, char& letter) {
size_t size = strA.size();
if (size != strB.size())
return false;
int diffs = 0;
size_t diffpos = 0;
size_t pos = 0;
do {
uint32_t diffmask = 0;
while( pos < size ) {
__m256i vecA = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strA[pos]));
__m256i vecB = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strB[pos]));
__m256i vdiff = _mm256_cmpeq_epi8(vecA, vecB);
diffmask = ~static_cast<uint32_t>(_mm256_movemask_epi8(vdiff)); // cmpeq sets bits for EQUAL bytes, so invert to get a mask of differing bytes
pos += 32;
if (diffmask) break; // gcc makes worse code if you include && !diffmask in the while condition, instead of this break
}
if (diffmask) {
diffpos = (pos - 32) + _tzcnt_u32(diffmask); // position of the lowest set bit; pos was already advanced past this block. Could safely use BSF rather than TZCNT here, since we only run when diffmask is non-zero.
diffs += _mm_popcnt_u32(diffmask);
}
} while(pos < size && diffs <= 1);
if (diffs == 1) {
letter = strA[diffpos];
return true;
}
return false;
}
The ugly break instead of including that in the while condition apparently helps gcc generate better code. The do{}while() also matches up with how I want the asm to come out. I didn't try using a for or while loop to see what gcc would do.
The inner loop is really tight this way:
.L14:
cmp rcx, r8
jnb .L10 # the while(pos<size) condition
.L6: # entry point for first iteration, because gcc duplicates the pos<size test ahead of the loop
vmovdqu ymm0, YMMWORD PTR [r9+rcx] # tmp118,* pos
vpcmpeqb ymm0, ymm0, YMMWORD PTR [r10+rcx] # tmp123, tmp118,* pos
add rcx, 32 # pos,
vpmovmskb eax, ymm0 # tmp121, tmp123
test eax, eax # tmp121
je .L14 #,
In theory, this should run at one iteration per 2 clocks (Intel Haswell). There are 7 fused-domain uops in the loop. (Would be 6, but 2-reg addressing modes apparently can't micro-fuse on SnB-family CPUs.) Since two of the uops are loads, not ALU, this throughput might be achievable on SnB/IvB as well.
This should be exceptionally good for flying over regions where the two strings are identical. The overhead of correctly handling arbitrary string lengths will make this potentially slower than a simple scalar function if strings are short, and/or have multiple differences early on.
How big is your input?
I'd think that strA[i] and strB[i] have function call overhead unless they're inlined. So make sure you do your performance test with inlining turned on and compile in release mode. Otherwise, try getting the bytes as a char* with strA.c_str().
If all that fails and it's still not fast enough, try breaking your string into chunks and using memcmp or strncmp on the chunks. If there is no difference, move to the next chunk until you reach the end or find a difference. If a difference is found, do the trivial byte-by-byte compare until you locate it. I suggest this route because memcmp is often faster than trivial implementations, as it can make use of the processor's SSE extensions and so forth to do very fast compares.
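A rough sketch of that chunked idea (the chunk size of 64 and the function name are arbitrary choices, not a tested implementation):
#include <algorithm>
#include <cstring>
#include <string>

bool differsExactlyOnce(const std::string& a, const std::string& b)
{
    const std::size_t kChunk = 64;
    if (a.size() != b.size()) return false;
    int diffs = 0;
    for (std::size_t pos = 0; pos < a.size(); pos += kChunk) {
        std::size_t n = std::min(kChunk, a.size() - pos);
        if (std::memcmp(a.data() + pos, b.data() + pos, n) != 0) {
            // a difference exists somewhere in this chunk: locate it byte by byte
            for (std::size_t i = pos; i < pos + n; ++i)
                if (a[i] != b[i] && ++diffs > 1) return false;
        }
    }
    return diffs == 1;
}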
Also, there is a problem with your code. You're assuming strA is longer than strB and only checking the length of A for the array accessors.
Benchmarking this class:
struct Sieve {
std::vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (int i = 2; i <= (int)sqrt((double)n); ++i)
if (isPrime[i])
for (int j = i*i; j <= n; j += i)
isPrime[j] = false;
}
};
I'm getting over 3 times worse performance (CPU time) with 64-bit binary vs. 32-bit version (release build) when calling a constructor for a large number, e.g.
Sieve s(100000000);
I tested sizeof(bool) and it is 1 for both versions.
When I substitute vector<bool> with vector<char> performance becomes the same for 64-bit and 32-bit versions. Why is that?
Here are the run times for S(100000000) in release mode:

              32-bit   64-bit
vector<bool>   0.97s    3.12s
vector<char>   0.99s    0.99s
vector<int>    1.57s    1.59s
I also did a sanity test with VS2010 (prompted by Wouter Huysentruit's response), which produced 0.98s 0.88s. So there is something wrong with VS2012 implementation.
I submitted a bug report to Microsoft Connect
EDIT
Many answers below comment on deficiencies of using int for indexing. This may be true, but even the Great Wizard himself is using a standard for (int i = 0; i < v.size(); ++i) in his books, so such a pattern should not incur a significant performance penalty. Additionally, this issue was raised during Going Native 2013 conference and the presiding group of C++ gurus commented on their early recommendations of using size_t for indexing and as a return type of size() as a mistake. They said: "we are sorry, we were young..."
The title of this question could be rephrased to: Over 3 times performance drop on this code when upgrading from VS2010 to VS2012.
EDIT
I made a crude attempt at finding memory alignment of indexes i and j and discovered that this instrumented version:
struct Sieve {
vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (int i = 2; i <= sqrt((double)n); ++i) {
if (i == 17) cout << ((int)&i)%16 << endl;
if (isPrime[i])
for (int j = i*i; j <= n; j += i) {
if (j == 4) cout << ((int)&j)%16 << endl;
isPrime[j] = false;
}
}
}
};
auto-magically runs fast now (only 10% slower than the 32-bit version). This, together with the VS2010 performance, makes it hard to accept the theory that the optimizer has inherent problems dealing with int indexes instead of size_t.
std::vector<bool> is not directly at fault here. The performance difference is ultimately caused by your use of the signed 32-bit int type in your loops and some rather poor register allocation by the compiler. Consider, for example, your innermost loop:
for (int j = i*i; j <= n; j += i)
isPrime[j] = false;
Here, j is a 32-bit signed integer. When it is used in isPrime[j], however, it must be promoted (and sign-extended) to a 64-bit integer, in order to perform the subscript computation. The compiler can't just treat j as a 64-bit value, because that would change the behavior of the loop (e.g. if n is negative). The compiler also can't perform the index computation using the 32-bit quantity j, because that would change the behavior of that expression (e.g. if j is negative).
So, the compiler needs to generate code for the loop using a 32-bit j then it must generate code to convert that j to a 64-bit integer for the subscript computation. It has to do the same thing for the outer loop with i. Unfortunately, it looks like the compiler allocates registers rather poorly in this case(*)--it starts spilling temporaries to the stack, causing the performance hit you see.
If you change your program to use size_t everywhere (which is 32-bit on x86 and 64-bit on x64), you will observe that the performance is on par with x86, because the generated code needs only work with values of one size:
Sieve (size_t n = 1)
{
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (size_t i = 2; i <= static_cast<size_t>(sqrt((double)n)); ++i)
if (isPrime[i])
for (size_t j = i*i; j <= n; j += i)
isPrime[j] = false;
}
You should do this anyway, because mixing signed and unsigned types, especially when those types are of different widths, is perilous and can lead to unexpected errors.
Note that using std::vector<char> also "solves" the problem, but for a different reason: the subscript computation required for accessing an element of a std::vector<char> is substantially simpler than that for accessing an element of std::vector<bool>. The optimizer is able to generate better code for the simpler computations.
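Roughly, the difference looks like this (a simplified sketch, not the actual library code):
#include <cstddef>

// vector<bool> packs bits: element access needs a word lookup plus a shift and a mask
bool get_bit(const unsigned* words, std::size_t j)
{
    return (words[j / 32] >> (j % 32)) & 1u;
}

// vector<char> element access is a single byte load
char get_byte(const char* bytes, std::size_t j)
{
    return bytes[j];
}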
(*) I don't work on code generation, and I'm hardly an expert in either assembly or low-level performance optimization, but from looking at the generated code, and given that it is reported here that Visual C++ 2010 generates better code, I'd guess that there are opportunities for improvement in the compiler. I'll make sure the Connect bug you opened gets forwarded on to the compiler team so they can take a look.
I've tested this with vector<bool> in VS2010: 32-bit needs 1452ms while 64-bit needs 1264ms to complete on an i3.
The same test in VS2012 (on an i7 this time) needs 700ms (32-bit) and 2730ms (64-bit), so there is something wrong with the compiler in VS2012. Maybe you can report this test case as a bug to Microsoft.
UPDATE
The problem is that the VS2012 compiler uses a temporary stack variable for a part of the code in the inner for-loop when using int as iterator. The assembly parts listed below are part of the code inside <vector>, in the += operator of the std::vector<bool>::iterator.
size_t as iterator
When using size_t as iterator, a part of the code looks like this:
or rax, -1
sub rax, rdx
shr rax, 5
lea rax, QWORD PTR [rax*4+4]
sub r8, rax
Here, all instructions use CPU registers which are very fast.
int as iterator
When using int as iterator, that same part looks like this:
or rcx, -1
sub rcx, r8
shr rcx, 5
shl rcx, 2
mov rax, -4
sub rax, rcx
mov rdx, QWORD PTR _Tmp$6[rsp]
add rdx, rax
Here you see the _Tmp$6 stack variable being used, which causes the slowdown.
Point the compiler in the right direction
The funny part is that you can point the compiler in the right direction by using the vector<bool>::iterator directly.
struct Sieve {
std::vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign(n + 1, true);
std::vector<bool>::iterator it1 = isPrime.begin();
std::vector<bool>::iterator end = it1 + n;
*it1++ = false;
*it1++ = false;
for (int i = 2; i <= (int)sqrt((double)n); ++it1, ++i)
if (*it1)
for (std::vector<bool>::iterator it2 = isPrime.begin() + i * i; it2 <= end; it2 += i)
*it2 = false;
}
};
vector<bool> is a very special container that's specialized to use 1 bit per item rather than providing normal container semantics. I suspect that the bit manipulation logic is much more expensive when compiling 64 bits (either it still uses 32 bit chunks to hold the bits or some other reason). vector<char> behaves just like a normal vector so there's no special logic.
You could also use deque<bool> which doesn't have the specialization.
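For example, the change is essentially one line (sketch; the rest of the class stays as above):
#include <deque>

struct Sieve {
    std::deque<bool> isPrime;   // plain bools: no bit-packing specialization
    Sieve (int n = 1) {
        isPrime.assign(n + 1, true);
        // ... sieve loops unchanged ...
    }
};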
For example, assuming a string s, is this:
for(int x = 0; x < s.length(); x++)
better than this?:
int length = s.length();
for(int x = 0; x < length; x++)
Thanks,
Joel
In general, you should avoid function calls in the condition part of a loop, if the result does not change during the iteration.
The canonical form is therefore:
for (std::size_t x = 0, length = s.length(); x != length; ++x) { /* ... */ }
Note 3 things here:
The initialization can initialize more than one variable
The condition is expressed with != rather than <
I use pre-increment rather than post-increment
(I also changed the type, because a negative length is nonsense and the string interface is defined in terms of std::string::size_type, which is normally std::size_t on most implementations.)
Though... I admit that it's not so much for performance as for readability:
The double initialization means that the scope of both x and length is as tight as necessary
By memoizing the result, the reader is not left in doubt as to whether or not the length may vary during iteration
Using pre-increment is usually better when you do not need to create a temporary with the "old" value
In short: use the best tool for the job at hand :)
It depends on the inlining and optimization abilities of the compiler. Generally, the second variant will most likely be faster (better: it will be either faster or as fast as the first snippet, but almost never slower).
However, in most cases it doesn't matter, so people tend to prefer the first variant for its shortness.
It depends on your C++ implementation / library, the only way to be sure is to benchmark it. However, it's effectively certain that the second version will never be slower than the first, so if you don't modify the string within the loop it's a sensible optimisation to make.
How efficient do you want to be?
If you don't modify the string inside the loop, the compiler will easily see than the size doesn't change. Don't make it any more complicated than you have to!
Although I am not necessarily encouraging you to do so, it appears it is surprisingly faster to call .length() every time than to store the length in an int (at least on my computer; keep in mind that I'm using an MSI gaming laptop with a 4th-gen i5, though that shouldn't really affect which way is faster).
Test code for constant call:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string g = "01234567890";
for(unsigned int rep = 0; rep < 25; rep++)
{
g += g;
}//for loop used to double the length 25 times.
int a = 0;
//int b = g.length();
for(unsigned int rep = 0; rep < g.length(); rep++)
{
a++;
}
return a;
}
On average, this ran for 385ms according to Code::Blocks
And here's the code that stores the length in a variable:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string g = "01234567890";
for(unsigned int rep = 0; rep < 25; rep++)
{
g += g;
}//for loop used to double the length 25 times.
int a = 0;
int b = g.length();
for(unsigned int rep = 0; rep < b; rep++)
{
a++;
}
return a;
}
And this averaged around 420ms.
I know this question already has an accepted answer, but there haven't been any practically tested answers, so I decided to throw my 2 cents in. I had the same question as you, but I didn't find any helpful answers here, so I ran my own experiment.
Is s.length() inlined, and does it just return a member variable? If so, then no. Otherwise you incur the cost of dereferencing and pushing things onto the stack; you know, all the overheads of a function call, on each iteration.
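For what it's worth, typical implementations store the size, so length() boils down to an inline getter; a simplified sketch (not any particular library's code):
#include <cstddef>

class my_string {
    std::size_t size_;
public:
    std::size_t length() const { return size_; }   // trivially inlined: just a member load
};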
Just calculating the sum of two arrays, with a slight modification in the code:
#include <stdio.h>

int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
OR
#include <stdio.h>

int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
}
for(int i=0; i<10000; i++)
{
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
Which code will be faster?
There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (cpu, memory bandwidth etc).
The size of the data in your array (ints in your example) would affect the result, as it has an impact on the use of the processor cache. Often you will find example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 accesses memory in a more efficient way).
Here's some code with timing, built using VS2005:
#include <windows.h>
#include <iostream>
using namespace std;
int main ()
{
LARGE_INTEGER
start,
middle,
end;
const int
count = 1000000;
int
*a = new int [count],
*b = new int [count],
*c = new int [count],
*d = new int [count],
suma = 0,
sumb = 0,
sumc = 0,
sumd = 0;
QueryPerformanceCounter (&start);
for (int i = 0 ; i < count ; ++i)
{
suma += a [i];
sumb += b [i];
}
QueryPerformanceCounter (&middle);
for (int i = 0 ; i < count ; ++i)
{
sumc += c [i];
}
for (int i = 0 ; i < count ; ++i)
{
sumd += d [i];
}
QueryPerformanceCounter (&end);
cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
cout << "Done." << endl << suma << sumb << sumc << sumd;
return 0;
}
Running this, the latter version is usually faster.
I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:
00401110 mov edx,dword ptr [eax-0Ch]
00401113 add edx,dword ptr [eax-8]
00401116 add eax,14h
00401119 add edx,dword ptr [eax-18h]
0040111C add edx,dword ptr [eax-10h]
0040111F add edx,dword ptr [eax-14h]
00401122 add ebx,edx
00401124 sub ecx,1
00401127 jne main+110h (401110h)
Here's the register usage:
eax = used to index the array
ebx = the grand total
ecx = loop counter
edx = sum of the five integers accessed in one iteration of the loop
There are a few interesting things here:
The compiler has unrolled the loop five times.
The order of memory access is not contiguous.
It updates the array index in the middle of the loop.
It sums five integers then adds that to the grand total.
To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.
In theory, due to cache optimizations the second one should be faster.
Caches are optimized to fetch and keep chunks of data, so that on first access you'll get a big chunk of the first array into the cache. In the first code, accessing the second array may evict some of the first array's data from the cache, therefore requiring more memory accesses.
In practice both approaches will take more or less the same time, with the first a little better given the size of actual caches and the likelihood that no data is evicted from the cache at all.
Note: This sounds a lot like homework. In real life, for these sizes, the first option will be slightly faster, but this only applies to this concrete example; nested loops, bigger arrays or, especially, smaller cache sizes would have a significant impact on performance depending on the order.
The first one will be faster. The compiler will not need to repeat the loop twice. Although it's not much work, some cycles are lost on incrementing the loop variable and performing the condition check.
For me (GCC -O3) measuring shows that the second version is faster by some 25%, which can be explained with more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course you'll need to repeat the operation thousands of times before the difference becomes significant.
I also tried std::accumulate from the numeric header which is the simple way to implement the second version and was in turn a tiny amount faster than the second version (probably due to more compiler-friendly looping mechanism?):
sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);
The first one will be faster because you loop from 0 to 10000 only one time.
The C++ Standard says nothing about it; it is implementation dependent. It looks like you are trying to do premature optimization. It shouldn't bother you unless it is a bottleneck in your program. If it is, you should use a profiler to find out which one is faster on your platform.
Until then, I'd prefer the first variant because it looks more readable (or, better, std::accumulate).
If the data is large enough that the cache cannot hold both arrays' working sets at once (as in example 1) but can hold a single one (as in example 2), then the code of the first example will be slower than the code of the second example.
Otherwise the code of the first example will be faster than the second one.
The first one will probably be faster. The memory access pattern will allow the (modern) CPU to manage the caches efficiently (prefetch), even while accessing two arrays.
Much faster if your CPU allows it and the arrays are aligned: use SSE3 instructions to process 4 int at a time.
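A sketch of that 4-ints-at-a-time idea using SSE2 intrinsics (SSE2 already suffices for packed 32-bit adds; assumes n is a multiple of 4):
#include <emmintrin.h>

int sum4(const int* a, int n)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4)   // process 4 ints per iteration
        acc = _mm_add_epi32(acc,
                  _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i)));
    int tmp[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(tmp), acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];   // horizontal sum of the four lanes
}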
If you meant a[i] instead of a[10000] (and for b, respectively) and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.
If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).
Food for thought for some answers that were voted up: how many additions are performed in each version of the code?