Are C++ enums slower to use than integers? - c++

It's really a simple problem:
I'm programming a Go game. Should I represent the board with a QVector<int> or a QVector<Player>, where
enum Player
{
    EMPTY = 0,
    BLACK = 1,
    WHITE = 2
};
I assume that using Player instead of plain integers will be slower, but I wonder by how much, because I believe that using an enum is better coding.
I've done a few tests regarding assigning and comparing Players (as opposed to ints):
QVector<int> vec;
vec.resize(10000000);
int size = vec.size();
for (int i = 0; i < size; ++i)
{
    vec[i] = 0;
}
for (int i = 0; i < size; ++i)
{
    bool b = (vec[i] == 1);
}

QVector<Player> vec2;
vec2.resize(10000000);
size = vec2.size(); // reuse size: redeclaring it would not compile
for (int i = 0; i < size; ++i)
{
    vec2[i] = EMPTY;
}
for (int i = 0; i < size; ++i)
{
    bool b = (vec2[i] == BLACK);
}
Basically, the enum version is only about 10% slower. Is there anything else I should know before continuing?
Thanks!
Edit: the 10% difference is not a figment of my imagination; it seems to be specific to Qt and QVector. When I use std::vector, the speed is the same.

Enums are completely resolved at compile time (enum constants become integer literals and enum variables become integer variables), so there's no speed penalty in using them.
In general an enumeration won't have an underlying type bigger than int (unless you use very big constants); in fact, §7.2 ¶5 of the standard explicitly says:
The underlying type of an enumeration is an integral type that can represent all the enumerator values defined in the enumeration. It is implementation-defined which integral type is used as the underlying type for an enumeration except that the underlying type shall not be larger than int unless the value of an enumerator cannot fit in an int or unsigned int.
You should use enumerations when it's appropriate because they usually make the code easier to read and to maintain (have you ever tried to debug a program full of "magic numbers"? :S).
As for your results: probably your test methodology doesn't take into account the normal speed fluctuations you get when you run code on "normal" machines [1]; have you tried running the test many (100+) times and calculating the mean and standard deviation of your times? The results should be compatible: the difference between the means shouldn't be bigger than 1 or 2 times the RSS [2] of the two standard deviations (assuming, as usual, a Gaussian distribution for the fluctuations).
Another check you could do is to compare the generated assembly code (with g++ you can get it with the -S switch).
[1] On "normal" PCs you get nondeterministic fluctuations because of other tasks running, cache/RAM/VM state, ...
[2] Root Sum Squared: the square root of the sum of the squared standard deviations.

In general, using an enum should make absolutely no difference to performance. How did you test this?
I just ran tests myself. The differences are pure noise.
Just now, I compiled both versions to assembler. Here's the main function from each:
int version:
LFB1778:
pushl %ebp
LCFI11:
movl %esp, %ebp
LCFI12:
subl $8, %esp
LCFI13:
movl $65535, %edx
movl $1, %eax
call __Z41__static_initialization_and_destruction_0ii
leave
ret
Player version:
LFB1774:
pushl %ebp
LCFI10:
movl %esp, %ebp
LCFI11:
subl $8, %esp
LCFI12:
movl $65535, %edx
movl $1, %eax
call __Z41__static_initialization_and_destruction_0ii
leave
ret
It's hazardous to base any statement regarding performance on micro-benchmarks. There are too many extraneous factors skewing the data.

Enums should be no slower. They're implemented as integers.

If you use Visual Studio, for example, you can create a simple project where you have
a=Player::EMPTY;
and if you right click "go to disassembly" the code will be
mov dword ptr [a],0
So the compiler substitutes the enum constant's value directly, and normally it will not generate any overhead.

Well, I did a few tests and there wasn't much difference between the integer and enum forms. I also added a char form which was consistently about 6% quicker (which isn't surprising as it is using less memory). Then I just used a char array rather than a vector and that was 300% faster! Since we've not been given what QVector is, it could be a wrapper for an array rather than the std::vector I've used.
Here's the code I used, compiled using standard release options in Dev Studio 2005. Note that I've changed the timed loop a small amount as the code in the question could be optimised to nothing (you'd have to check the assembly code).
#include <windows.h>
#include <vector>
#include <iostream>
using namespace std;
enum Player
{
    EMPTY = 0,
    BLACK = 1,
    WHITE = 2
};

template <class T, T search>
LONGLONG TimeFunction ()
{
    vector <T> vec;
    vec.resize (10000000);

    size_t size = vec.size ();
    for (size_t i = 0 ; i < size ; ++i)
    {
        vec [i] = static_cast <T> (rand () % 3);
    }

    LARGE_INTEGER start, end;
    QueryPerformanceCounter (&start);
    for (size_t i = 0 ; i < size ; ++i)
    {
        if (vec [i] == search)
        {
            break;
        }
    }
    QueryPerformanceCounter (&end);

    return end.QuadPart - start.QuadPart;
}

LONGLONG TimeArrayFunction ()
{
    size_t size = 10000000;
    char *vec = new char [size];

    for (size_t i = 0 ; i < size ; ++i)
    {
        vec [i] = static_cast <char> (rand () % 3);
    }

    LARGE_INTEGER start, end;
    QueryPerformanceCounter (&start);
    for (size_t i = 0 ; i < size ; ++i)
    {
        if (vec [i] == 10)
        {
            break;
        }
    }
    QueryPerformanceCounter (&end);

    delete [] vec;
    return end.QuadPart - start.QuadPart;
}

int main ()
{
    cout << "   Char form = " << TimeFunction <char, 10> () << endl;
    cout << "Integer form = " << TimeFunction <int, 10> () << endl;
    cout << " Player form = " << TimeFunction <Player, static_cast <Player> (10)> () << endl;
    cout << "  Array form = " << TimeArrayFunction () << endl;
}

The compiler should convert enum into integers. They get inlined at compile time, so once your program is compiled, it's supposed to be exactly the same as if you used the integers themselves.
If your testing produces different results, there could be something going on with the test itself. Either that, or your compiler is behaving oddly.

This is implementation dependent, and it is quite possible for enums and ints to have different performance with either the same or different assembly code, although it is probably a sign of a suboptimal compiler. Some ways to get differences are:
QVector may be specialized on your enum type to do something surprising.
enum doesn't get compiled to int but to "some integral type no larger than int". QVector of int may be specialized differently from QVector of some_integral_type.
even if QVector isn't specialized, the compiler may do a better job of aligning ints in memory than of aligning some_integral_type, leading to a greater cache miss rate when you loop over the vector of enums or of some_integral_type.

Related

Why do global arrays not consume memory in read-only mode?

The following code declares a global array (256 MiB) and calculates the sum of its items.
This program consumes 188 KiB when running:
#include <cstdlib>
#include <iostream>
using namespace std;

const unsigned int buffer_len = 256 * 1024 * 1024; // 256 MiB
unsigned char buffer[buffer_len];

int main()
{
    while (true)
    {
        int sum = 0;
        for (int i = 0; i < buffer_len; i++)
            sum += buffer[i];
        cout << "Sum: " << sum << endl;
    }
    return 0;
}
The following code is like the code above, but sets the array elements to random values before calculating the sum of the array items.
This program consumes 256.1 MiB when running:
#include <cstdlib>
#include <iostream>
using namespace std;

const unsigned int buffer_len = 256 * 1024 * 1024; // 256 MiB
unsigned char buffer[buffer_len];

int main()
{
    while (true)
    {
        // ********** Changing array items.
        for (int i = 0; i < buffer_len; i++)
            buffer[i] = std::rand() % 256;
        // **********
        int sum = 0;
        for (int i = 0; i < buffer_len; i++)
            sum += buffer[i];
        cout << "Sum: " << sum << endl;
    }
    return 0;
}
Why do global arrays not consume memory in read-only mode (188 KiB vs 256 MiB)?
My compiler: GCC
My OS: Ubuntu 20.04
Update:
In my real scenario I will generate the buffer with the xxd command, so its elements are not zero:
$ xxd -i buffer.dat buffer.cpp
There's much speculation in the comments that this behavior is explained by compiler optimizations, but OP's use of g++ (i.e. without optimization) doesn't support this, and neither does the assembly output on the same architecture, which clearly shows buffer being used:
buffer:
.zero 268435456
.section .rodata
...
.L3:
movl -4(%rbp), %eax
cmpl $268435455, %eax
ja .L2
movl -4(%rbp), %eax
cltq
leaq buffer(%rip), %rdx
movzbl (%rax,%rdx), %eax
movzbl %al, %eax
addl %eax, -8(%rbp)
addl $1, -4(%rbp)
jmp .L3
The real reason you're seeing this behavior is the use of Copy On Write in the kernel's VM system. Essentially for a large buffer of zeros like you have here, the kernel will create a single "zero page" and point all pages in buffer to this page. Only when the page is written will it get allocated.
This is actually true in your second example as well (i.e. the behavior is the same), but there you're touching every page with data, which forces the memory to be "paged in" for the entire buffer. Try writing only buffer_len/2 bytes, and you'll see that 128.2 MiB of memory gets allocated by the process: half of the true size of the buffer.
Here's also a helpful summary from Wikipedia:
The copy-on-write technique can be extended to support efficient memory allocation by having a page of physical memory filled with zeros. When the memory is allocated, all the pages returned refer to the page of zeros and are all marked copy-on-write. This way, physical memory is not allocated for the process until data is written, allowing processes to reserve more virtual memory than physical memory and use memory sparsely, at the risk of running out of virtual address space.

Why is std::vector slower than an array? [duplicate]

This question already has answers here:
Performance: memset
(2 answers)
Why might std::vector be faster than a raw dynamically allocated array?
(2 answers)
Why is iterating through `std::vector` faster than iterating through `std::array`?
(2 answers)
Idiomatic way of performance evaluation?
(1 answer)
Closed 3 years ago.
When I run the following program (with optimization on), the for loop with the std::vector takes about 0.04 seconds while the for loop with the array takes 0.0001 seconds.
#include <iostream>
#include <vector>
#include <chrono>
int main()
{
    int len = 800000;
    int* Data = new int[len];
    int arr[3] = { 255, 0, 0 };
    std::vector<int> vec = { 255, 0, 0 };

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < len; i++) {
        Data[i] = vec[0];
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "The vector took " << elapsed.count() << " seconds\n";

    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < len; i++) {
        Data[i] = arr[0];
    }
    finish = std::chrono::high_resolution_clock::now();
    elapsed = finish - start;
    std::cout << "The array took " << elapsed.count() << " seconds\n";

    char s;
    std::cin >> s;
    delete[] Data;
}
The code is a simplified version of a performance issue I was having while writing a raycaster. The len variable corresponds to how many times the loop in the original program needs to run (400 pixels * 400 pixels * 50 maximum render distance). For complicated reasons (perhaps that I don't fully understand how to use arrays) I have to use a vector rather than an array in the actual raycaster.

However, as this program demonstrates, that would only give me 20 frames per second as opposed to the envied 10,000 frames per second that using an array would supposedly give me (obviously, this is just a simplified performance test). But regardless of how accurate those numbers are, I still want to boost my frame rate as much as possible. So, why is the vector performing so much slower than the array? Is there a way to speed it up?

Thanks for your help. And if there's anything else I'm doing weirdly that might be affecting performance, please let me know. I didn't even know about optimization until researching an answer for this question, so if there are any more things like that which might boost the performance, please let me know (and I'd prefer if you explained where those settings are in the properties manager rather than the command line, since I don't yet know how to use the command line).
Let us observe how GCC optimizes this test program:
#include <vector>
int main()
{
    int len = 800000;
    int* Data = new int[len];
    int arr[3] = { 255, 0, 0 };
    std::vector<int> vec = { 255, 0, 0 };
    for (int i = 0; i < len; i++) {
        Data[i] = vec[0];
    }
    for (int i = 0; i < len; i++) {
        Data[i] = arr[0];
    }
    delete[] Data;
}
The compiler rightly notices that the vector is constant and eliminates it. Exactly the same code is generated for both loops, so it should be irrelevant whether the first loop uses the array or the vector.
.L2:
movups XMMWORD PTR [rcx], xmm0
add rcx, 16
cmp rsi, rcx
jne .L2
What makes the difference in your test program is the order of the loops. The comments point out that when a third loop is added to the beginning, both loops take the same time.
I would expect that with a modern compiler accessing a vector would be approximately as fast as accessing an array, when optimization is enabled and debug is disabled. If there is an observable difference in your actual program, the problem lies somewhere else.
It is about caches. I don't know the details of how it works, but Data[] becomes better known to the CPU (cached) as it is used. If you reverse the order of the loops, you will see "the vector is faster".
But actually, you are testing neither the vector nor the array. Let's assume that vec[0] resides at memory location 0x01 and arr[0] resides at 0xf1. The only difference is reading a word from two different memory addresses. So you are really testing how fast you can assign a value to the elements of a dynamically allocated array.
Note: std::chrono::high_resolution_clock might not be suitable for measuring intervals; as cppreference says, it is better to use steady_clock.

running speed of permutation function using different methods results in unexpected results

I have implemented an isPermutation function which, given two strings, will return true if the two are permutations of each other, and false otherwise.
One uses the C++ sort algorithm twice, while the other uses an array of ints to keep track of character counts.
I ran the code several times and every time the sorting method is faster. Is my array implementation wrong?
Here is the output:
1
0
1
Time: 0.088 ms
1
0
1
Time: 0.014 ms
And the code:
#include <iostream> // cout
#include <string> // string
#include <cstring> // memset
#include <algorithm> // sort
#include <ctime> // clock_t
using namespace std;
#define MAX_CHAR 255
void PrintTimeDiff(clock_t start, clock_t end) {
    std::cout << "Time: " << (end - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
}

// using an array to keep a count of used chars
bool isPermutation(string inputa, string inputb) {
    int allChars[MAX_CHAR];
    memset(allChars, 0, sizeof(int) * MAX_CHAR);
    for (int i = 0; i < inputa.size(); i++) {
        allChars[(int)inputa[i]]++;
    }
    for (int i = 0; i < inputb.size(); i++) {
        allChars[(int)inputb[i]]--;
        if (allChars[(int)inputb[i]] < 0) {
            return false;
        }
    }
    return true;
}

// using sorting and comparing
bool isPermutation_sort(string inputa, string inputb) {
    std::sort(inputa.begin(), inputa.end());
    std::sort(inputb.begin(), inputb.end());
    if (inputa == inputb) return true;
    return false;
}

int main(int argc, char* argv[]) {
    clock_t start = clock();
    cout << isPermutation("god", "dog") << endl;
    cout << isPermutation("thisisaratherlongerinput", "thisisarathershorterinput") << endl;
    cout << isPermutation("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());

    start = clock();
    cout << isPermutation_sort("god", "dog") << endl;
    cout << isPermutation_sort("thisisaratherlongerinput", "thisisarathershorterinput") << endl;
    cout << isPermutation_sort("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());
    return 0;
}
To benchmark this you have to eliminate all the noise you can.
The easiest way to do this is to wrap each call in a loop that repeats it 1000 times or so, then only spit out the value every 10 iterations. This way they each have a similar caching profile. Throw away values that are bogus (e.g. blowouts due to context switches by the OS).
Doing this, I got your method marginally faster. An excerpt:
method 1 array Time: 0.768 us
method 2 sort Time: 0.840333 us
method 1 array Time: 0.621333 us
method 2 sort Time: 0.774 us
method 1 array Time: 0.769 us
method 2 sort Time: 0.856333 us
method 1 array Time: 0.766 us
method 2 sort Time: 0.850333 us
method 1 array Time: 0.802667 us
method 2 sort Time: 0.89 us
method 1 array Time: 0.778 us
method 2 sort Time: 0.841333 us
I used rdtsc which works better for me on this system. 3000 cycles per microsecond is close enough for this, but please do make it more accurate if you care about precision of the readings.
#if defined(__x86_64__)
static uint64_t rdtsc()
{
    uint64_t hi, lo;
    __asm__ __volatile__ (
        "xor %%eax, %%eax\n"
        "cpuid\n"
        "rdtsc\n"
        : "=a"(lo), "=d"(hi)
        :: "ebx", "ecx");
    return (hi << 32) | lo;
}
#else
#error wrong architecture - implement me
#endif

void PrintTimeDiff(uint64_t start, uint64_t end) {
    std::cout << "Time: " << (end - start) / double(3000) << " us" << std::endl;
}
You cannot check performance differences between implementations while putting calls to std::cout into the mix. isPermutation and isPermutation_sort are orders of magnitude faster than a call to std::cout (and, anyway, prefer '\n' over std::endl).
For testing you have to activate compiler optimizations. Doing so, the compiler will apply the loop-invariant code motion optimization and you'll probably get the same results for both implementations.
A more effective way of testing is:
int main()
{
    const std::vector<std::string> bag
    {
        "god", "dog", "thisisaratherlongerinput", "thisisarathershorterinput",
        "armen", "ramen"
    };

    static std::mt19937 engine;
    std::uniform_int_distribution<std::size_t> rand(0, bag.size() - 1);

    const unsigned stop = 1000000;
    unsigned counter = 0;

    std::clock_t start = std::clock();
    for (unsigned i(0); i < stop; ++i)
        counter += isPermutation(bag[rand(engine)], bag[rand(engine)]);
    std::cout << counter << '\n';
    PrintTimeDiff(start, clock());

    counter = 0;
    start = std::clock();
    for (unsigned i(0); i < stop; ++i)
        counter += isPermutation_sort(bag[rand(engine)], bag[rand(engine)]);
    std::cout << counter << '\n';
    PrintTimeDiff(start, clock());
    return 0;
}
I have 2.4s for isPermutations_sort vs 2s for isPermutation (somewhat similar to Hal's results). Same with g++ and clang++.
Printing the value of counter has the double benefit of:
triggering the as-if rule (the compiler cannot remove the for loops);
allowing a first check of your implementations (the two values cannot be too distant).
There are some things you have to change in your implementation of isPermutation:
pass arguments as const references
bool isPermutation(const std::string &inputa, const std::string &inputb)
just this change brings the time down to 0.8s (of course you cannot do the same with isPermutation_sort).
you can use std::array and std::fill instead of memset (this is C++ :-)
avoid premature pessimization and prefer preincrement. Only use postincrement if you're going to use the original value
do not mix signed and unsigned value in the for loops (inputa.size() and i). i should be declared as std::size_t
even better, use the range based for loop.
So something like:
bool isPermutation(const std::string &inputa, const std::string &inputb)
{
    std::array<int, MAX_CHAR> allChars;
    allChars.fill(0);

    for (auto c : inputa)
        ++allChars[(unsigned char)c];

    for (auto c : inputb)
    {
        --allChars[(unsigned char)c];
        if (allChars[(unsigned char)c] < 0)
            return false;
    }
    return true;
}
Anyway both isPermutation and isPermutation_sort should have this preliminary check:
if (inputa.length() != inputb.length())
    return false;
Now we are at 0.55s for isPermutation vs 1.1s for isPermutation_sort.
Last but not least consider std::is_permutation:
for (unsigned i(0); i < stop; ++i)
{
const std::string &s1(bag[rand(engine)]), &s2(bag[rand(engine)]);
counter += std::is_permutation(s1.begin(), s1.end(), s2.begin());
}
(0.6s)
EDIT
As observed in BeyelerStudios' comment Mersenne-Twister is too much in this case.
You can change the engine to a simpler one:
static std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> engine;
This further lowers the timings. Luckily the relative speeds remain the same.
Just to be sure I've also checked with a non random access scheme obtaining the same relative results.
Your idea amounts to using a Counting Sort on both strings, but with the comparison happening on the count array, rather than after writing out sorted strings.
It works well because a byte can only have one of 255 non-zero values. Zeroing 256B of memory, or even 4*256B, is pretty cheap, so it works well even for fairly short strings, where most of the count array isn't touched.
It should be fairly good for very long strings, at least in some cases. It's heavily dependent on a good, heavily pipelined L1 cache, because scattered increments to the count array produce scattered read-modify-writes. Repeated occurrences create a dependency chain with a store-load round-trip in it. This is a big glass jaw for this algorithm on CPUs where many loads and stores can be in flight at once (with their latencies happening in parallel). Modern x86 CPUs should run it pretty well, since they can sustain a load + store every clock cycle.
The initial count of inputa compiles to a very tight loop:
.L15:
movsx rdx, BYTE PTR [rax]
add rax, 1
add DWORD PTR [rsp-120+rdx*4], 1
cmp rax, rcx
jne .L15
This brings us to the first major bug in your code: char can be signed or unsigned. In the x86-64 ABI, char is signed, so allChars[(int)inputa[i]]++; sign-extends it for use as an array index. (movsx instead of movzx). Your code will write outside the array bounds on non-ASCII characters that have their high bit set. So you should have written allChars[(unsigned char)inputa[i]]++;. Note that casting to (unsigned) doesn't give the result we want (see comments).
Note that clang makes much worse code (v3.7.1 and v3.8, both with -O3), with a function call to std::basic_string<...>::_M_leak_hard() inside the inner loop. (Leak as in leak a reference, I think.) #manlio's version doesn't have this problem, so I guess for (auto c : inputa) syntax helps clang figure out what's happening.
Also, using std::string when your callers have char[] forces them to construct a std::string. That's kind of silly, but it is helpful to be able to compare string lengths.
GNU libc's std::is_permutation uses a very different strategy:
First, it skips any common prefix that's identical without permutation in both strings.
Then, for each element in inputa:
count the occurrences of that element in inputb. Check that it matches the count in inputa.
There are a couple optimizations:
Only compare counts the first time an element is seen: find duplicates by searching from the beginning of inputa, and if the match position isn't the current position, we've already checked this element.
check that the match count in inputb is != 0 before counting matches in the rest of inputa.
This doesn't need any temporary storage, so it can work when the elements are large. (e.g. an array of int64_t, or an array of structs).
If there is a mismatch, this is likely to find it early, before doing as much work. There are probably a few cases of inputs where the counting version would take less time, but probably for most inputs the library algorithm is best.
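The library strategy described above can be sketched in plain C++. This is an illustration of the idea only, not the actual libstdc++ source, and the helper name is made up:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

bool is_permutation_sketch(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;

    // Step 1: skip the common prefix that matches without any permutation.
    std::size_t start = 0;
    while (start < a.size() && a[start] == b[start])
        ++start;

    // Step 2: for each element of the remaining suffix, compare occurrence
    // counts, but only the first time that element is seen.
    for (std::size_t i = start; i < a.size(); ++i) {
        if (std::find(a.begin() + start, a.begin() + i, a[i]) != a.begin() + i)
            continue; // duplicate: already counted at its first occurrence

        auto in_b = std::count(b.begin() + start, b.end(), a[i]);
        if (in_b == 0)
            return false; // early out before counting in a at all

        auto in_a = std::count(a.begin() + i, a.end(), a[i]);
        if (in_a != in_b)
            return false;
    }
    return true;
}
```

Like the library version, this needs no temporary storage, at the cost of O(n²) comparisons in the worst case.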
std::is_permutation uses std::count, which should be implemented very well with SSE / AVX vectors. Unfortunately, it's auto-vectorized in a really stupid way by both gcc and clang. It unpacks bytes to 64bit integers before accumulating them in vector elements, to avoid overflow. So it spends most of its instructions shuffling data around, and is probably slower than a scalar implementation (which you'd get from compiling with -O2, or with -O3 -fno-tree-vectorize).
It could and should only do this every few iterations, so the inner loop of count can just be something like pcmpeqb / psubb, with a psadbw every 255 iterations. Or pcmpeqb / pmovmskb / popcnt / add, but that's slower.
Template specializations in the library could help a lot for std::count for 8, 16, and 32bit types whose equality can be checked with bitwise equality (integer ==).

Poor performance of vector<bool> in 64-bit target with VS2012

Benchmarking this class:
struct Sieve {
    std::vector<bool> isPrime;

    Sieve (int n = 1) {
        isPrime.assign (n+1, true);
        isPrime[0] = isPrime[1] = false;
        for (int i = 2; i <= (int)sqrt((double)n); ++i)
            if (isPrime[i])
                for (int j = i*i; j <= n; j += i)
                    isPrime[j] = false;
    }
};
I'm getting over 3 times worse performance (CPU time) with 64-bit binary vs. 32-bit version (release build) when calling a constructor for a large number, e.g.
Sieve s(100000000);
I tested sizeof(bool) and it is 1 for both versions.
When I substitute vector<bool> with vector<char> performance becomes the same for 64-bit and 32-bit versions. Why is that?
Here are the run times for Sieve(100000000) (release mode; 32-bit first, 64-bit second):
vector<bool> 0.97s 3.12s
vector<char> 0.99s 0.99s
vector<int> 1.57s 1.59s
I also did a sanity test with VS2010 (prompted by Wouter Huysentruit's response), which produced 0.98s 0.88s. So there is something wrong with VS2012 implementation.
I submitted a bug report to Microsoft Connect
EDIT
Many answers below comment on deficiencies of using int for indexing. This may be true, but even the Great Wizard himself uses a standard for (int i = 0; i < v.size(); ++i) in his books, so such a pattern should not incur a significant performance penalty. Additionally, this issue was raised at the Going Native 2013 conference, where the presiding group of C++ gurus described their early recommendation of using size_t for indexing and as the return type of size() as a mistake. They said: "we are sorry, we were young..."
The title of this question could be rephrased to: Over 3 times performance drop on this code when upgrading from VS2010 to VS2012.
EDIT
I made a crude attempt at finding memory alignment of indexes i and j and discovered that this instrumented version:
struct Sieve {
    vector<bool> isPrime;

    Sieve (int n = 1) {
        isPrime.assign (n+1, true);
        isPrime[0] = isPrime[1] = false;
        for (int i = 2; i <= sqrt((double)n); ++i) {
            if (i == 17) cout << ((int)&i) % 16 << endl;
            if (isPrime[i])
                for (int j = i*i; j <= n; j += i) {
                    if (j == 4) cout << ((int)&j) % 16 << endl;
                    isPrime[j] = false;
                }
        }
    }
};
auto-magically runs fast now (only 10% slower than 32-bit version). This and VS2010 performance makes it hard to accept a theory of optimizer having inherent problems dealing with int indexes instead of size_t.
std::vector<bool> is not directly at fault here. The performance difference is ultimately caused by your use of the signed 32-bit int type in your loops and some rather poor register allocation by the compiler. Consider, for example, your innermost loop:
for (int j = i*i; j <= n; j += i)
isPrime[j] = false;
Here, j is a 32-bit signed integer. When it is used in isPrime[j], however, it must be promoted (and sign-extended) to a 64-bit integer, in order to perform the subscript computation. The compiler can't just treat j as a 64-bit value, because that would change the behavior of the loop (e.g. if n is negative). The compiler also can't perform the index computation using the 32-bit quantity j, because that would change the behavior of that expression (e.g. if j is negative).
So, the compiler needs to generate code for the loop using a 32-bit j then it must generate code to convert that j to a 64-bit integer for the subscript computation. It has to do the same thing for the outer loop with i. Unfortunately, it looks like the compiler allocates registers rather poorly in this case(*)--it starts spilling temporaries to the stack, causing the performance hit you see.
If you change your program to use size_t everywhere (which is 32-bit on x86 and 64-bit on x64), you will observe that the performance is on par with x86, because the generated code needs only work with values of one size:
Sieve (size_t n = 1)
{
    isPrime.assign (n+1, true);
    isPrime[0] = isPrime[1] = false;
    for (size_t i = 2; i <= static_cast<size_t>(sqrt((double)n)); ++i)
        if (isPrime[i])
            for (size_t j = i*i; j <= n; j += i)
                isPrime[j] = false;
}
You should do this anyway, because mixing signed and unsigned types, especially when those types are of different widths, is perilous and can lead to unexpected errors.
Note that using std::vector<char> also "solves" the problem, but for a different reason: the subscript computation required for accessing an element of a std::vector<char> is substantially simpler than that for accessing an element of std::vector<bool>. The optimizer is able to generate better code for the simpler computations.
(*) I don't work on code generation, and I'm hardly an expert in either assembly or low-level performance optimization, but from looking at the generated code, and given that it is reported here that Visual C++ 2010 generates better code, I'd guess that there are opportunities for improvement in the compiler. I'll make sure the Connect bug you opened gets forwarded on to the compiler team so they can take a look.
I've tested this with vector<bool> in VS2010: 32-bit needs 1452 ms while 64-bit needs 1264 ms to complete on an i3.
The same test in VS2012 (on i7 this time) needs 700ms (32-bit) and 2730ms (64-bit), so there is something wrong with the compiler in VS2012. Maybe you can report this test case as a bug to Microsoft.
UPDATE
The problem is that the VS2012 compiler uses a temporary stack variable for a part of the code in the inner for-loop when using int as iterator. The assembly parts listed below are part of the code inside <vector>, in the += operator of the std::vector<bool>::iterator.
size_t as iterator
When using size_t as iterator, a part of the code looks like this:
or rax, -1
sub rax, rdx
shr rax, 5
lea rax, QWORD PTR [rax*4+4]
sub r8, rax
Here, all instructions use CPU registers which are very fast.
int as iterator
When using int as iterator, that same part looks like this:
or rcx, -1
sub rcx, r8
shr rcx, 5
shl rcx, 2
mov rax, -4
sub rax, rcx
mov rdx, QWORD PTR _Tmp$6[rsp]
add rdx, rax
Here you see the _Tmp$6 stack variable being used, which causes the slowdown.
Point compiler into the right direction
The funny part is that you can point the compiler into the right direction by using the vector<bool>::iterator directly.
struct Sieve {
    std::vector<bool> isPrime;

    Sieve (int n = 1) {
        isPrime.assign(n + 1, true);

        std::vector<bool>::iterator it1 = isPrime.begin();
        std::vector<bool>::iterator end = it1 + n;
        *it1++ = false;
        *it1++ = false;

        for (int i = 2; i <= (int)sqrt((double)n); ++it1, ++i)
            if (*it1)
                for (std::vector<bool>::iterator it2 = isPrime.begin() + i * i; it2 <= end; it2 += i)
                    *it2 = false;
    }
};
vector<bool> is a very special container that's specialized to use 1 bit per item rather than providing normal container semantics. I suspect that the bit manipulation logic is much more expensive when compiling 64 bits (either it still uses 32 bit chunks to hold the bits or some other reason). vector<char> behaves just like a normal vector so there's no special logic.
You could also use deque<bool> which doesn't have the specialization.

Which one will be faster

Just calculating the sum of two arrays, with a slight modification in the code:
int main()
{
    int a[10000] = {0}; // initialize something
    int b[10000] = {0}; // initialize something
    int sumA = 0, sumB = 0;

    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
        sumB += b[i];
    }
    printf("%d %d", sumA, sumB);
}
OR
int main()
{
    int a[10000] = {0}; // initialize something
    int b[10000] = {0}; // initialize something
    int sumA = 0, sumB = 0;

    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
    }
    for (int i = 0; i < 10000; i++)
    {
        sumB += b[i];
    }
    printf("%d %d", sumA, sumB);
}
Which code will be faster?
There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (CPU, memory bandwidth, etc.).
The size of the data in your array (ints in your example) affects the result, since it has an impact on the use of the processor cache. Often you will find that example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 accesses memory in a more efficient way).
Here's some code with timing, built using VS2005:
#include <windows.h>
#include <iostream>
using namespace std;
int main ()
{
    LARGE_INTEGER
        start,
        middle,
        end;

    const int count = 1000000;

    // Value-initialize the arrays: summing uninitialized memory is
    // undefined behaviour.
    int
        *a = new int [count] (),
        *b = new int [count] (),
        *c = new int [count] (),
        *d = new int [count] (),
        suma = 0,
        sumb = 0,
        sumc = 0,
        sumd = 0;

    QueryPerformanceCounter (&start);
    for (int i = 0 ; i < count ; ++i)
    {
        suma += a [i];
        sumb += b [i];
    }
    QueryPerformanceCounter (&middle);
    for (int i = 0 ; i < count ; ++i)
    {
        sumc += c [i];
    }
    for (int i = 0 ; i < count ; ++i)
    {
        sumd += d [i];
    }
    QueryPerformanceCounter (&end);

    cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
    cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
    cout << "Done." << endl << suma << sumb << sumc << sumd;

    delete [] a; delete [] b; delete [] c; delete [] d;
    return 0;
}
Running this, the latter version is usually faster.
I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:
00401110 mov edx,dword ptr [eax-0Ch]
00401113 add edx,dword ptr [eax-8]
00401116 add eax,14h
00401119 add edx,dword ptr [eax-18h]
0040111C add edx,dword ptr [eax-10h]
0040111F add edx,dword ptr [eax-14h]
00401122 add ebx,edx
00401124 sub ecx,1
00401127 jne main+110h (401110h)
Here's the register usage:
eax = used to index the array
ebx = the grand total
ecx = loop counter
edx = sum of the five integers accessed in one iteration of the loop
There are a few interesting things here:
The compiler has unrolled the loop five times.
The order of memory access is not contiguous.
It updates the array index in the middle of the loop.
It sums five integers then adds that to the grand total.
To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.
In theory, due to cache behaviour, the second one should be faster.
Caches are optimized to fetch and keep contiguous chunks of data, so the first access brings a big chunk of the first array into the cache. In the first version, accessing the second array may evict some of the first array's data, requiring additional memory accesses.
In practice both approaches will take more or less the same time, with the first being a little better given the size of actual caches and the likelihood of no data being evicted at all.
Note: this sounds a lot like homework. In real life, for these sizes, the first option will be slightly faster, but this only applies to this concrete example; nested loops, bigger arrays, or especially smaller cache sizes would have a significant impact on performance depending on the access order.
The first one will be faster. The compiler will not need to run the loop twice. Although it is not much work, some cycles are lost on incrementing the loop variable and evaluating the loop condition.
For me (GCC -O3), measuring shows that the second version is faster by some 25%, which can be explained by a more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course you'll need to repeat the operation thousands of times before the difference becomes significant.
I also tried std::accumulate from the numeric header, which is the simple way to implement the second version, and it was in turn a tiny amount faster than the second version (probably due to a more compiler-friendly looping mechanism?):
sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);
The first one will be faster because you loop from 0 to 10000 only once.
The C++ Standard says nothing about this; it is implementation dependent. It looks like you are trying to do premature optimization. It shouldn't bother you unless it is a bottleneck in your program. If it is, you should use a profiler to find out which one is faster on your particular platform.
Until then, I'd prefer the first variant because it looks more readable (or, better, std::accumulate).
If the data is large enough that the cache cannot hold both arrays (as in example 1) but can hold a single array (as in example 2), then the code of the first example will be slower than the code of the second example.
Otherwise, the code of the first example will be faster than the second one.
The first one will probably be faster. The memory access pattern will allow the (modern) CPU to manage its caches efficiently (prefetching), even while accessing two arrays.
Much faster if your CPU supports it and the arrays are aligned: use SSE3 instructions to process four ints at a time.
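As a sketch of what that vectorization might look like: the packed 32-bit integer addition (_mm_add_epi32) is in fact already available in SSE2, and the unaligned load below sidesteps the alignment requirement at some cost. The function name and the "count divisible by 4" simplification are my own assumptions:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Sum `count` ints, four lanes at a time. Assumes count % 4 == 0 for brevity.
int sse_sum(const int* a, int count) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < count; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        acc = _mm_add_epi32(acc, v);  // four additions per instruction
    }
    // Horizontal reduction of the four partial sums.
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Note that modern compilers at -O2/-O3 will often auto-vectorize the plain loop into something equivalent, so hand-written intrinsics are worth it only after measuring.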
If you meant a[i] instead of a[10000] (and for b, respectively) and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.
If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).
Food for thought for some answers that were voted up: how many additions are performed in each version of the code?