I was testing algorithms and ran into this weird behavior, where std::accumulate is faster than a simple for loop.
Looking at the generated assembly I'm not much wiser :-) It seems that the for loop is optimized into MMX instructions, while accumulate expands into a plain loop.
This is the code. The behavior manifests at the -O3 optimization level with gcc 4.7.1:
#include <vector>
#include <chrono>
#include <iostream>
#include <random>
#include <numeric>   // for std::accumulate (it lives here, not in <algorithm>)
using namespace std;

int main()
{
    const size_t vsize = 100*1000*1000;

    vector<int> x;
    x.reserve(vsize);

    mt19937 rng;
    rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
    uniform_int_distribution<uint32_t> dist(0, 10);
    for (size_t i = 0; i < vsize; i++)
    {
        x.push_back(dist(rng));
    }

    // dry run so the timed loops see warm caches
    long long tmp = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        tmp += x[i];
    }
    cout << "dry run " << tmp << endl;

    auto start = chrono::high_resolution_clock::now();
    long long suma = accumulate(x.begin(), x.end(), 0);
    auto end = chrono::high_resolution_clock::now();
    cout << "Accumulate runtime " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " - " << suma << endl;

    start = chrono::high_resolution_clock::now();
    suma = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        suma += x[i];
    }
    end = chrono::high_resolution_clock::now();
    cout << "Manual sum runtime " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " - " << suma << endl;

    return 0;
}
When you pass 0 to accumulate, you make it accumulate using an int instead of a long long.
If you code your manual loop like this, it will be equivalent:
int sumb = 0;
for (size_t i = 0; i < vsize; i++)
{
    sumb += x[i];
}
suma = sumb;
or you can call accumulate like this:
long long suma = accumulate(x.begin(),x.end(),0LL);
I got somewhat different results using Visual Studio 2012.
// original code
Accumulate runtime 93600 ms
Manual sum runtime 140400 ms
Note that the original std::accumulate code isn't equivalent to the for loop because the third parameter to std::accumulate is an int 0 value. It performs the summation using an int and only at the end stores the result in a long long. Changing the third parameter to 0LL forces the algorithm to use a long long accumulator and results in the following times.
// change std::accumulate initial value -> 0LL
Accumulate runtime 265200 ms
Manual sum runtime 140400 ms
Since the final result fits in an int, I changed suma and std::accumulate back to using only int values. After this change the MSVC 2012 compiler was able to auto-vectorize the for loop, which produced the following times.
// change suma from long long to int
Accumulate runtime 93600 ms
Manual sum runtime 46800 ms
After fixing the accumulate issue others noted, I tested with both Visual Studio 2008 and 2010, and accumulate was indeed faster than the manual loop.
Looking at the disassembly, I saw some additional iterator checking being done in the manual loop, so I switched to a raw array to eliminate it.
Here's what I ended up testing with:
#include <Windows.h>
#include <iostream>
#include <numeric>
#include <stdlib.h>

int main()
{
    const size_t vsize = 100*1000*1000;

    int* x = new int[vsize];
    for (size_t i = 0; i < vsize; i++) x[i] = rand() % 1000;

    LARGE_INTEGER start, stop;
    long long suma = 0, sumb = 0, timea = 0, timeb = 0;

    QueryPerformanceCounter( &start );
    suma = std::accumulate(x, x + vsize, 0LL);
    QueryPerformanceCounter( &stop );
    timea = stop.QuadPart - start.QuadPart;

    QueryPerformanceCounter( &start );
    for (size_t i = 0; i < vsize; ++i) sumb += x[i];
    QueryPerformanceCounter( &stop );
    timeb = stop.QuadPart - start.QuadPart;

    std::cout << "Accumulate: " << timea << " - " << suma << std::endl;
    std::cout << "      Loop: " << timeb << " - " << sumb << std::endl;

    delete [] x;
    return 0;
}
Accumulate: 633942 - 49678806711
Loop: 292642 - 49678806711
Using this code, the manual loop easily beats accumulate. The big difference is that the compiler unrolled the manual loop four times; otherwise the generated code is almost identical.
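For illustration, here is a rough sketch of what four-way unrolling with independent accumulators looks like when written by hand. This approximates the idea, not the actual MSVC output; the function name is mine:

#include <cstddef>

// Sketch: manual 4x unrolling with independent accumulators,
// roughly the shape of what the compiler generated for the raw-array loop.
long long sum_unrolled(const int* x, std::size_t n)
{
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i]; // handle the leftover elements
    return s0 + s1 + s2 + s3;
}

The four independent additions per iteration let the CPU overlap the latency of each add, which a single serial accumulator cannot do.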
Related
I have written a simple program to compare the time taken to operate on the elements of two arrays (both of the same size), one defined by the C++ array class and the other a plain C-style array. The code I have used is:
#include <iostream>
#include <array>
#include <chrono>
using namespace std;

const int size = 1E8;
const int limit = 1E2;

array<float, size> A;
float B[size];

int main () {
    using namespace std::chrono;
    //-------------------------------------------------------------------------------//
    auto start = steady_clock::now();
    for (int i = 0; i < limit; i++)
        for (int j = 0; j < size; j++)
            A.at(j) *= 1.;
    auto end = steady_clock::now();
    auto span = duration_cast<seconds> (end - start).count();
    cout << "Time taken for array A is: " << span << " sec" << endl;
    //-------------------------------------------------------------------------------//
    start = steady_clock::now();
    for (int i = 0; i < limit; i++)
        for (int j = 0; j < size; j++)
            B[j] *= 1.;
    end = steady_clock::now();
    span = duration_cast<seconds> (end - start).count();
    cout << "Time taken for array B is: " << span << " sec" << endl;
    //-------------------------------------------------------------------------------//
    return 0;
}
which I have compiled and run with
g++ array.cxx
./a.out
The output I get is the following
Time taken for array A is: 52 sec
Time taken for array B is: 22 sec
Why does the C++ array class take so much longer to operate on?
The std::array::at member function does bounds checking, so of course there is some extra overhead. If you want a fairer comparison, use std::array::operator[], just like the plain array.
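As a minimal sketch of that change (same loop shape as the question, just the access method swapped; the function names are mine):

#include <array>
#include <cstddef>

// at() checks the index and throws std::out_of_range on failure;
// operator[] does no check, matching the plain C array.
template <std::size_t N>
void scale_checked(std::array<float, N>& a)
{
    for (std::size_t j = 0; j < N; ++j)
        a.at(j) *= 1.f;
}

template <std::size_t N>
void scale_unchecked(std::array<float, N>& a)
{
    for (std::size_t j = 0; j < N; ++j)
        a[j] *= 1.f;
}

Note also that the program above was built with plain g++ array.cxx, i.e. without optimization; with -O2 or -O3 the checked version usually closes much of the gap, since the compiler can often hoist or eliminate the checks.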
I keep getting an unhandled exception in my code and it has me stumped.
I am sure it is in the way I have my variables declared.
Basically I am attempting to create 3 arrays, M rows, N columns of random variables.
If I set N = 1,000 and M = 10,000, there is no problem.
If I then change M to 100,000, I get an unhandled-exception memory allocation error.
Can someone please help me understand why this is happening?
Parts of the code were written on VS2010. I have now moved on to VS2013, so any additional advice on the use of newer features would also be appreciated.
cheers,
#include <cmath>
#include <cstdlib>   // for system()
#include <iostream>
#include <random>
#include <vector>
#include <ctime>
#include <ratio>
#include <chrono>

int main()
{
    using namespace std::chrono;
    steady_clock::time_point Start_Time = steady_clock::now();

    unsigned int N;      // Number of time steps in a simulation
    unsigned long int M; // Number of simulations (paths)
    N = 1000;
    M = 10000;

    // Random number generation setup
    double RANDOM;
    srand((unsigned int)time(NULL));                         // Generator loop reset
    std::default_random_engine generator(rand());            // Seed with rand()
    std::normal_distribution<double> distribution(0.0, 1.0); // Mean = 0.0, variance = 1.0, i.e. standard normal

    std::vector<std::vector<double>> RandomVar_A(M, std::vector<double>(N)); // dw
    std::vector<std::vector<double>> RandomVar_B(M, std::vector<double>(N)); // uncorrelated dz
    std::vector<std::vector<double>> RandomVar_C(M, std::vector<double>(N)); // dz

    // Generate random variables for dw
    for (unsigned long int i = 0; i < M; i++)
    {
        for (unsigned int j = 0; j < N; j++)
        {
            RANDOM = distribution(generator);
            RandomVar_A[i][j] = RANDOM;
        }
    }

    // Generate random variables for uncorrelated dz
    for (unsigned long int i = 0; i < M; i++)
    {
        for (unsigned int j = 0; j < N; j++)
        {
            RANDOM = distribution(generator);
            RandomVar_B[i][j] = RANDOM;
        }
    }

    // Generate random variables for dz
    for (unsigned long int i = 0; i < M; i++)
    {
        for (unsigned int j = 0; j < N; j++)
        {
            RANDOM = distribution(generator);
            RandomVar_C[i][j] = RANDOM;
        }
    }

    steady_clock::time_point End_Time = steady_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(End_Time - Start_Time);

    // Clear matrices
    RandomVar_A.clear();
    RandomVar_B.clear();
    RandomVar_C.clear();

    std::cout << std::endl;
    std::cout << "its done";
    std::cout << std::endl << std::endl;
    std::cout << "Time taken : " << time_span.count() << " Seconds" << std::endl << std::endl;
    std::cout << "End Of Program" << std::endl << std::endl;

    system("pause");
    return 0;
}
// *************** END OF PROGRAM ***************
Three 100,000 x 1,000 arrays of doubles represent 300 million doubles. Assuming 8-byte doubles, that's around 2.3 GB of memory. Most likely your process is limited to 2 GB by default on Windows (even if you have much more RAM installed on the machine). However, there are ways to allow your process to access a larger address space: Memory Limits for Windows.
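As a hedged sketch (not the original program), you can estimate the requirement up front and catch std::bad_alloc rather than letting it surface as an unhandled exception:

#include <iostream>
#include <new>     // std::bad_alloc
#include <vector>

int main()
{
    const unsigned long M = 100000; // paths, the failing case
    const unsigned int  N = 1000;   // time steps

    // 3 matrices x M x N doubles x 8 bytes each ~ 2.4e9 bytes,
    // more than the default 2 GB address space of a 32-bit process.
    unsigned long long bytes = 3ULL * M * N * sizeof(double);
    std::cout << "Need roughly " << bytes / (1024.0 * 1024 * 1024) << " GiB\n";

    try
    {
        std::vector<std::vector<double>> a(M, std::vector<double>(N));
        std::vector<std::vector<double>> b(M, std::vector<double>(N));
        std::vector<std::vector<double>> c(M, std::vector<double>(N));
    }
    catch (const std::bad_alloc&)
    {
        std::cout << "Allocation failed - build for x64 or reduce M and N\n";
    }
    return 0;
}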
I experienced something similar when my 32-bit application allocated more than 2 GB of memory.
Your vectors require about 2.1 GB of memory, so it might be the same problem.
Try changing your application's platform to x64. This may solve the problem.
I have been using MATLAB for a while for my projects and I have almost no experience with C++.
I needed speed, and I heard that C++ can be more efficient and faster than MATLAB. So I tried this:
I created a matrix of random numbers using rand(5000,5000) in MATLAB.
In C++, I initialized a 2D vector and wrote two nested for loops, each iterating 5000 times. MATLAB was 4-5x faster, so I thought it is because MATLAB executes vectorized code in parallel; I then wrote the C++ code using parallel_for. Here is the code:
#include "stdafx.h"
#include <iostream>
#include <vector>
#include <fstream>
#include <ppl.h>
using namespace std;
using namespace concurrency;
int main();
{
int a = 5000, b = 5000, j, k;
vector< vector<int> > vec(a, vector<imt>(b));
parallel_for(int(0), a, [&](int i) {
for (j = 0; j <b; j++)
{
vec[i][j] = rand();
}
});
}
The code above is about 25% faster than MATLAB's rand(5000,5000), yet C++ uses 100% of the CPU while MATLAB uses only 30%.
So I forced MATLAB to use all of the CPU by running 3 instances of MATLAB with rand(5000,5000) and divided the time taken by 3. That made MATLAB twice as fast as C++.
I wonder what I am missing. I know this is a tiny example, but I need an answer to be sure before porting my code to C++.
Current status:
When I write C++ code without parallel_for, I get half of MATLAB's speed with the same CPU usage. Yet the people who gave answers say the two should be almost the same. I do not understand what I am missing.
Here is a snapshot of the optimization menu.
This is maybe not an answer, but a little hint.
The comparison might be a bit unfair due to the use of vectors.
Here is a comparison I've written. Both versions take up roughly 100% of one of the four available threads. In both cases I create 5000x5000 random numbers and do this 100 times for timing.
Matlab
function stackoverflow
tic
for i=1:100
    A = rand(5000);
end
toc
Runtime: ~27.9 sec
C++
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
using namespace std;

int main(){
    int N = 5000;

    double ** A = new double*[N];
    for (int i = 0; i < N; i++)
        A[i] = new double[N];

    srand(time(NULL));

    clock_t start = clock();
    for (int k = 0; k < 100; k++){
        for (int i = 0; i < N; i++){
            for (int j = 0; j < N; j++){
                A[i][j] = rand();
            }
        }
    }
    cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
}
Runtime: ~28.7 sec
So both examples run almost equally fast.
When you call rand(5000,5000) in Matlab, Matlab executes the command by calling the Intel MKL library, a highly optimized library written in C/C++ with lots of hand-coded assembly.
MKL should be faster than any straightforward C++ implementation, but there is overhead for Matlab to call an external library. The net result is that for random number generation at smaller sizes (less than 1K, for instance) a plain C/C++ implementation will be faster, but at larger sizes Matlab benefits from the super-optimized MKL.
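As an aside (my sketch, not part of the answer above): if the C++ side is bottlenecked on rand(), which funnels through hidden global state, a C++11 engine such as std::mt19937 is usually a better baseline before reaching for MKL; for a parallel fill, give each thread its own engine.

#include <cstddef>
#include <random>
#include <vector>

int main()
{
    const std::size_t n = 5000;
    std::vector<double> a(n * n);

    // One local engine instead of rand()'s global state; for a
    // parallel fill, create one engine per thread with distinct seeds.
    std::mt19937 eng(12345);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (double& x : a)
        x = dist(eng);

    return a[0] < 1.0 ? 0 : 1; // touch the data so the fill isn't optimized away
}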
After looking at #sonystarmap's answer, I added a few types of containers: double*, vector<double> and vector<vector<double> >. I also added tests where the "pointer containers" are memset first, since vector initialises all of its memory.
The C++ code was compiled with these optimization flags: -O3 -march=native
The results:
Matlab: Elapsed time is 28.457788 seconds.
C++:
T=23844.2ms
T=25161.5ms
T=25154ms
T=24197.3ms
T=24235.2ms
T=24166.1ms
I essentially cannot find the large gain you mention.
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
#include <vector>
#include <cstring>
using namespace std;

int main(){
    const int N = 5000;

    // flat vector<double>, indexed manually
    {
        vector<double> A(N*N);
        srand(0);
        clock_t start = clock();
        for (int k = 0; k < 100; k++){
            for (int i = 0; i < N; i++){
                for (int j = 0; j < N; j++){
                    A[i*N+j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    // vector<vector<double> >
    {
        vector<vector<double> > A(N);
        for (int i = 0; i < N; i++)
            A[i] = vector<double>(N);
        srand(0);
        clock_t start = clock();
        for (int k = 0; k < 100; k++){
            for (int i = 0; i < N; i++){
                for (int j = 0; j < N; j++){
                    A[i][j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    // double**, memory not touched before the timed loop
    {
        double ** A = new double*[N];
        for (int i = 0; i < N; i++)
            A[i] = new double[N];
        srand(0);
        clock_t start = clock();
        for (int k = 0; k < 100; k++){
            for (int i = 0; i < N; i++){
                for (int j = 0; j < N; j++){
                    A[i][j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    // double**, memset first so the pages are touched like vector does
    {
        double ** A = new double*[N];
        for (int i = 0; i < N; i++) {
            A[i] = new double[N];
            memset(A[i], 0, sizeof(double) * N);
        }
        srand(0);
        clock_t start = clock();
        for (int k = 0; k < 100; k++){
            for (int i = 0; i < N; i++){
                for (int j = 0; j < N; j++){
                    A[i][j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    // flat double*, memory not touched before the timed loop
    {
        double * A = new double[N * N];
        srand(0);
        clock_t start = clock();
        for (int k = 0; k < 100; k++){
            for (int i = 0; i < N; i++){
                for (int j = 0; j < N; j++){
                    A[i*N + j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    // flat double*, memset first
    {
        double * A = new double[N * N];
        memset(A, 0, sizeof(double) * N * N);
        srand(0);
        clock_t start = clock();
        for (int k = 0; k < 100; k++){
            for (int i = 0; i < N; i++){
                for (int j = 0; j < N; j++){
                    A[i*N + j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }
}
#include <vector>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cstring>

int main() {
    const int N = 5000;
    std::vector<int> A(N*N);
    srand(0);

    clock_t start = clock();
    for(int k = 0; k < 100; ++k){
        for(int i = 0; i < N * N; ++i) {
            A[i] = rand();
        }
    }
    std::cout << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms" << "\n";
    return 0;
}
The runtime went from 25-27 seconds on my workstation without any optimization flags on the compiler to 21 seconds with
-O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native -ffast-math
Here is the C++ code; I use VS2013 in release mode.
#include <ctime>
#include <cstdlib>   // for std::system
#include <iostream>

void Tempfunction(double& a, int N)
{
    a = 0;
    for (double i = 0; i < N; ++i)
    {
        a += i;
    }
}

int main()
{
    int N = 1000; // from 1000 to 8000
    double Value = 0;

    auto t0 = std::time(0);
    for (int i = 0; i < 1000000; ++i)
    {
        Tempfunction(Value, N);
    }
    auto t1 = std::time(0);
    auto Tempfunction_time = t1 - t0;
    std::cout << "Tempfunction_time = " << Tempfunction_time << '\n';

    auto TempfunctionPtr = &Tempfunction;
    Value = 0;
    t0 = std::time(0);
    for (int i = 0; i < 1000000; ++i)
    {
        (*TempfunctionPtr)(Value, N);
    }
    t1 = std::time(0);
    auto TempfunctionPtr_time = t1 - t0;
    std::cout << "TempfunctionPtr_time = " << TempfunctionPtr_time << '\n';

    std::system("pause");
}
I changed the value of N from 1000 to 8000 and recorded Tempfunction_time and TempfunctionPtr_time.
The results are weird:
N=1000 , Tempfunction_time=1, TempfunctionPtr_time=2;
N=2000 , Tempfunction_time=2, TempfunctionPtr_time=6;
N=4000 , Tempfunction_time=4, TempfunctionPtr_time=11;
N=8000 , Tempfunction_time=8, TempfunctionPtr_time=21;
TempfunctionPtr_time - Tempfunction_time is not constant;
instead, TempfunctionPtr_time is 2-3 times Tempfunction_time.
The difference should be a constant, namely the overhead of the function-pointer call.
What is wrong?
EDIT:
Assume VS2013 inlines Tempfunction if it is called as Tempfunction(), and does not inline it if it is called through (*TempfunctionPtr); then we can explain the difference. So, if that is true, why can the compiler not inline (*TempfunctionPtr)?
I compiled the existing code with g++ on my Linux machine and found that the time was too short to be measured accurately in seconds, so I rewrote it to use std::chrono to measure the time more precisely. I also had to "use" the variable Value (hence the "499500" being printed below), otherwise the compiler would completely optimise away the first loop. Then I get the following result:
Tempfunction_time = 1.47983
499500
TempfunctionPtr_time = 1.69183
499500
Now, the results I have are for GCC (version 4.6.3 - other versions are available and may give other results!), which is not the same compiler as Microsoft's, so the results may differ - different compilers optimise code quite differently at times. I'm actually quite surprised that the compiler doesn't figure out that the result of Tempfunction only needs calculating once. But hey, that made it easier to write the benchmark without trickery.
My second observation is that, with my compiler, if I replace int N=1000; with a loop for(int N=1000; N <= 8000; N *= 2) around the main code, there is little or no difference between the two cases - I'm not entirely sure why, because the code looks identical (there is no call via a function pointer, because the compiler knows the function pointer is a constant), and Tempfunction gets inlined in both cases. (The same "equality" happens when N has values other than 1000, so I'm far from sure what is going on here....)
To actually measure the difference between a function pointer and a direct function call, you would need to move Tempfunction into a separate file and "hide" the actual value stored in TempfunctionPtr, so that the compiler doesn't figure out exactly what you are doing.
In the end, I ended up with something like this:
typedef void (*FunPtr)(double &a, int N);

void Tempfunction(double& a, int N)
{
    a = 0;
    for (double i = 0; i < N; ++i)
    {
        a += i;
    }
}

FunPtr GetFunPtr()
{
    return &Tempfunction;
}
And the "main" code like this:
#include <iostream>
#include <chrono>

typedef void (*FunPtr)(double &a, int N);

extern void Tempfunction(double& a, int N);
extern FunPtr GetFunPtr();

int main()
{
    for(int N = 1000; N <= 8000; N *= 2)
    {
        std::cout << "N=" << N << std::endl;

        double Value = 0;
        auto t0 = std::chrono::system_clock::now();
        for (int i = 0; i < 1000000; ++i)
        {
            Tempfunction(Value, N);
        }
        auto t1 = std::chrono::system_clock::now();
        std::chrono::duration<double> Tempfunction_time = t1-t0;
        std::cout << "Tempfunction_time = " << Tempfunction_time.count() << '\n';
        std::cout << Value << std::endl;

        auto TempfunctionPtr = GetFunPtr();
        Value = 0;
        t0 = std::chrono::system_clock::now();
        for (int i = 0; i < 1000000; ++i)
        {
            (*TempfunctionPtr)(Value, N);
        }
        t1 = std::chrono::system_clock::now();
        std::chrono::duration<double> TempfunctionPtr_time = t1-t0;
        std::cout << "TempfunctionPtr_time = " << TempfunctionPtr_time.count() << '\n';
        std::cout << Value << std::endl;
    }
}
However, the difference is thousandths of a second, and neither variant is a clear winner, so the only conclusion is the obvious one: "calling a function is slower than inlining it".
N=1000
Tempfunction_time = 1.78323
499500
TempfunctionPtr_time = 1.77822
499500
N=2000
Tempfunction_time = 3.54664
1.999e+06
TempfunctionPtr_time = 3.54687
1.999e+06
N=4000
Tempfunction_time = 7.0854
7.998e+06
TempfunctionPtr_time = 7.08706
7.998e+06
N=8000
Tempfunction_time = 14.1597
3.1996e+07
TempfunctionPtr_time = 14.1577
3.1996e+07
Of course, if we do "only half the hiding trick", so that the function is known and inlineable in the first case, and unknown and called through a function pointer in the second, we can perhaps expect a difference. But calling a function through a pointer is not in itself expensive. The real difference comes when the compiler decides to inline the function.
Obviously, these are the results of GCC 4.6.3, which is not the same compiler as MSVS2013. You should make the "chrono" modifications shown in the code above and see what difference they make.
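If you want to block inlining without the separate-file trick, a compiler-specific attribute does the same job. A small sketch, using the GCC/Clang syntax (MSVC spells it __declspec(noinline) instead):

// Sketch: forbid inlining so the direct call and the pointer call
// both really perform a call. GCC/Clang attribute shown; on MSVC,
// put __declspec(noinline) before the return type instead.
__attribute__((noinline))
void Tempfunction(double& a, int N)
{
    a = 0;
    for (double i = 0; i < N; ++i)
        a += i;
}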
Is using a vector of boolean values slower than a dynamic bitset?
I just heard about boost's dynamic bitset, and I was wondering: is it worth
the trouble? Can I just use a vector of boolean values instead?
A great deal here depends on how many Boolean values you're working with.
Both bitset and vector<bool> normally use a packed representation where a Boolean is stored as only a single bit.
On one hand, that imposes some overhead in the form of bit manipulation to access a single value.
On the other hand, that also means many more of your Booleans will fit in your cache.
If you're using a lot of Booleans (e.g., implementing a sieve of Eratosthenes) fitting more of them in the cache will almost always end up a net gain. The reduction in memory use will gain you a lot more than the bit manipulation loses.
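To make the bit-manipulation cost concrete, here is a sketch of the shift-and-mask work a packed representation does on every access (the library's proxy reference does the equivalent internally; the helper names are mine):

#include <cstddef>
#include <cstdint>

// Sketch: one Boolean per bit, packed into 64-bit words.
inline bool get_bit(const std::uint64_t* words, std::size_t i)
{
    return (words[i / 64] >> (i % 64)) & 1u;
}

inline void set_bit(std::uint64_t* words, std::size_t i, bool value)
{
    const std::uint64_t mask = std::uint64_t(1) << (i % 64);
    if (value) words[i / 64] |= mask;  // set the bit
    else       words[i / 64] &= ~mask; // clear the bit
}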
Most of the arguments against std::vector<bool> come back to the fact that it is not a standard container (i.e., it does not meet the requirements for a container). IMO, this is mostly a question of expectations -- since it says vector, many people expect it to be a container (other types of vectors are), and they often react negatively to the fact that vector<bool> isn't a container.
If you're using the vector in a way that really requires it to be a container, then you probably want to use some other combination -- either deque<bool> or vector<char> can work fine. Think before you do that though -- there's a lot of (lousy, IMO) advice that vector<bool> should be avoided in general, with little or no explanation of why it should be avoided at all, or under what circumstances it makes a real difference to you.
Yes, there are situations where something else will work better. If you're in one of those situations, using something else is clearly a good idea. But, be sure you're really in one of those situations first. Anybody who tells you (for example) that "Herb says you should use vector<char>" without a lot of explanation about the tradeoffs involved should not be trusted.
Let's give a real example. Since it was mentioned in the comments, let's consider the Sieve of Eratosthenes:
#include <vector>
#include <iostream>
#include <iterator>
#include <chrono>

unsigned long primes = 0;

template <class bool_t>
unsigned long sieve(unsigned max) {
    std::vector<bool_t> sieve(max, false);
    sieve[0] = sieve[1] = true;

    for (int i = 2; i < max; i++) {
        if (!sieve[i]) {
            ++primes;
            for (int temp = 2 * i; temp < max; temp += i)
                sieve[temp] = true;
        }
    }
    return primes;
}

// Warning: auto return type will fail with older compilers
// Fine with g++ 5.1 and VC++ 2015 though.
//
template <class F>
auto timer(F f, int max) {
    auto start = std::chrono::high_resolution_clock::now();
    primes += f(max);
    auto stop = std::chrono::high_resolution_clock::now();
    return stop - start;
}

int main() {
    using namespace std::chrono;
    unsigned number = 100000000;

    auto using_bool = timer(sieve<bool>, number);
    auto using_char = timer(sieve<char>, number);

    std::cout << "ignore: " << primes << "\n";
    std::cout << "Time using bool: " << duration_cast<milliseconds>(using_bool).count() << "\n";
    std::cout << "Time using char: " << duration_cast<milliseconds>(using_char).count() << "\n";
}
We've used a large enough array that we can expect a large portion of it to occupy main memory. I've also gone to a little pain to ensure that the only thing that changes between one invocation and the other is the use of a vector<char> vs. vector<bool>. Here are some results. First with VC++ 2015:
ignore: 34568730
Time using bool: 2623
Time using char: 3108
...then the time using g++ 5.1:
ignore: 34568730
Time using bool: 2359
Time using char: 3116
Obviously, the vector<bool> wins in both cases--by around 15% with VC++, and over 30% with gcc. Also note that in this case, I've chosen the size to show vector<char> in quite favorable light. If, for example, I reduce number from 100000000 to 10000000, the time differential becomes much larger:
ignore: 3987474
Time using bool: 72
Time using char: 249
Although I haven't done a lot of work to confirm, I'd guess that in this case, the version using vector<bool> is saving enough space that the array fits entirely in the cache, while the vector<char> is large enough to overflow the cache, and involve a great deal of main memory access.
You should usually avoid std::vector<bool> because it is not a standard container. It's a packed version, so it breaks some valuable guarantees usually given by a vector. A valid alternative would be to use std::vector<char> which is what Herb Sutter recommends.
You can read more about it in his GotW on the subject.
Update:
As has been pointed out, vector<bool> can be used to good effect, since a packed representation improves locality on large data sets. It may very well be the fastest alternative depending on circumstances. However, I would still not recommend it by default, since it breaks the promises established by std::vector, and the packing is a speed/memory tradeoff (though, as the measurements above show, it can end up beneficial in both speed and memory).
If you choose to use it, I would do so after measuring it against vector<char> for your application. Even then, I'd recommend using a typedef to refer to it via a name which does not seem to make the guarantees which it does not hold.
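A minimal sketch of that typedef suggestion (the alias name is my own):

#include <vector>

// The alias advertises the packed, proxy-based behavior instead of
// implying a regular container of bool.
using packed_bool_vector = std::vector<bool>;

packed_bool_vector sieve_flags(1000000, false);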
#include "boost/dynamic_bitset.hpp"
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
int main(int, char*[])
{
auto gen = std::bind(std::uniform_int_distribution<>(0, 1), std::default_random_engine());
std::vector<char> randomValues(1000000);
for (char & randomValue : randomValues)
{
randomValue = static_cast<char>(gen());
}
// many accesses, few initializations
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
std::vector<bool> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Time taken1: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< " milliseconds" << std::endl;
auto start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
boost::dynamic_bitset<> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end2 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken2: " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count()
<< " milliseconds" << std::endl;
auto start3 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
std::vector<char> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end3 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken3: " << std::chrono::duration_cast<std::chrono::milliseconds>(end3 - start3).count()
<< " milliseconds" << std::endl;
// few accesses, many initializations
auto start4 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
std::vector<bool> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end4 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken4: " << std::chrono::duration_cast<std::chrono::milliseconds>(end4 - start4).count()
<< " milliseconds" << std::endl;
auto start5 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
boost::dynamic_bitset<> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end5 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken5: " << std::chrono::duration_cast<std::chrono::milliseconds>(end5 - start5).count()
<< " milliseconds" << std::endl;
auto start6 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
std::vector<char> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end6 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken6: " << std::chrono::duration_cast<std::chrono::milliseconds>(end6 - start6).count()
<< " milliseconds" << std::endl;
return EXIT_SUCCESS;
}
Time taken1: 1821 milliseconds
Time taken2: 1722 milliseconds
Time taken3: 25 milliseconds
Time taken4: 1987 milliseconds
Time taken5: 1993 milliseconds
Time taken6: 10970 milliseconds
dynamic_bitset performs about the same as std::vector<bool>.
If you allocate many times but access each array only a few times, go for std::vector<bool>, because it has a lower allocation/initialization time.
If you allocate once and access many times, go for std::vector<char>, because of its faster access.
Also keep in mind that std::vector<bool> is NOT safe to use in multithreading, because you might write to different bits that live in the same byte.
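To illustrate that last point with a hedged sketch: two threads writing neighbouring elements of a std::vector<bool> may modify the same underlying word, which is a data race, while with std::vector<char> each element is its own memory location:

#include <thread>
#include <vector>

int main()
{
    // UNSAFE with std::vector<bool>: elements 0 and 1 can live in the
    // same word, so two unsynchronized writers would race:
    //   std::vector<bool> packed(64);
    //   std::thread a([&]{ packed[0] = true; });  // same word as below
    //   std::thread b([&]{ packed[1] = true; });

    // SAFE with std::vector<char>: distinct elements, distinct bytes.
    std::vector<char> flags(64, 0);
    std::thread a([&] { flags[0] = 1; });
    std::thread b([&] { flags[1] = 1; });
    a.join();
    b.join();
    return 0;
}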
It appears that the size of a dynamic bitset cannot be changed:
"The dynamic_bitset class is nearly identical to the std::bitset class. The difference is that the size of the dynamic_bitset (the number of bits) is specified at run-time during the construction of a dynamic_bitset object, whereas the size of a std::bitset is specified at compile-time through an integer template parameter." (from http://www.boost.org/doc/libs/1_36_0/libs/dynamic_bitset/dynamic_bitset.html)
As such, it should be slightly faster since it will have slightly less overhead than a vector, but you lose the ability to insert elements.
UPDATE: I just realized that the OP was asking about vector<bool> vs. bitset, and my answer does not answer the question, but I think I should leave it: if you search for "c++ vector bool slow", you end up here.
vector<bool> is terribly slow. At least on my Arch Linux system (you can probably get a better implementation or something... but I was really surprised). If anybody has any suggestions why this is so slow, I'm all ears! (Sorry for the blunt beginning; here's the more professional part.)
I've written two implementations of the SOE (sieve of Eratosthenes), and the 'close to the metal' C implementation is 10 times faster. sievec.c is the C implementation, and sievestl.cpp is the C++ implementation. I just compiled with make (implicit rules only, no makefile), and the results were 1.4 sec for the C version and 12 sec for the C++/STL version:
sievecmp % make -B sievec && time ./sievec 27
cc sievec.c -o sievec
aa 1056282
./sievec 27 1.44s user 0.01s system 100% cpu 1.455 total
and
sievecmp % make -B sievestl && time ./sievestl 27
g++ sievestl.cpp -o sievestl
1056282./sievestl 27 12.12s user 0.01s system 100% cpu 12.114 total
sievec.c is as follows:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long prime_t;
typedef unsigned long word_t;

#define LOG_WORD_SIZE 6

#define INDEX(i) ((i)>>(LOG_WORD_SIZE))
#define MASK(i) ((word_t)(1) << ((i)&(((word_t)(1)<<LOG_WORD_SIZE)-1)))

#define GET(p,i) (p[INDEX(i)]&MASK(i))
#define SET(p,i) (p[INDEX(i)]|=MASK(i))
#define RESET(p,i) (p[INDEX(i)]&=~MASK(i))

#define p2i(p) ((p)>>1) // (((p-2)>>1))
#define i2p(i) (((i)<<1)+1) // ((i)*2+3)

unsigned long find_next_zero(unsigned long from,
                             unsigned long *v,
                             size_t N){
    size_t i;
    for (i = from+1; i < N; i++) {
        if(GET(v,i)==0) return i;
    }
    return -1;
}

int main(int argc, char *argv[])
{
    size_t N = atoi(argv[1]);
    N = 1lu<<N;
    // printf("%u\n",N);

    unsigned long *v = malloc(N/8);
    for(size_t i = 0; i < N/64; i++) v[i]=0;

    unsigned long p = 3;
    unsigned long pp = p2i(p * p);

    while( pp <= N){
        for(unsigned long q = pp; q < N; q += p ){
            SET(v,q);
        }
        p = p2i(p);
        p = find_next_zero(p,v,N);
        p = i2p(p);
        pp = p2i(p * p);
    }

    unsigned long sum = 0;
    for(unsigned long i = 0; i+2 < N; i++)
        if(GET(v,i)==0 && GET(v,i+1)==0) {
            unsigned long p = i2p(i);
            // cout << p << ", " << p+2 << endl;
            sum++;
        }
    printf("aa %lu\n",sum);

    // free(v);
    return 0;
}
sievestl.cpp is as follows:
#include <iostream>
#include <vector>
#include <sstream>
using namespace std;

inline unsigned long i2p(unsigned long i){ return (i<<1)+1; }
inline unsigned long p2i(unsigned long p){ return (p>>1); }

// NB: v is taken by value here, so every call copies the whole bitset
inline unsigned long find_next_zero(unsigned long from, vector<bool> v){
    size_t N = v.size();
    for (size_t i = from+1; i < N; i++) {
        if(v[i]==0) return i;
    }
    return -1;
}

int main(int argc, char *argv[])
{
    stringstream ss;
    ss << argv[1];
    size_t N;
    ss >> N;
    N = 1lu<<N;
    // cout << N << endl;

    vector<bool> v(N);

    unsigned long p = 3;
    unsigned long pp = p2i(p * p);

    while( pp <= N){
        for(unsigned long q = pp; q < N; q += p ){
            v[q] = 1;
        }
        p = p2i(p);
        p = find_next_zero(p,v);
        p = i2p(p);
        pp = p2i(p * p);
    }

    unsigned sum = 0;
    for(unsigned long i = 0; i+2 < N; i++)
        if(v[i]==0 and v[i+1]==0) {
            unsigned long p = i2p(i);
            // cout << p << ", " << p+2 << endl;
            sum++;
        }
    cout << sum;
    return 0;
}