While trying to compute the variance of row vectors in large matrices, I've noticed some odd behavior with Eigen. If I chain all the required operations I get extremely slow performance, whereas computing a partial result and then performing the exact same operations on it is much faster. This seems to go against the Eigen docs/FAQ, which say to avoid temporaries.
So my question is whether there is some known pitfall in the library I should avoid, and how to spot situations where this type of slowdown might occur.
Here's the code I've used to test this. I've tried compiling it with MSVC (-O2) and MinGW GCC (-O3) on Windows. The "row variance with partial eval" version runs in around 560 ms with GCC and 1 s with MSVC, while the version without the partial result takes around 90 s with GCC and 104 s with MSVC: a pretty absurd difference. I didn't try it, but I imagine even a sequence of naive for loops would be a lot faster than 90 seconds...
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstdlib>    // std::rand
#include <functional>
#include "Eigen/Dense"

void printTimespan(std::chrono::nanoseconds timeSpan)
{
    using namespace std::chrono;
    std::cout << "Timing ended:\n"
        << "\t ms: " << duration_cast<milliseconds>(timeSpan).count() << '\n'
        << "\t us: " << duration_cast<microseconds>(timeSpan).count() << '\n'
        << "\t ns: " << timeSpan.count() << '\n';
}

class Timer
{
    std::chrono::steady_clock::time_point start_;
public:
    void start()
    {
        start_ = std::chrono::steady_clock::now();
    }
    void stop()
    {
        timings.push_back((std::chrono::steady_clock::now() - start_).count());
    }
    std::vector<long long> timings;
};

std::vector<float> buildBuffer(size_t rows, size_t cols)
{
    std::vector<float> buffer;
    buffer.reserve(rows * cols);
    for (size_t i = 0; i < rows; i++)
    {
        for (size_t j = 0; j < cols; j++)
        {
            buffer.push_back(std::rand() % 1000);
        }
    }
    return buffer;
}

using EigenArr = Eigen::Array<float, -1, -1, Eigen::RowMajor>;
using EigenMap = Eigen::Map<EigenArr>;

std::vector<float> benchmark(std::function<EigenArr(const EigenMap&)> func)
{
    constexpr size_t rows = 2000, cols = 200, repetitions = 1000;
    std::vector<float> buffer = buildBuffer(rows, cols);
    EigenMap map(buffer.data(), rows, cols);
    EigenArr res;
    std::vector<float> means; // keeps the results alive so the compiler can't optimize the work away
    Timer timer;
    for (size_t i = 0; i < repetitions; i++)
    {
        timer.start();
        res = func(map);
        timer.stop();
        means.push_back(res.mean());
    }
    Eigen::Map<Eigen::Vector<long long, -1>> timingsMap(timer.timings.data(), timer.timings.size());
    printTimespan(std::chrono::nanoseconds(timingsMap.sum()));
    return means;
}

int main()
{
    std::cout << "mean center rows\n";
    benchmark([](const EigenMap& map)
    {
        return (map.colwise() - map.rowwise().mean()).eval();
    });

    std::cout << "squared deviations\n";
    benchmark([](const EigenMap& map)
    {
        return (map.colwise() - map.rowwise().mean()).square().eval();
    });

    std::cout << "row variance with partial eval\n";
    benchmark([](const EigenMap& map)
    {
        EigenArr partial = (map.colwise() - map.rowwise().mean()).square().eval();
        return (partial.rowwise().sum() / (map.cols() - 1)).eval();
    });

    std::cout << "row variance\n";
    benchmark([](const EigenMap& map)
    {
        return ((map.colwise() - map.rowwise().mean()).square().rowwise().sum() / (map.cols() - 1)).eval();
    });
}
I suspect it's the double rowwise() in the slower one.
A lot of operations in Eigen are expression templates, computed on demand rather than into temporaries; this avoids unnecessary copies of the data. But I suspect that every time the outer rowwise() expression is asked for a coefficient, it re-evaluates the inner expression, including the rowwise().mean() reduction over the entire row, roughly squaring the number of operations. By saving a copy once, the partial-eval version prevents each cell from being evaluated multiple times.
You could also keep it on one line by calling .eval() after the square().
The other possibility is just a cache issue, if the chained evaluation forces it to skip around in memory a lot.
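For illustration, here is a minimal sketch of that one-line fix, reusing the benchmark harness from the question. The only change from the slow version is the extra .eval() after square(), which materializes the squared deviations once, so the outer rowwise().sum() reads plain memory instead of re-evaluating the whole expression tree for every coefficient:
std::cout << "row variance, eval after square\n";
benchmark([](const EigenMap& map)
{
    return ((map.colwise() - map.rowwise().mean())
                .square().eval() // forced temporary, computed once
                .rowwise().sum() / (map.cols() - 1)).eval();
});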
Related
I use the Eigen library to perform sparse matrix operations, in particular to fill a sparse matrix. But the rows and cols are very large in our case, which results in a long time for filling the sparse matrix. Is there an efficient way to do this (maybe with another library)?
Below is my code:
SparseMatrix mat(rows,cols);
mat.reserve(VectorXi::Constant(cols,6));
for each i,j such that v_ij != 0
    mat.insert(i,j) = v_ij;
mat.makeCompressed();
The order in which a SparseMatrix is filled can make an enormous difference in computation time. To fill a SparseMatrix matrix quickly, the elements should be addressed in a sequence that corresponds to the storage order of the SparseMatrix. By default, the storage order in Eigen's SparseMatrix is column major, but it is easy to change this.
The following code demonstrates the time difference between a rowwise filling of two sparse matrices with different storage order. The square sparse matrices are relatively small and nominally identical. While the RowMajor matrix is almost instantly filled, it takes a much longer time (about 30 seconds on my desktop computer) in the case of ColMajor storage format.
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/SparseCore>
#include <random>

using namespace Eigen;

typedef SparseMatrix<double, RowMajor> SpMat_RM;
typedef SparseMatrix<double, ColMajor> SpMat_CM;

// compile with -std=c++11 -O3
int main() {
    const int n = 1e4;
    const int nnzpr = 50;

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> randInt(0, n-1);

    SpMat_RM m_RM(n,n);
    m_RM.reserve(n);
    SpMat_CM m_CM(n,n);
    m_CM.reserve(n);

    std::cout << "Row-wise filling of [" << n << " x " << n << "] sparse matrix (RowMajor) ..." << std::flush;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < nnzpr; ++j) {
            int col = randInt(gen);
            double val = 1.; // v_ij
            m_RM.coeffRef(i,col) = val;
        }
    }
    m_RM.makeCompressed();
    std::cout << "done." << std::endl;

    std::cout << "Row-wise filling of [" << n << " x " << n << "] sparse matrix (ColMajor) ..." << std::flush;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < nnzpr; ++j) {
            int col = randInt(gen);
            double val = 1.; // v_ij
            m_CM.coeffRef(i,col) = val;
        }
    }
    m_CM.makeCompressed();
    std::cout << "done." << std::endl;
}
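If the fill order can't be made to match the storage order, another standard option is to collect the entries in a triplet list and let setFromTriplets() sort and compress them in one pass. A sketch only, not benchmarked against the code above; the 6-nonzeros-per-column estimate mirrors the reserve() call in the question:
#include <vector>
#include <Eigen/SparseCore>

Eigen::SparseMatrix<double> fillFromTriplets(int rows, int cols)
{
    std::vector<Eigen::Triplet<double>> entries;
    entries.reserve(static_cast<size_t>(cols) * 6); // rough nonzero estimate

    // ... append one triplet per nonzero, in any order ...
    entries.emplace_back(0, 0, 1.0); // example entry (i, j, v_ij)

    Eigen::SparseMatrix<double> mat(rows, cols);
    mat.setFromTriplets(entries.begin(), entries.end()); // sorts and compresses
    return mat;
}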
I am trying to compare the performance of std::sort (on a std::vector of doubles) vs Intel IPP sort.
I am running this test on an Intel Xeon processor, model name: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz.
I am sorting a vector of 2,000,000 elements, 200 times over. I have tried 2 different IPP sort routines, viz. ippsSortDescend_64f_I and ippsSortRadixDescend_64f_I. In all cases, the IPP sort was at least 5 to 10 times slower than std::sort. I was expecting the IPP sort to perhaps be slower for smaller arrays, but otherwise generally faster than std::sort. Am I missing something here? What am I doing wrong?
std::sort is consistently faster in all my test cases.
Here is my program:
#include <array>
#include <iostream>
#include <algorithm>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/timeb.h>
#include <vector>
#include <chrono>
#include "ipp.h"

using namespace std;

const int SIZE = 2000000;
const int ITERS = 200;

// Chrono typedefs
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds microseconds;

//////////////////////////////////// std ///////////////////////////////////
typedef vector<double> myList;

void initialize(myList & l, Ipp64f* ptr)
{
    double randomNum;
    for (int i = 0; i < SIZE; i++)
    {
        randomNum = 1.0 * rand() / (RAND_MAX / 2) - 1;
        l.push_back(randomNum);
        ptr[i] = randomNum;
    }
}

void test_sort()
{
    array<myList, ITERS> list;
    array<Ipp64f*, ITERS> ippList;

    // allocate
    for (int i = 0; i < ITERS; i++)
    {
        list[i].reserve(SIZE);
        ippList[i] = ippsMalloc_64f(SIZE);
    }

    // initialize
    for (int i = 0; i < ITERS; i++)
    {
        initialize(list[i], ippList[i]);
    }

    cout << "\n\nTest Case 1: std::sort\n";
    cout << "========================\n";
    // sort vector
    Clock::time_point t0 = Clock::now();
    for (int i = 0; i < ITERS; i++)
    {
        std::sort(list[i].begin(), list[i].end());
    }
    Clock::time_point t1 = Clock::now();
    microseconds ms = std::chrono::duration_cast<microseconds>(t1 - t0);
    std::cout << ms.count() << " micros" << std::endl;

    ////////////////////////////////// IPP ////////////////////////////////////////
    cout << "\n\nTest Case 2: ipp::sort\n";
    cout << "========================\n";
    // sort ipp
    Clock::time_point t2 = Clock::now();
    for (int i = 0; i < ITERS; i++)
    {
        ippsSortAscend_64f_I(ippList[i], SIZE);
    }
    Clock::time_point t3 = Clock::now();
    microseconds ms1 = std::chrono::duration_cast<microseconds>(t3 - t2);
    std::cout << ms1.count() << " micros" << std::endl;

    for (int i = 0; i < ITERS; i++)
    {
        ippsFree(ippList[i]);
    }
}

///////////////////////////////////////////////////////////////////////////////////////
int main()
{
    srand(time(NULL));
    cout << "Test for sorting an array of structures.\n" << endl;
    cout << "Test case: \nSort an array of structs (" << ITERS << " iterations) with double of length " << SIZE << ". \n";
    IppStatus status = ippInit();
    test_sort();
    return 0;
}
/////////////////////////////////////////////////////////////////////////////
compilation command is:
/share/intel/bin/icc -O2 -I$(IPPROOT)/include sorting.cpp -lrt -L$(IPPROOT)/lib/intel64 -lippi -lipps -lippvm -lippcore -std=c++0x
Program output:
Test for sorting an array of structures.
Test case:
Sort an array of structs (200 iterations) with double of length 2000000.
Test Case 1: std::sort
========================
38117024 micros
Test Case 2: ipp::sort
========================
48917686 micros
I have run your code on my computer (Core i7 860).

std::sort                  32,763,268 us (~33 s)
ippsSortAscend_64f_I       34,217,517 us (~34 s)
ippsSortRadixAscend_64f_I  15,319,053 us (~15 s)

These are the expected results. std::sort is inlined and highly optimized, while ippsSort_* carries function-call overhead plus the internal checks performed by all IPP functions. That should explain the slight slowdown of the ippsSortAscend function. Radix sort is still twice as fast, as expected, since it is not a comparison-based sort.
For more accurate results you need to:
compare the sorting of exactly the same distributions of random numbers;
remove the randomization from the timed region;
use the ippsSort*_32f functions, to sort 'float' (not 'double') in the IPP case.
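As a rough sketch of the first two points (reusing Clock, microseconds, SIZE, and ITERS from the question): keep one master buffer, restore it before each repetition, and accumulate only the time spent inside the sort call itself, so both routines sort exactly the same data and the (re)initialization stays untimed:
std::vector<double> master(SIZE);
for (auto &x : master) x = 1.0 * rand() / (RAND_MAX / 2) - 1;

std::vector<double> work(SIZE);
Ipp64f* ippWork = ippsMalloc_64f(SIZE);

microseconds stdTotal{0}, ippTotal{0};
for (int i = 0; i < ITERS; ++i)
{
    work = master;                                    // reset input, untimed
    Clock::time_point t0 = Clock::now();
    std::sort(work.begin(), work.end());
    stdTotal += std::chrono::duration_cast<microseconds>(Clock::now() - t0);

    std::copy(master.begin(), master.end(), ippWork); // same data, untimed
    Clock::time_point t1 = Clock::now();
    ippsSortAscend_64f_I(ippWork, SIZE);
    ippTotal += std::chrono::duration_cast<microseconds>(Clock::now() - t1);
}
ippsFree(ippWork);
std::cout << "std::sort: " << stdTotal.count() << " micros, "
          << "ippsSortAscend_64f_I: " << ippTotal.count() << " micros\n";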
I guess you've forgotten to call ippInit() before the measurement.
I was testing the performance of these two operations, and with G++ 4.7.3 the string::operator+= version is about 2 times faster than vector<bool>::push_back. What can be the cause of such a big difference?
Here is my simple test, compiled with g++ -O2 --std=c++11:
#include <iostream>
#include <ctime>
#include <string>
#include <vector>

using namespace std;

class Timer {
public:
    Timer(const std::string &label)
        : label_(label)
    {
        begin_clock_ = clock();
        cout << label << "- Timer starts!" << endl;
    }
    ~Timer() {
        clock_t clock_used = clock() - begin_clock_;
        cout << label_ << "- Clock used:" << clock_used
             << " Time:" << clock_used / CLOCKS_PER_SEC << endl;
    }
private:
    clock_t begin_clock_;
    string label_;
};

int str(int loop)
{
    Timer t("str");
    string s;
    for (int i = 0; i < loop; ++i)
        s += (i % 2);
    return s.length();
}

int vec(int loop)
{
    Timer t("vec");
    vector<bool> v;
    for (int i = 0; i < loop; ++i)
        v.push_back(i % 2);
    return v.size();
}

int main()
{
    int loop = 1000000000;
    int s1 = str(loop);
    int s2 = vec(loop);
    cout << "s1=" << s1 << endl;
    cout << "s2=" << s2 << endl;
}
Strings and vectors both store their content contiguously. If there's not enough room for adding a new element, the capacity must be increased (memory allocation) and the existing content must be moved to the new location.
Hence, the performance should depend significantly on the allocation strategy of your implementation. If one container reserves bigger chunks when the current capacity is exhausted, it will be more efficient (less allocation, less moving).
Of course, the results are implementation dependent. In my tests, for example, the vector implementation was one third faster than the string variant.
Here is how to see the effect:
int str(int loop)
{
    Timer t("str");
    string s;
    size_t capa = 0, ncapa, alloc = 0; // counters for monitoring allocations
    long long mw = 0;                  // elements moved on reallocation
    for (int i = 0; i < loop; ++i) {
        if ((ncapa = s.capacity()) != capa) // check if capacity increased
        {
            capa = ncapa; alloc++; mw += s.size();
        }
        s += (i % 2);
    }
    cout << "allocations: " << alloc << " and elements moved: " << mw << endl;
    return s.length();
}
On my compiler, for example, strings got a capacity of 2, 4, 8, ..., whereas vectors started immediately at 32, 64, ...
Now, this doesn't explain everything. If you want to see what part of the performance comes from the allocation policy and what part from other factors, you can simply pre-allocate your string (s.reserve(loop);) and your vector (v.reserve(loop);) before starting to add any elements, as in the sketch below.
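A minimal sketch of that pre-allocated variant (same Timer class as above; the only new line is the reserve() call):
int str_reserved(int loop)
{
    Timer t("str(reserved)");
    string s;
    s.reserve(loop); // pre-allocate: the growth policy drops out of the picture
    for (int i = 0; i < loop; ++i)
        s += (i % 2);
    return s.length();
}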
Is using a vector of boolean values slower than a dynamic bitset?
I just heard about Boost's dynamic_bitset, and I was wondering whether it is worth the trouble. Can I just use a vector of boolean values instead?
A great deal here depends on how many Boolean values you're working with.
Both bitset and vector<bool> normally use a packed representation where a Boolean is stored as only a single bit.
On one hand, that imposes some overhead in the form of bit manipulation to access a single value.
On the other hand, that also means many more of your Booleans will fit in your cache.
If you're using a lot of Booleans (e.g., implementing a sieve of Eratosthenes) fitting more of them in the cache will almost always end up a net gain. The reduction in memory use will gain you a lot more than the bit manipulation loses.
Most of the arguments against std::vector<bool> come back to the fact that it is not a standard container (i.e., it does not meet the requirements for a container). IMO, this is mostly a question of expectations -- since it says vector, many people expect it to be a container (other types of vectors are), and they often react negatively to the fact that vector<bool> isn't a container.
If you're using the vector in a way that really requires it to be a container, then you probably want to use some other combination -- either deque<bool> or vector<char> can work fine. Think before you do that though -- there's a lot of (lousy, IMO) advice that vector<bool> should be avoided in general, with little or no explanation of why it should be avoided at all, or under what circumstances it makes a real difference to you.
Yes, there are situations where something else will work better. If you're in one of those situations, using something else is clearly a good idea. But, be sure you're really in one of those situations first. Anybody who tells you (for example) that "Herb says you should use vector<char>" without a lot of explanation about the tradeoffs involved should not be trusted.
Let's give a real example. Since it was mentioned in the comments, let's consider the Sieve of Eratosthenes:
#include <vector>
#include <iostream>
#include <iterator>
#include <chrono>

unsigned long primes = 0;

template <class bool_t>
unsigned long sieve(unsigned max) {
    std::vector<bool_t> sieve(max, false);
    sieve[0] = sieve[1] = true;

    for (int i = 2; i < max; i++) {
        if (!sieve[i]) {
            ++primes;
            for (int temp = 2 * i; temp < max; temp += i)
                sieve[temp] = true;
        }
    }
    return primes;
}

// Warning: auto return type will fail with older compilers
// Fine with g++ 5.1 and VC++ 2015 though.
//
template <class F>
auto timer(F f, int max) {
    auto start = std::chrono::high_resolution_clock::now();
    primes += f(max);
    auto stop = std::chrono::high_resolution_clock::now();
    return stop - start;
}

int main() {
    using namespace std::chrono;

    unsigned number = 100000000;

    auto using_bool = timer(sieve<bool>, number);
    auto using_char = timer(sieve<char>, number);

    std::cout << "ignore: " << primes << "\n";
    std::cout << "Time using bool: " << duration_cast<milliseconds>(using_bool).count() << "\n";
    std::cout << "Time using char: " << duration_cast<milliseconds>(using_char).count() << "\n";
}
We've used a large enough array that we can expect a large portion of it to occupy main memory. I've also gone to some pains to ensure that the only thing that changes between one invocation and the other is the use of a vector<char> vs. a vector<bool>. Here are some results. First with VC++ 2015:
ignore: 34568730
Time using bool: 2623
Time using char: 3108
...then the time using g++ 5.1:
ignore: 34568730
Time using bool: 2359
Time using char: 3116
Obviously, the vector<bool> wins in both cases--by around 15% with VC++, and over 30% with gcc. Also note that in this case, I've chosen the size to show vector<char> in quite favorable light. If, for example, I reduce number from 100000000 to 10000000, the time differential becomes much larger:
ignore: 3987474
Time using bool: 72
Time using char: 249
Although I haven't done a lot of work to confirm, I'd guess that in this case, the version using vector<bool> is saving enough space that the array fits entirely in the cache, while the vector<char> is large enough to overflow the cache, and involve a great deal of main memory access.
You should usually avoid std::vector<bool> because it is not a standard container. It's a packed version, so it breaks some valuable guarantees usually given by a vector. A valid alternative would be to use std::vector<char> which is what Herb Sutter recommends.
You can read more about it in his GotW on the subject.
Update:
As has been pointed out, vector<bool> can be used to good effect, as a packed representation improves locality on large data sets. It may very well be the fastest alternative depending on circumstances. However, I would still not recommend it by default, since it breaks many of the promises established by std::vector, and the packing, nominally a speed-for-memory tradeoff, can end up being beneficial in both speed and memory.
If you choose to use it, I would do so after measuring it against vector<char> for your application. Even then, I'd recommend using a typedef to refer to it via a name which does not seem to make the guarantees which it does not hold.
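For example, a minimal sketch of such a typedef (the name here is made up; pick whatever signals the tradeoff in your codebase):
// A name that says "packed bits" instead of promising full vector semantics.
typedef std::vector<bool> packed_bit_vector;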
#include "boost/dynamic_bitset.hpp"
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
int main(int, char*[])
{
auto gen = std::bind(std::uniform_int_distribution<>(0, 1), std::default_random_engine());
std::vector<char> randomValues(1000000);
for (char & randomValue : randomValues)
{
randomValue = static_cast<char>(gen());
}
// many accesses, few initializations
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
std::vector<bool> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Time taken1: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< " milliseconds" << std::endl;
auto start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
boost::dynamic_bitset<> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end2 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken2: " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count()
<< " milliseconds" << std::endl;
auto start3 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
std::vector<char> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end3 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken3: " << std::chrono::duration_cast<std::chrono::milliseconds>(end3 - start3).count()
<< " milliseconds" << std::endl;
// few accesses, many initializations
auto start4 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
std::vector<bool> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end4 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken4: " << std::chrono::duration_cast<std::chrono::milliseconds>(end4 - start4).count()
<< " milliseconds" << std::endl;
auto start5 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
boost::dynamic_bitset<> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end5 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken5: " << std::chrono::duration_cast<std::chrono::milliseconds>(end5 - start5).count()
<< " milliseconds" << std::endl;
auto start6 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
std::vector<char> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end6 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken6: " << std::chrono::duration_cast<std::chrono::milliseconds>(end6 - start6).count()
<< " milliseconds" << std::endl;
return EXIT_SUCCESS;
}
Time taken1: 1821 milliseconds
Time taken2: 1722 milliseconds
Time taken3: 25 milliseconds
Time taken4: 1987 milliseconds
Time taken5: 1993 milliseconds
Time taken6: 10970 milliseconds
dynamic_bitset performs essentially the same as std::vector<bool>.
If you allocate many times but only access the array you created a few times, go for std::vector<bool>, because it has lower allocation/initialization time.
If you allocate once and access many times, go for std::vector<char>, because of its faster access.
Also keep in mind that std::vector<bool> is NOT safe to use from multiple threads: writes to logically different bits may touch the same underlying byte, as in the sketch below.
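A minimal sketch of that hazard (deliberately racy code, assuming the usual packed representation; it compiles and runs, but the two writes constitute a data race):
#include <thread>
#include <vector>

int main()
{
    std::vector<bool> flags(16, false);
    // "Different elements", but bits 0 and 1 share the same underlying
    // word, so these unsynchronized writes are a data race (UB).
    std::thread a([&]{ flags[0] = true; });
    std::thread b([&]{ flags[1] = true; });
    a.join();
    b.join();
}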
Note that the size of a dynamic_bitset is specified at run time rather than at compile time:
"The dynamic_bitset class is nearly identical to the std::bitset class. The difference is that the size of the dynamic_bitset (the number of bits) is specified at run-time during the construction of a dynamic_bitset object, whereas the size of a std::bitset is specified at compile-time through an integer template parameter." (from http://www.boost.org/doc/libs/1_36_0/libs/dynamic_bitset/dynamic_bitset.html)
As such, it should be slightly faster, since it has slightly less overhead than a vector, but you lose the ability to insert elements at arbitrary positions (it can still be grown with resize() or push_back()).
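A quick sketch of what dynamic_bitset does and doesn't allow after construction (resize() and push_back() exist; middle insertion doesn't):
#include "boost/dynamic_bitset.hpp"

int main()
{
    boost::dynamic_bitset<> bits(8); // size chosen at run time...
    bits.resize(16);                 // ...and still changeable afterwards
    bits.push_back(true);            // appends a single bit at the end
    bits[3] = true;                  // random access works; there is no insert-in-the-middle
}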
UPDATE: I just realized that the OP was asking about vector<bool> vs. bitset, and my answer does not answer the question, but I think I should leave it: if you search for "c++ vector bool slow", you end up here.
vector<bool> is terribly slow. At least on my Arch Linux system (you can probably get a better implementation or something... but I was really surprised). If anybody has any suggestions why this is so slow, I'm all ears! (Sorry for the blunt beginning; here's the more professional part.)
I've written two implementations of the Sieve of Eratosthenes, and the 'close to the metal' C implementation is 10 times faster. sievec.c is the C implementation, and sievestl.cpp is the C++ implementation. I compiled with make (implicit rules only, no makefile), and the results were 1.4 s for the C version and 12 s for the C++/STL version:
sievecmp % make -B sievec && time ./sievec 27
cc sievec.c -o sievec
aa 1056282
./sievec 27 1.44s user 0.01s system 100% cpu 1.455 total
and
sievecmp % make -B sievestl && time ./sievestl 27
g++ sievestl.cpp -o sievestl
1056282./sievestl 27 12.12s user 0.01s system 100% cpu 12.114 total
sievec.c is as follows:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long prime_t;
typedef unsigned long word_t;

#define LOG_WORD_SIZE 6

#define INDEX(i)   ((i)>>(LOG_WORD_SIZE))
#define MASK(i)    ((word_t)(1) << ((i)&(((word_t)(1)<<LOG_WORD_SIZE)-1)))
#define GET(p,i)   (p[INDEX(i)]&MASK(i))
#define SET(p,i)   (p[INDEX(i)]|=MASK(i))
#define RESET(p,i) (p[INDEX(i)]&=~MASK(i))
#define p2i(p) ((p)>>1)     // (((p-2)>>1))
#define i2p(i) (((i)<<1)+1) // ((i)*2+3)

unsigned long find_next_zero(unsigned long from,
                             unsigned long *v,
                             size_t N){
    size_t i;
    for (i = from+1; i < N; i++) {
        if(GET(v,i)==0) return i;
    }
    return -1;
}

int main(int argc, char *argv[])
{
    size_t N = atoi(argv[1]);
    N = 1lu<<N;
    // printf("%u\n",N);

    unsigned long *v = malloc(N/8);
    for(size_t i = 0; i < N/64; i++) v[i]=0;

    unsigned long p = 3;
    unsigned long pp = p2i(p * p);
    while( pp <= N){
        for(unsigned long q = pp; q < N; q += p ){
            SET(v,q);
        }
        p = p2i(p);
        p = find_next_zero(p,v,N);
        p = i2p(p);
        pp = p2i(p * p);
    }

    unsigned long sum = 0;
    for(unsigned long i = 0; i+2 < N; i++)
        if(GET(v,i)==0 && GET(v,i+1)==0) {
            unsigned long p = i2p(i);
            // cout << p << ", " << p+2 << endl;
            sum++;
        }
    printf("aa %lu\n",sum);
    // free(v);
    return 0;
}
sievestl.cpp is as follows:
#include <iostream>
#include <vector>
#include <sstream>

using namespace std;

inline unsigned long i2p(unsigned long i){ return (i<<1)+1; }
inline unsigned long p2i(unsigned long p){ return (p>>1); }

// Note: v is taken by value here, so every call copies the entire bitset.
inline unsigned long find_next_zero(unsigned long from, vector<bool> v){
    size_t N = v.size();
    for (size_t i = from+1; i < N; i++) {
        if(v[i]==0) return i;
    }
    return -1;
}

int main(int argc, char *argv[])
{
    stringstream ss;
    ss << argv[1];
    size_t N;
    ss >> N;
    N = 1lu<<N;
    // cout << N << endl;

    vector<bool> v(N);

    unsigned long p = 3;
    unsigned long pp = p2i(p * p);
    while( pp <= N){
        for(unsigned long q = pp; q < N; q += p ){
            v[q] = 1;
        }
        p = p2i(p);
        p = find_next_zero(p,v);
        p = i2p(p);
        pp = p2i(p * p);
    }

    unsigned sum = 0;
    for(unsigned long i = 0; i+2 < N; i++)
        if(v[i]==0 and v[i+1]==0) {
            unsigned long p = i2p(i);
            // cout << p << ", " << p+2 << endl;
            sum++;
        }
    cout << sum;
    return 0;
}
So I am aware of this question, and others on SO that deal with the issue, but most of those deal with the asymptotic complexities of the data structures (to recap: a linked list theoretically has O(1) insertion and deletion at a known position, while a vector has O(n)).
I understand the complexities would seem to indicate that a list would be better, but I am more concerned with the real-world performance.
Note: This question was inspired by slides 45 and 46 of Bjarne Stroustrup's presentation at Going Native 2012 where he talks about how processor caching and locality of reference really help with vectors, but not at all (or enough) with lists.
Question: Is there a good way to test this using CPU time as opposed to wall time, and getting a decent way of "randomly" inserting and deleting elements that can be done beforehand so it does not influence the timings?
As a bonus, it would be nice to be able to apply this to two arbitrary data structures (say vector and hash maps or something like that) to find the "real world performance" on some hardware.
I guess if I were going to test something like this, I'd probably start with code something on this order:
#include <list>
#include <vector>
#include <algorithm>
#include <deque>
#include <time.h>
#include <iostream>
#include <iterator>

static const int size = 30000;

template <class T>
double insert(T &container) {
    srand(1234);
    clock_t start = clock();

    for (int i=0; i<size; ++i) {
        int value = rand();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        container.insert(pos, value);
    }
    // uncomment the following to verify correct insertion (in a small container).
    // std::copy(container.begin(), container.end(), std::ostream_iterator<int>(std::cout, "\t"));
    return double(clock()-start)/CLOCKS_PER_SEC;
}

template <class T>
double del(T &container) {
    srand(1234);
    clock_t start = clock();

    for (int i=0; i<size/2; ++i) {
        int value = rand();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        container.erase(pos);
    }
    return double(clock()-start)/CLOCKS_PER_SEC;
}

int main() {
    std::list<int> l;
    std::vector<int> v;
    std::deque<int> d;

    std::cout << "Insertion time for list: " << insert(l) << "\n";
    std::cout << "Insertion time for vector: " << insert(v) << "\n";
    std::cout << "Insertion time for deque: " << insert(d) << "\n\n";

    std::cout << "Deletion time for list: " << del(l) << '\n';
    std::cout << "Deletion time for vector: " << del(v) << '\n';
    std::cout << "Deletion time for deque: " << del(d) << '\n';
    return 0;
}
Since it uses clock, this should give processor time, not wall time (though some compilers, such as MS VC++, get that wrong). It doesn't try to measure the time for insertion exclusive of the time to find the insertion point, since 1) that would take a bit more work, and 2) I still can't figure out what it would accomplish. It's certainly not 100% rigorous, but given the disparity I see from it, I'd be a bit surprised to see a significant difference from more careful testing. For example, with MS VC++, I get:
Insertion time for list: 6.598
Insertion time for vector: 1.377
Insertion time for deque: 1.484
Deletion time for list: 6.348
Deletion time for vector: 0.114
Deletion time for deque: 0.82
With gcc I get:
Insertion time for list: 5.272
Insertion time for vector: 0.125
Insertion time for deque: 0.125
Deletion time for list: 4.259
Deletion time for vector: 0.109
Deletion time for deque: 0.109
Factoring out the search time would be somewhat non-trivial because you'd have to time each iteration separately. You'd need something more precise than clock usually is to produce meaningful results from that (more on the order of reading a clock-cycle register). Feel free to modify for that if you see fit; as I mentioned above, I lack motivation because I can't see how it's a sensible thing to do.
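For what it's worth, a minimal sketch of what that per-iteration timing could look like, reusing the size constant and insertion loop from the code above and reading the x86 cycle counter (__rdtsc from <x86intrin.h> on gcc/clang; x86-only, and cycle counts at this granularity are noisy):
#include <x86intrin.h> // __rdtsc (gcc/clang, x86 only)

template <class T>
unsigned long long insert_cycles(T &container) {
    srand(1234);
    unsigned long long cycles = 0;
    for (int i = 0; i < size; ++i) {
        int value = rand();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        unsigned long long c0 = __rdtsc(); // time only the insertion itself,
        container.insert(pos, value);      // excluding the search above
        cycles += __rdtsc() - c0;
    }
    return cycles;
}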
This is the program I wrote after watching that talk. I tried running each timing test in a separate process to make sure the allocators weren't doing anything sneaky to alter performance. I have amended the test to allow timing of the random number generation. If you are concerned it is affecting the results significantly, you can time it and subtract the time spent there from the rest of the timings; I get zero time spent there for anything but very large N. I used getrusage(), which I am pretty sure isn't portable to Windows, but it would be easy to substitute in something using clock() or whatever you like.
#include <assert.h>
#include <algorithm>
#include <iostream>
#include <list>
#include <string>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

void f(size_t const N)
{
    std::vector<int> c;
    //c.reserve(N);
    for (size_t i = 0; i < N; ++i) {
        int r = rand();
        auto p = std::find_if(c.begin(), c.end(), [=](int a) { return a >= r; });
        c.insert(p, r);
    }
}

void g(size_t const N)
{
    std::list<int> c;
    for (size_t i = 0; i < N; ++i) {
        int r = rand();
        auto p = std::find_if(c.begin(), c.end(), [=](int a) { return a >= r; });
        c.insert(p, r);
    }
}

int h(size_t const N)
{
    int r = 0;
    for (size_t i = 0; i < N; ++i) {
        r = rand();
    }
    return r;
}

double usage()
{
    struct rusage u;
    if (getrusage(RUSAGE_SELF, &u) == -1) std::abort();
    return
        double(u.ru_utime.tv_sec) + (u.ru_utime.tv_usec / 1e6) +
        double(u.ru_stime.tv_sec) + (u.ru_stime.tv_usec / 1e6);
}

int main(int argc, char* argv[])
{
    assert(argc >= 3);
    std::string const sel = argv[1];
    size_t const N = atoi(argv[2]);
    double t0, t1;
    srand(127);

    if (sel == "vector") {
        t0 = usage();
        f(N);
        t1 = usage();
    } else if (sel == "list") {
        t0 = usage();
        g(N);
        t1 = usage();
    } else if (sel == "rand") {
        t0 = usage();
        h(N);
        t1 = usage();
    } else {
        std::abort();
    }

    std::cout
        << (t1 - t0)
        << std::endl;
    return 0;
}
To get a set of results I used the following shell script.
seq=`perl -e 'for ($i = 10; $i < 100000; $i *= 1.1) { print int($i), " "; }'`
for i in $seq; do
    vt=`./a.out vector $i`
    lt=`./a.out list $i`
    echo $i $vt $lt
done