g++ 1000 times slower than visual studio using lists? - c++

Consider the following code snippet:
#include <iostream>
#include <ctime>
#include <vector>
#include <list>

using namespace std;

#define NUM_ITER 100000

int main() {
    clock_t t = clock();
    std::list<int> my_list;
    std::vector<std::list<int>::iterator> list_ptr;
    list_ptr.reserve(NUM_ITER);
    for (int i = 0; i < NUM_ITER; ++i) {
        my_list.push_back(0);
        list_ptr.push_back(--(my_list.end()));
    }
    while (my_list.size() > 0) {
        my_list.erase(list_ptr[list_ptr.size() - 1]);
        list_ptr.pop_back();
    }
    cout << "Done in: " << 1000 * (clock() - t) / CLOCKS_PER_SEC << " msec!" << endl;
}
When I compile and run it with Visual Studio, all optimizations enabled, I get the output:
Done in: 8 msec!
When I compile and run it with g++, using the flags
g++ main.cpp -pedantic -O2
I get the output
Done in: 7349 msec!
That is roughly 1000 times slower. Why is that? According to cppreference, calling erase on a list is supposed to take only constant time.
The code was compiled and executed on the same machine.

It might be that the list implementation shipped by GCC doesn't store the size, while the one MSVC ships does. In that case each size() call walks the whole list, so the erase loop is O(n^2) with GCC but O(n) with MSVC.
Anyway, C++11 mandates that list::size be constant time, so you may want to report this as a bug.

UPDATE Workaround:
You can avoid calling size() so many times:
size_t my_list_size = my_list.size();
while (my_list_size > 0) {
    my_list.erase(list_ptr[list_ptr.size() - 1]);
    --my_list_size;
    list_ptr.pop_back();
}
Now it reports 10 msec.
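An even simpler variant (my note, not from the original answer) sidesteps size() entirely: empty() is constant time on every implementation, so the manual counter isn't needed:

while (!my_list.empty()) { // empty() is O(1) even where size() is O(n)
    my_list.erase(list_ptr[list_ptr.size() - 1]);
    list_ptr.pop_back();
}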
EDIT
Their list implementation isn't as efficient. I tried replacing it with the Boost.Container equivalents:
#include <iostream>
#include <ctime>
#include <boost/container/vector.hpp>
#include <boost/container/list.hpp>

using namespace std;

#define NUM_ITER 100000

int main() {
    clock_t t = clock();
    boost::container::list<int> my_list;
    boost::container::vector<boost::container::list<int>::iterator> list_ptr;
    list_ptr.reserve(NUM_ITER);
    for (int i = 0; i < NUM_ITER; ++i) {
        my_list.push_back(rand());
        list_ptr.push_back(--(my_list.end()));
    }
    unsigned long long volatile accum = 0;
    while (my_list.size() > 0) {
        accum += *list_ptr[list_ptr.size() - 1];
        my_list.erase(list_ptr[list_ptr.size() - 1]);
        list_ptr.pop_back();
    }
    cout << "Done in: " << 1000 * (clock() - t) / CLOCKS_PER_SEC << " msec!" << endl;
    cout << "Accumulated: " << accum << "\n";
}
This now runs in ~0ms on my machine, vs. ~7s using std::list on the same machine.
sehe@desktop:/tmp$ ./test
Done in: 0 msec!
Accumulated: 107345864261546

Related

Not getting any output from c++ program

#include <iostream>
#include <vector>
#include <cmath>

void print(std::vector<int> const& a) {
    for (int i = 0; i < a.size(); i++) {
        std::cout << a.at(i) << " ";
    }
}

std::vector<int> factors(int n) {
    std::vector<int> vec = {};
    for (int i = 0; i < round(sqrt(n)); i++) {
        if (size(factors(i)) == 0) {
            vec.push_back(i);
        }
        std::cout << i;
    }
    return vec;
}

int main() {
    std::vector<int> vec = factors(600851475143);
    print(vec);
}
This is my C++ code for Project Euler #3.
I am new to C++, so my code might be completely wrong syntactically; however, I am not getting any build errors (using Visual Studio).
I am not getting any output, however. I understand this could be my fault and the program might just be running extremely slowly, but I programmed this in Python using the same iterative method and it worked perfectly with a fast runtime.
Edit:
I am, however, getting this message in the console:
D:\RANDOM PROGRAMMING STUFF\PROJECTEULER\c++\projecteuler3\x64\Debug\projecteuler3.exe (process 13552) exited with code 0.
Press any key to close this window . . .
If you enable your compiler warnings, you should see an overflow warning
prog.cc: In function 'int main()':
prog.cc:23:36: warning: overflow in conversion from 'long int' to 'int' changes value from '600851475143' to '-443946297' [-Woverflow]
23 | std::vector<int> vec = factors(600851475143);
So what gets passed to factors is not 600851475143 but -443946297.
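For the curious, the arithmetic behind that number (assuming the usual two's-complement wraparound): 600851475143 mod 2^32 = 3851020999, which exceeds the signed 32-bit maximum of 2147483647, so it wraps to 3851020999 - 4294967296 = -443946297.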
Of course Gaurav has given the correct answer. Here is how to fix it: change from int to unsigned long long to allow for the biggest integers supported without any external libraries.
#include <cmath>
#include <iostream>
#include <vector>

void print(std::vector<int> const& list) {
    for (auto const& item : list) {
        std::cout << item << "\n";
    }
}

std::vector<int> factors(unsigned long long n) {
    std::vector<int> vec = {};
    for (int i = 0; i < std::round(std::sqrt(n)); ++i) {
        if (factors(i).empty()) {
            vec.push_back(i);
        }
        //std::cout << i << ", ";
    }
    return vec;
}

int main() {
    std::vector<int> vec = factors(600851475143ULL);
    print(vec);
}
I have also made some other minor changes:
Changed the for loop in print to a range-based for loop
Added a delimiter in printing
Added std:: qualification to the library functions
Replaced size(vec) == 0 with empty() to improve readability
Another good habit is to compile with -Wall to enable more warnings and -Werror so you are actually forced to take care of all warnings instead of brushing them off.
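For instance, a typical GCC invocation (MSVC's rough equivalents are /W4 and /WX):
g++ -Wall -Wextra -Werror prog.cc -o prog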

std::sort vs intel ipp sort performance. what am I doing wrong?

I am trying to compare the performance of std::sort (using a std::vector of structs) vs. Intel IPP sort.
I am running this test on an Intel Xeon processor, model name: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz.
I am sorting a vector of 2000000 elements, 200 times. I have tried two different IPP sort routines, viz. ippsSortDescend_64f_I and ippsSortRadixDescend_64f_I. In all cases, IPP sort was at least 5 to 10 times slower than std::sort. I was expecting IPP sort to maybe be slower for smaller arrays, but otherwise it should generally be faster than std::sort. Am I missing something here? What am I doing wrong?
std::sort is consistently faster in all my test cases.
Here is my program
#include <array>
#include <iostream>
#include <algorithm>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/timeb.h>
#include <vector>
#include <chrono>
#include "ipp.h"

using namespace std;

const int SIZE = 2000000;
const int ITERS = 200;

// Chrono typedefs
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds microseconds;

//////////////////////////////////// std ///////////////////////////////////
typedef vector<double> myList;

void initialize(myList& l, Ipp64f* ptr)
{
    double randomNum;
    for (int i = 0; i < SIZE; i++)
    {
        randomNum = 1.0 * rand() / (RAND_MAX / 2) - 1;
        l.push_back(randomNum);
        ptr[i] = randomNum;
    }
}

void test_sort()
{
    array<myList, ITERS> list;
    array<Ipp64f*, ITERS> ippList;
    // allocate
    for (int i = 0; i < ITERS; i++)
    {
        list[i].reserve(SIZE);
        ippList[i] = ippsMalloc_64f(SIZE);
    }
    // initialize
    for (int i = 0; i < ITERS; i++)
    {
        initialize(list[i], ippList[i]);
    }
    cout << "\n\nTest Case 1: std::sort\n";
    cout << "========================\n";
    // sort vector
    Clock::time_point t0 = Clock::now();
    for (int i = 0; i < ITERS; i++)
    {
        std::sort(list[i].begin(), list[i].end());
    }
    Clock::time_point t1 = Clock::now();
    microseconds ms = std::chrono::duration_cast<microseconds>(t1 - t0);
    std::cout << ms.count() << " micros" << std::endl;

    ////////////////////////////////// IPP ////////////////////////////////////////
    cout << "\n\nTest Case 2: ipp::sort\n";
    cout << "========================\n";
    // sort ipp
    Clock::time_point t2 = Clock::now();
    for (int i = 0; i < ITERS; i++)
    {
        ippsSortAscend_64f_I(ippList[i], SIZE);
    }
    Clock::time_point t3 = Clock::now();
    microseconds ms1 = std::chrono::duration_cast<microseconds>(t3 - t2);
    std::cout << ms1.count() << " micros" << std::endl;

    for (int i = 0; i < ITERS; i++)
    {
        ippsFree(ippList[i]);
    }
}

///////////////////////////////////////////////////////////////////////////////////////
int main()
{
    srand(time(NULL));
    cout << "Test for sorting an array of structures.\n" << endl;
    cout << "Test case: \nSort an array of structs (" << ITERS << " iterations) with double of length " << SIZE << ". \n";
    IppStatus status = ippInit();
    test_sort();
    return 0;
}
/////////////////////////////////////////////////////////////////////////////
compilation command is:
/share/intel/bin/icc -O2 -I$(IPPROOT)/include sorting.cpp -lrt -L$(IPPROOT)/lib/intel64 -lippi -lipps -lippvm -lippcore -std=c++0x
Program output:
Test for sorting an array of structures.
Test case:
Sort an array of structs (200 iterations) with double of length 2000000.
Test Case 1: std::sort
========================
38117024 micros
Test Case 2: ipp::sort
========================
48917686 micros
I have run your code on my computer (Core i7 860).
std::sort 32,763,268 (~33s)
ippsSortAscend_64f_I 34,217,517 (~34s)
ippsSortRadixAscend_64f_I 15,319,053 (~15s)
These are the expected results. std::sort is inlined and highly optimized, while ippsSort_* has function-call overhead plus the internal checks performed by all IPP functions, which explains the slight slowdown of the ippsSortAscend function. Radix sort is still twice as fast, as expected, since it is not a comparison-based sort.
For more accurate results you need to:
compare sorting of exactly the same distributions of random numbers;
remove the randomization from the timed section;
use the ippsSort*_32f functions to sort float (not double) in the IPP case.
I guess you've forgotten to call ippInit() before the measurement.
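A minimal sketch of that ordering, reusing the names from the question's code (the point being that ippInit() and all data setup sit outside the timed region):

IppStatus status = ippInit();        // select the CPU-specific code path before timing
// ... allocate and fill ippList[] here, also outside the timed region ...
Clock::time_point t2 = Clock::now(); // the clock brackets only the sort calls
for (int i = 0; i < ITERS; i++)
{
    ippsSortAscend_64f_I(ippList[i], SIZE);
}
Clock::time_point t3 = Clock::now();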

Why is std::vector erase very slow in release only when I step over with the debugger?

Ok let's start over
I'm trying to erase an element from a std::vector, and for some reason it is very slow in release mode, but only when I step over it with the debugger.
Here is the complete source:
#include <iostream>
#include <vector>
#include <windows.h>

class data
{
public:
    int i;
};

void Test(int n)
{
    std::vector<data> v;
    data d;
    for (int i = 0; i < n; ++i)
    {
        v.push_back(d);
    }
    ULONGLONG nTick = GetTickCount64();
    v.erase(v.begin() + 1);
    std::cout << n << " " << GetTickCount64() - nTick << std::endl;
}

int main()
{
    Test(10000);
    Test(100000);
    Test(1000000);
    return 0;
}
When I step over the line
v.erase(v.begin()+1);
it takes, respectively, in release:
10000 -> 2 seconds
100000 -> 18 seconds
1000000 -> 182 seconds
but it is pretty much instant in debug for all of them. Why?

Comprehensive vector vs linked list benchmark for randomized insertions/deletions

So I am aware of this question, and others on SO that deal with the issue, but most of those deal with the complexities of the data structures (just to copy here: a linked list theoretically has O(1) insertion and deletion at a known position, while a vector has O(n)).
I understand the complexities would seem to indicate that a list would be better, but I am more concerned with the real-world performance.
Note: This question was inspired by slides 45 and 46 of Bjarne Stroustrup's presentation at Going Native 2012 where he talks about how processor caching and locality of reference really help with vectors, but not at all (or enough) with lists.
Question: Is there a good way to test this using CPU time as opposed to wall time, and is there a decent way of "randomly" inserting and deleting elements that can be decided beforehand so it does not influence the timings?
As a bonus, it would be nice to be able to apply this to two arbitrary data structures (say a vector and a hash map, or something like that) to find the "real-world performance" on some hardware.
I guess if I were going to test something like this, I'd probably start with code something on this order:
#include <list>
#include <vector>
#include <algorithm>
#include <deque>
#include <time.h>
#include <iostream>
#include <iterator>

static const int size = 30000;

template <class T>
double insert(T& container) {
    srand(1234);
    clock_t start = clock();
    for (int i = 0; i < size; ++i) {
        int value = rand();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        container.insert(pos, value);
    }
    // uncomment the following to verify correct insertion (in a small container).
    // std::copy(container.begin(), container.end(), std::ostream_iterator<int>(std::cout, "\t"));
    return double(clock() - start) / CLOCKS_PER_SEC;
}

template <class T>
double del(T& container) {
    srand(1234);
    clock_t start = clock();
    for (int i = 0; i < size / 2; ++i) {
        int value = rand();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        container.erase(pos);
    }
    return double(clock() - start) / CLOCKS_PER_SEC;
}

int main() {
    std::list<int> l;
    std::vector<int> v;
    std::deque<int> d;
    std::cout << "Insertion time for list: " << insert(l) << "\n";
    std::cout << "Insertion time for vector: " << insert(v) << "\n";
    std::cout << "Insertion time for deque: " << insert(d) << "\n\n";
    std::cout << "Deletion time for list: " << del(l) << '\n';
    std::cout << "Deletion time for vector: " << del(v) << '\n';
    std::cout << "Deletion time for deque: " << del(d) << '\n';
    return 0;
}
Since it uses clock, this should give processor time, not wall time (though some compilers, such as MS VC++, get that wrong). It doesn't try to measure the time for insertion exclusive of the time to find the insertion point, since 1) that would take a bit more work, and 2) I still can't figure out what it would accomplish. It's certainly not 100% rigorous, but given the disparity I see from it, I'd be a bit surprised to see a significant difference from more careful testing. For example, with MS VC++, I get:
Insertion time for list: 6.598
Insertion time for vector: 1.377
Insertion time for deque: 1.484
Deletion time for list: 6.348
Deletion time for vector: 0.114
Deletion time for deque: 0.82
With gcc I get:
Insertion time for list: 5.272
Insertion time for vector: 0.125
Insertion time for deque: 0.125
Deletion time for list: 4.259
Deletion time for vector: 0.109
Deletion time for deque: 0.109
Factoring out the search time would be somewhat non-trivial because you'd have to time each iteration separately. You'd need something more precise than clock usually is to produce meaningful results from that (more on the order of reading a clock-cycle register). Feel free to modify for that if you see fit -- as I mentioned above, I lack motivation because I can't see how it's a sensible thing to do.
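If you did want to try it, here is a rough sketch (my illustration, not part of the answer) that accumulates only the search cost using the x86 time-stamp counter; __rdtsc comes from <x86intrin.h> on GCC/Clang (<intrin.h> on MSVC), and raw TSC readings need care with serialization and frequency scaling before you trust them:

#include <algorithm>
#include <stdlib.h>
#include <x86intrin.h> // __rdtsc

// Variant of insert() above: counts cycles spent in lower_bound only,
// leaving the actual insertion outside the measured region.
template <class T>
unsigned long long insert_counting_search(T& container, int n) {
    srand(1234);
    unsigned long long search_cycles = 0;
    for (int i = 0; i < n; ++i) {
        int value = rand();
        unsigned long long c0 = __rdtsc();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        unsigned long long c1 = __rdtsc();
        search_cycles += c1 - c0; // accumulate search cost, excluding the insert
        container.insert(pos, value);
    }
    return search_cycles;
}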
This is the program I wrote after watching that talk. I tried running each timing test in a separate process to make sure the allocators weren't doing anything sneaky to alter performance. I have amended the test to allow timing of the random number generation. If you are concerned it is affecting the results significantly, you can time it and subtract the time spent there from the rest of the timings. But I get zero time spent there for anything but very large N. I used getrusage(), which I am pretty sure isn't portable to Windows, but it would be easy to substitute something using clock() or whatever you like.
#include <assert.h>
#include <algorithm>
#include <iostream>
#include <list>
#include <string>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

// insert N random ints into a vector, keeping it sorted
void f(size_t const N)
{
    std::vector<int> c;
    //c.reserve(N);
    for (size_t i = 0; i < N; ++i) {
        int r = rand();
        auto p = std::find_if(c.begin(), c.end(), [=](int a) { return a >= r; });
        c.insert(p, r);
    }
}

// insert N random ints into a list, keeping it sorted
void g(size_t const N)
{
    std::list<int> c;
    for (size_t i = 0; i < N; ++i) {
        int r = rand();
        auto p = std::find_if(c.begin(), c.end(), [=](int a) { return a >= r; });
        c.insert(p, r);
    }
}

// generate N random ints, to measure the cost of rand() alone
int h(size_t const N)
{
    int r = 0;
    for (size_t i = 0; i < N; ++i) {
        r = rand();
    }
    return r;
}

// user + system CPU time consumed so far, in seconds
double usage()
{
    struct rusage u;
    if (getrusage(RUSAGE_SELF, &u) == -1) std::abort();
    return
        double(u.ru_utime.tv_sec) + (u.ru_utime.tv_usec / 1e6) +
        double(u.ru_stime.tv_sec) + (u.ru_stime.tv_usec / 1e6);
}

int main(int argc, char* argv[])
{
    assert(argc >= 3);
    std::string const sel = argv[1];
    size_t const N = atoi(argv[2]);
    double t0, t1;
    srand(127);
    if (sel == "vector") {
        t0 = usage();
        f(N);
        t1 = usage();
    } else if (sel == "list") {
        t0 = usage();
        g(N);
        t1 = usage();
    } else if (sel == "rand") {
        t0 = usage();
        h(N);
        t1 = usage();
    } else {
        std::abort();
    }
    std::cout << (t1 - t0) << std::endl;
    return 0;
}
To get a set of results I used the following shell script.
seq=`perl -e 'for ($i = 10; $i < 100000; $i *= 1.1) { print int($i), " "; }'`
for i in $seq; do
vt=`./a.out vector $i`
lt=`./a.out list $i`
echo $i $vt $lt
done

Performance penalty using 'auto' keyword in Visual Studio 2010

Using the new auto keyword has degraded my code's execution times. I narrowed the problem down to the following simple code snippet:
#include <iostream>
#include <map>
#include <vector>
#include <deque>
#include <time.h>

using namespace std;

void func1(map<int, vector<deque<float>>>& m)
{
    vector<deque<float>>& v = m[1];
}

void func2(map<int, vector<deque<float>>>& m)
{
    auto v = m[1];
}

int main()
{
    map<int, vector<deque<float>>> m;
    m[1].push_back(deque<float>(1000, 1));
    clock_t begin = clock();
    for (int i = 0; i < 100000; ++i) func1(m);
    cout << "100000 x func1: " << (((double)(clock() - begin)) / CLOCKS_PER_SEC) << " sec." << endl;
    begin = clock();
    for (int i = 0; i < 100000; ++i) func2(m);
    cout << "100000 x func2: " << (((double)(clock() - begin)) / CLOCKS_PER_SEC) << " sec." << endl;
}
The output I get on my i7 / Win7 machine (Release mode; VS2010) is:
100000 x func1: 0.001 sec.
100000 x func2: 3.484 sec.
Can anyone explain why using auto results in such different execution times?
Obviously, there is a simple workaround, i.e., stop using auto altogether, but I hope there is a better way to overcome this issue.
You are copying the vector to v.
Try this instead to create a reference
auto& v = ...
As Bo said, you have to use auto& instead of auto (note that there is also auto* for other cases). Here is an updated version of your code:
#include <functional>
#include <iostream>
#include <map>
#include <vector>
#include <deque>
#include <time.h>

using namespace std;

typedef map<int, vector<deque<float>>> FooType; // this should have a meaningful name

void func1(FooType& m)
{
    vector<deque<float>>& v = m[1];
}

void func2(FooType& m)
{
    auto v = m[1];
}

void func3(FooType& m)
{
    auto& v = m[1];
}

void measure_time(std::function<void(FooType&)> func, FooType& m)
{
    clock_t begin = clock();
    for (int i = 0; i < 100000; ++i) func(m);
    cout << "100000 x func: " << (((double)(clock() - begin)) / CLOCKS_PER_SEC) << " sec." << endl;
}

int main()
{
    FooType m;
    m[1].push_back(deque<float>(1000, 1));
    measure_time(func1, m);
    measure_time(func2, m);
    measure_time(func3, m);
}
On my computer, it gives the following output:
100000 x func: 0 sec.
100000 x func: 3.136 sec.
100000 x func: 0 sec.
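The deduction difference behind those numbers can be made explicit with a couple of static_asserts (my illustration, using the same types as above):

#include <deque>
#include <map>
#include <type_traits>
#include <vector>

int main()
{
    std::map<int, std::vector<std::deque<float>>> m;
    auto v1 = m[1];  // deduces vector<deque<float>>: a full copy on every call
    auto& v2 = m[1]; // deduces vector<deque<float>>&: no copy at all
    static_assert(!std::is_reference<decltype(v1)>::value, "v1 is a copy");
    static_assert(std::is_reference<decltype(v2)>::value, "v2 is a reference");
    (void)v1; (void)v2;
}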