I'm trying to find the number of prime numbers below 400 million, but even with just 40 million my code takes 8 seconds to run. What am I doing wrong?
What can I do to make it faster?
#include<iostream>
#include<math.h>
#include<vector>
using namespace std;
int main()
{
vector<bool> k;
vector<long long int> c;
for (int i=2;i<40000000;i++)
{
k.push_back(true);
c.push_back(i);
}
for ( int i=0;i<sqrt(40000000)+1;i++)
{
if (k[i]==true)
{
for (int j=i+c[i];j<40000000;j=j+c[i])
{
k[j]=false;
}
}
}
vector <long long int> arr;
for ( int i=0;i<40000000-2;i++)
{
if (k[i]==true)
{
arr.push_back(c[i]);
}
}
cout << arr.size() << endl ;
return 0;
}
I profiled your code as well as a simple tweak, below. The tweak is more than twice as fast:
auto start = std::chrono::high_resolution_clock::now();
//original version
vector<bool> k;
vector<long long int> c;
for (int i=2;i<40000000;i++)
{
k.push_back(true);
c.push_back(i);
}
for ( int i=0;i<sqrt(40000000)+1;i++)
{
if (k[i]==true)
{
for (int j=i+c[i];j<40000000;j=j+c[i])
{
k[j]=false;
}
}
}
vector <long long int> arr;
for ( int i=0;i<40000000-2;i++)
{
if (k[i]==true)
{
arr.push_back(c[i]);
}
}
cout << arr.size() << endl ;
auto end1 = std::chrono::high_resolution_clock::now();
std::cout << "Elapsed = " <<
std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start).count() <<
std::endl;
}
{
auto begin = std::chrono::high_resolution_clock::now();
//new version
const long limit{40000000};
vector<bool> k(limit,true); //one flag per number 0..limit-1, so k[j] with j<limit stays in bounds
//k[0] is the number 0
k[0]=false; k[1]=false;
auto sq = sqrt(limit) + 1;
//start at the number 2
for ( int i=2;i<sq;i++)
{
if (k[i]==true)
{
for (int j=i+i;j<limit;j+=i)
{
k[j]=false;
}
}
}
vector <long long int> arr;
for ( int i=0;i<limit-2;i++)
{
if (k[i]==true)
{
arr.push_back(i);
}
}
cout << arr.size() << endl ;
auto stop = std::chrono::high_resolution_clock::now();
std::cout << "Elapsed = " <<
std::chrono::duration_cast<std::chrono::milliseconds>(stop - begin).count() <<
std::endl;
}
Here is the output (elapsed in milliseconds), in Debug mode:
2433654
Elapsed = 5787
2433654
Elapsed = 2432
Both produce the same result; the second is much faster.
Here is another version using some nice C++ features (requiring less code), and it is about 11% faster than the second version above:
auto begin = std::chrono::high_resolution_clock::now();
const long limit{40000000};
vector<int> k(limit,0); //one slot per number 0..limit-1, so k[j] with j<limit stays in bounds
//fill with sequence of integers
std::iota(k.begin(),k.end(),0);
//k[0] is the number 0
//integers reset to 0 are not prime
k[0]=0; k[1]=0;
auto sq = sqrt(limit) + 1;
//start at the number 2
for (int i=2;i<sq;i++)
{
if (k[i])
{
for (int j=i+i;j<limit;j+=i)
{
k[j]=0;
}
}
}
auto results = std::remove(k.begin(),k.end(),0);
cout << results - k.begin() << endl ;
auto stop = std::chrono::high_resolution_clock::now();
std::cout << "Elapsed = " <<
std::chrono::duration_cast<std::chrono::milliseconds>(stop - begin).count() <<
std::endl;
}
Note that in your original version, you push_back in three different places, while this use of modern idioms never uses push_back at all when operating on the vectors.
In this example, the vector is of ints so that you have the actual list of prime numbers when you are finished.
Output:
2433654
Elapsed = 2160
All of the numbers above are from Debug mode.
In Release mode, the best is a combination of the second and third techniques above: a standard algorithm (std::remove, going by the timing label below) applied to a vector of bools. This works if you don't care what the actual prime numbers are in the end:
2433654
Elapsed = 1098
2433654
Elapsed bool remove= 410
2433654
Elapsed = 779
Note that your original code only takes about 1 second on my 5 year-old laptop in Release mode, so you are probably running in Debug mode.
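The combined variant is not shown above, so here is a minimal sketch of one way it could look. It is an assumption, not the code that produced the timings: the function name count_primes_below is made up, the inner loop starts at i*i rather than i+i, and std::count is substituted for std::remove because only the number of primes is needed.

```cpp
#include <algorithm>
#include <cassert>   // only needed for the sanity check mentioned below
#include <cstddef>
#include <vector>

// Counts the primes strictly below `limit` with a sieve of Eratosthenes.
// Hypothetical helper; the name and signature are not from the answer above.
std::ptrdiff_t count_primes_below(std::size_t limit)
{
    if (limit < 3) return 0;                  // there are no primes below 2
    std::vector<bool> is_prime(limit, true);  // index i stands for the number i
    is_prime[0] = is_prime[1] = false;
    for (std::size_t i = 2; i * i < limit; ++i) {
        if (is_prime[i]) {
            // Smaller multiples of i were already cleared by smaller primes.
            for (std::size_t j = i * i; j < limit; j += i)
                is_prime[j] = false;
        }
    }
    // No second vector and no push_back: just count the flags still set.
    return std::count(is_prime.begin(), is_prime.end(), true);
}
```

A quick sanity check is assert(count_primes_below(100) == 25). Calling it with 40000000 should reproduce the count printed above, though the timing will depend on the machine and the build mode.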
I got it down from taking 10 seconds to run to just half a second on my computer by changing two things. First, I'm guessing you didn't compile it with optimization enabled. That brought it from 10 seconds down to 1 second for me. Second, the vector c is unnecessary. Everywhere you have c[i] in your code you can replace it with i+2. This will make it run twice as fast.
Remove vector c, you don't need it.
Create vector k with known size at start. Repeatedly appending elements to a vector by invoking push_back() is a really bad idea from a performance point of view, as it can cause repeated memory reallocations and copies.
http://primesieve.org/segmented_sieve.html - segmented version for inspiration.
You can skip processing multiples of 2 and 3 (link from Code Review).
It looks like there is an issue with your compiler optimization flags. Maybe you didn't switch the configuration from Debug to Release. What speedup do you get in Release compared to Debug?
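The suggestions above (drop vector c, allocate the flag vector once, skip even numbers) can be sketched together as follows. This is an illustration under assumptions rather than anyone's posted code: the name count_primes_odd_only is hypothetical, and only multiples of 2 are skipped, not multiples of 3.

```cpp
#include <cassert>   // only needed for the sanity check mentioned below
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts primes strictly below `limit`, keeping flags for odd numbers only.
// composite[k] refers to the odd number 2*k + 1. Hypothetical helper name.
std::size_t count_primes_odd_only(std::size_t limit)
{
    if (limit < 3) return 0;                   // there are no primes below 2
    const std::size_t half = limit / 2;        // how many odd numbers are below limit
    std::vector<bool> composite(half, false);  // sized once, no push_back
    std::size_t count = 1;                     // accounts for the prime 2
    for (std::size_t k = 1; k < half; ++k) {   // k = 1 is the number 3
        if (composite[k]) continue;
        ++count;
        const std::uint64_t p = 2 * static_cast<std::uint64_t>(k) + 1;
        // Start at p*p and step by 2p so that only odd multiples are visited.
        for (std::uint64_t m = p * p; m < limit; m += 2 * p)
            composite[static_cast<std::size_t>(m / 2)] = true;
    }
    return count;
}
```

Storing only odd numbers halves the memory, which tends to matter more than the saved loop iterations once the flags no longer fit in cache. A quick sanity check is assert(count_primes_odd_only(100) == 25).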
I am trying to implement the Shell sort algorithm myself. I wrote my own code without looking at any code samples; I only watched a video describing the algorithm.
My sort works but is very slow (bubble sort, 100 items: 0.007 s; Shell sort, 100 items: 4.83 s). How can I improve it?
void print(vector<float>vec)
{
for (float i : vec)
cout << i << " ";
cout << "\n\n";
}
void Shell_sorting(vector<float>&values)
{
int swapping = 0;
int step = values.size();
clock_t start;
double duration;
start = clock();
while (step/2 >= 1)
{
step /= 2;
for (int i = 0; i < values.size()-step; i++)
{
if ((i + step < values.size()))
{
if ((values[i + step] < values[i]))
{
swap(values[i], values[i + step]);
print(values);
++swapping;
int c = i;
while (c - step > 0)
{
if (values[c] < values[c - step])
{
swap(values[c], values[c - step]);
print(values);
++swapping;
c -= step;
}
else
break;
}
}
}
else
break;
}
}
duration = (clock() - start) / (double)CLOCKS_PER_SEC;
print(values);
cout << swapping << " " << duration;
print(values);
}
A better implementation could be:
#include <iostream>
#include <vector>
int main()
{
std::vector<int> vec = {
726,621,81,719,167,958,607,130,263,108,
134,235,508,407,153,162,849,923,996,975,
250,78,460,667,654,62,865,973,477,912,
580,996,156,615,542,655,240,847,613,497,
274,241,398,84,436,803,138,677,470,606,
226,593,620,396,460,448,198,958,566,599,
762,248,461,191,933,805,288,185,21,340,
458,592,703,303,509,55,190,318,310,189,
780,923,933,546,816,627,47,377,253,709,
992,421,587,768,908,261,946,75,682,948,
};
std::vector<int> gaps = {5, 2, 1};
int j;
for (int gap : gaps) {
for (int i = gap; i < vec.size(); i++)
{
j = i-gap;
while (j >= 0) {
if (vec[j+gap] < vec[j])
{
int temp = vec[j+gap];
vec[j+gap] = vec[j];
vec[j] = temp;
j = j-gap;
}
else break;
}
}
}
for (int item : vec) std::cout << item << " " << std::endl;
return 0;
}
I prefer to store the gap sequence in a vector so that you do not need to compute the division (which is an expensive operation). Besides, this choice gives your code more flexibility.
The outer loop iterates over the gap values. Once the gap is chosen, you iterate over your vector, starting from vec[gap], and check whether there are elements smaller than it, according to the logic of Shell sort.
So you start by setting j=i-gap and test the if condition. If it is true, swap the items and then repeat the while loop after decrementing j by gap. Note: vec[j+gap] is the element that was swapped in the previous loop iteration. If the condition is false, there's no reason to continue in the loop, so you can exit from it with a break.
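The logic described above can also be packaged as a reusable function. The sketch below is a variation, not the answer's exact code: the name shell_sort and the choice to pass the gap sequence as a parameter are assumptions, and it uses unsigned indices to avoid the signed/unsigned comparison with vec.size().

```cpp
#include <cassert>   // only needed for the sanity check mentioned below
#include <cstddef>
#include <utility>
#include <vector>

// Gapped insertion sort, run once per gap in `gaps`.
// The last gap is expected to be 1 so that the final pass fully sorts `vec`.
void shell_sort(std::vector<int>& vec, const std::vector<std::size_t>& gaps)
{
    for (std::size_t gap : gaps) {
        for (std::size_t i = gap; i < vec.size(); ++i) {
            // Move the element at i backwards in steps of `gap` while it is
            // smaller than its predecessor; stop as soon as it is in place.
            for (std::size_t j = i; j >= gap && vec[j] < vec[j - gap]; j -= gap)
                std::swap(vec[j], vec[j - gap]);
        }
    }
}
```

Called as shell_sort(vec, {5, 2, 1}), it should sort the same data as the program above. Note that nothing is printed inside the loops, which is what makes timing comparisons meaningful.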
On my machine, it took 0.002s calculated using the time shell command (the time includes the process of printing numbers).
P.S. To generate all those numbers and write them into the array, since I'm too lazy to write a random function, I used this link and then edited the output in the shell with:
sed -e 's/[[:space:]]/,/g' num | sed -e 's/$/,/'
I'm trying to understand possible optimization methods for the bubble sort algorithm. I know there are better sorting methods, but I'm just curious.
To test the efficiency I'm using std::chrono. The program sorts a 10000 number long int array 30 times and prints the average sorting time. The numbers are picked randomly(up to 10000) in every iteration. Here is the code, with no optimization:
#include <iostream>
#include <ctime>
#include <chrono>
using namespace std;
int main() {
//bubble sort
srand(time(NULL));
chrono::time_point<chrono::steady_clock> start, end;
const int n = 10000;
int i,j, last, tests = 30,arr[n];
long long total = 0;
bool out;
while (tests-->0) {
for (i = 0; i < n; i++) {
arr[i] = rand() % 1000;
}
j = n;
start = chrono::high_resolution_clock::now();
while(1){
out = 0;
for (i = 0; i < j - 1; i++) {
if (arr[i + 1] < arr[i]) {
swap(arr[i + 1], arr[i]);
out = 1;
}
}
if (!out) {
break;
}
//j--;
}
end = chrono::high_resolution_clock::now();
total += chrono::duration_cast<chrono::nanoseconds>(end - start).count();
cout << "Remaining :"<<tests << endl;
}
cout << "Average :" << total / static_cast<double>(30)/1000000000<<" seconds"; // tests(30) + nanosec -> sec
cin.sync();
cin.ignore();
return 0;
}
I get 0.17 seconds average sorting time.
If I uncomment the j--; line to avoid comparing numbers that are already sorted, I get an average sorting time of 0.12 seconds, which is understandable.
If I remember the last position where a swap took place, I know that after that index, elements are sorted, and can thus sort up to that position in further iterations. It's better explained in the second part of this post: https://stackoverflow.com/a/16196115/1967496.
This is the code that implements the new possible optimization:
#include <iostream>
#include <ctime>
#include <chrono>
using namespace std;
int main() {
//bubble sort
srand(time(NULL));
chrono::time_point<chrono::steady_clock> start, end;
const int n = 10000;
int i,j, last, tests = 30,arr[n];
long long total = 0;
bool out;
while (tests-->0) {
for (i = 0; i < n; i++) {
arr[i] = rand() % 1000;
}
j = n;
start = chrono::high_resolution_clock::now();
while(1){
out = 0;
for (i = 0; i < j - 1; i++) {
if (arr[i + 1] < arr[i]) {
swap(arr[i + 1], arr[i]);
out = 1;
last = i;
}
}
if (!out) {
break;
}
j = last + 1;
}
end = chrono::high_resolution_clock::now();
total += chrono::duration_cast<chrono::nanoseconds>(end - start).count();
cout << "Remaining :"<<tests << endl;
}
cout << "Average :" << total / static_cast<double>(30)/1000000000<<" seconds"; // tests(30) + nanosec -> sec
cin.sync();
cin.ignore();
return 0;
}
Note the two changed lines, last = i; and j = last + 1;. And here comes the problem: the average time is now back to around 0.17 seconds.
Is there a problem in my code, or am I missing something ?
Update:
I did sorting with 10 times more numbers and get now following results:
No optimization: 19.3 seconds
First optimization(j--): 14.5 seconds
Second (supposed) optimization(j=last+1): 17.4 seconds;
From my understanding, the second method should be in any case better than the first, but the numbers tell something else.
Well... The problem is that there might not be a right or wrong answer to this question.
First of all, when you're comparing only 10000 elements, you cannot really call it an efficiency test. Try a much higher number of elements, maybe 500000 (although you will probably need to allocate the array dynamically for that).
Second of all, it might be the compiler. Compilers often try to optimize things so that the program runs smoother and faster.
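To make the "allocate dynamically and test with more elements" suggestion concrete, here is a minimal sketch. The names bubble_sort_last_swap and time_sort_seconds are hypothetical, std::vector stands in for the fixed-size array, and the random number setup (std::mt19937, values 0 to 9999) is an assumption chosen for repeatability, not taken from the question.

```cpp
#include <cassert>   // only needed for the sanity check mentioned below
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Bubble sort that shrinks the scanned range to the position of the last swap.
void bubble_sort_last_swap(std::vector<int>& a)
{
    std::size_t end = a.size();      // elements at index >= end are in final position
    while (end > 1) {
        std::size_t last = 0;        // stays 0 if this pass performs no swap
        for (std::size_t i = 0; i + 1 < end; ++i) {
            if (a[i + 1] < a[i]) {
                std::swap(a[i], a[i + 1]);
                last = i + 1;        // everything from here on is already sorted
            }
        }
        end = last;
    }
}

// Sorts `n` pseudo-random values and returns the elapsed time in seconds.
double time_sort_seconds(std::size_t n, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 9999);
    std::vector<int> data(n);        // heap allocated, so n can be large
    for (int& x : data) x = dist(gen);
    const auto start = std::chrono::steady_clock::now();
    bubble_sort_last_swap(data);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Using the same seed for every variant means each one sorts identical data, which removes one source of noise from the comparison. Bubble sort is quadratic, so a call such as time_sort_seconds(500000, 1) may take many minutes; start with smaller sizes and build with optimization enabled.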
The typical Towers of Hanoi Problem Solver would be the following:
void hanoi(int diskNumber , int start, int temp, int finish)
{
if(diskNumber == 1)
{
cout<< " Move Disk " << diskNumber<<" from " << start <<" to "<< finish<<endl;
}
else
{
hanoi(diskNumber-1,start,temp,finish);
cout<<"Move Disk from " << start <<" to "<<finish<<endl;
hanoi(diskNumber - 1,temp,start,finish);
}
}
But what I want to do is calculating the time that the algorithm runs. thus:
int main()
{
//Hanoi:
cout<<"Hanoi Tower Problem:"<<endl;
//Note: the parentheses matter; without them only htimer is divided by CLOCKS_PER_SEC.
//3 Disks:
clock_t htimer3 = clock();
hanoi(3, 1,2,3);
cout<<"CPU Time for n = 3 is: "
<<static_cast<double>(clock() - htimer3)/CLOCKS_PER_SEC<<endl;
//6 Disks:
clock_t htimer6 = clock();
hanoi(6, 1,2,3);
cout<<"CPU Time for n = 6 is: "
<<static_cast<double>(clock() - htimer6)/CLOCKS_PER_SEC<<endl;
//9 Disks:
clock_t htimer9 = clock();
hanoi(9, 1,2,3);
cout<<"CPU Time for n = 9 is: "
<<static_cast<double>(clock() - htimer9)/CLOCKS_PER_SEC<<endl;
//12 Disks:
clock_t htimer12 = clock();
hanoi(12, 1,2,3);
cout<<"CPU Time for n = 12 is: "
<<static_cast<double>(clock() - htimer12)/CLOCKS_PER_SEC<<endl;
//15 Disks:
clock_t htimer15 = clock();
hanoi(15, 1,2,3);
cout<<"CPU Time for n = 15 is: "
<<static_cast<double>(clock() - htimer15)/CLOCKS_PER_SEC<<endl;
//End of Hanoi Tower Problem
return 0;
}
The problem here is that if, for example, I set diskNumber = 15, the code is going to run 32767 times, which will fill the terminal window, and I'll lose the lines generated before it. (I have to time some other algorithms as well, such as bubble sort and quicksort. I'm going to use the numbers to draw a chart later to represent their Big O, e.g. the Big O of the Towers of Hanoi algorithm is 2^n.)
To solve this problem I modified the code:
void hanoi(int diskSize, int start, int finish, int temp)
{
if(diskSize == 1)
{
return;
}
else
{
hanoi(diskSize - 1, start, temp, finish);
hanoi(diskSize - 1, temp, finish, start);
}
}
My main question: does the modified code take the same running time as the original algorithm? If not, what should I do? Any suggestions?
Yes, the time complexity of your modified code is the same as that of the previous one, since cout takes constant time. So your running time will not be affected much (the granularity would be on the order of nanoseconds), considering the stream you're writing to.
I would recommend redirecting the output to a file.
For example:
./executable > FileName
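If the goal is only to chart how the running time grows with the number of disks, another option is to count moves instead of printing them. The sketch below is an assumption about how that could be done: the names hanoi_moves and time_hanoi_seconds are hypothetical, and std::chrono::steady_clock replaces clock().

```cpp
#include <cassert>   // only needed for the sanity check mentioned below
#include <chrono>
#include <cstdint>

// Solves Towers of Hanoi without printing and returns the number of moves.
// Pegs are identified by integers, as in the question.
std::uint64_t hanoi_moves(int disks, int from, int to, int via)
{
    if (disks == 0) return 0;
    // Move the top disks-1 disks out of the way, onto the spare peg.
    std::uint64_t moves = hanoi_moves(disks - 1, from, via, to);
    ++moves;  // move the largest disk from `from` to `to`
    // Bring the disks-1 disks from the spare peg onto the largest disk.
    moves += hanoi_moves(disks - 1, via, to, from);
    return moves;
}

// Returns the wall-clock time in seconds for solving with `disks` disks.
double time_hanoi_seconds(int disks)
{
    const auto start = std::chrono::steady_clock::now();
    // volatile keeps the compiler from discarding the call as unused work.
    volatile std::uint64_t moves = hanoi_moves(disks, 1, 3, 2);
    const auto stop = std::chrono::steady_clock::now();
    (void)moves;
    return std::chrono::duration<double>(stop - start).count();
}
```

For n disks the function returns 2^n - 1, so hanoi_moves(15, 1, 3, 2) gives the 32767 mentioned in the question. For small n the measured time may be too close to the clock resolution to be meaningful, so larger n or repeated runs are likely needed for a usable chart.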
Using VexCL in C++, I am trying to count all values in a vector above a certain minimum, and I would like to perform this count on the device. The default Reductors only provide methods for MIN, MAX and SUM, and the examples do not show very clearly how to perform such an operation. This code is slow, as it is probably executed on the host instead of the device:
int amount = 0;
int minimum = 5;
for (vex::vector<int>::iterator i = vector.begin(); i != vector.end(); ++i)
{
if (*i >= minimum)
{
amount++;
}
}
The vector I am using will consist of a large number of values, say millions, most of them zeros. Besides the number of values that are above the minimum, I would also like to retrieve a list of the vector IDs that contain these values. Is this possible?
If you only needed to count elements above the minimum, this would be as simple as
vex::Reductor<int, vex::SUM> sum(ctx);
int amount = sum( vec >= minimum );
The vec >= minimum expression results in a sequence of ones and zeros, and sum then counts ones.
Now, since you also need to get the positions of the elements above the minimum, it gets a bit more complicated:
#include <iostream>
#include <vexcl/vexcl.hpp>
int main() {
vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
// Input vector
vex::vector<int> vec(ctx, {1, 3, 5, 2, 6, 8, 0, 2, 4, 7});
int n = vec.size();
int minimum = 5;
// Put result of (vec >= minimum) into key, and element indices into pos:
vex::vector<int> key(ctx, n);
vex::vector<int> pos(ctx, n);
key = (vec >= minimum);
pos = vex::element_index();
// Get number of interesting elements in vec.
vex::Reductor<int, vex::SUM> sum(ctx);
int amount = sum(key);
// Sort pos by key in descending order.
vex::sort_by_key(key, pos, vex::greater<int>());
// First 'amount' of elements in pos now hold indices of interesting
// elements. Lets use slicer to extract them:
vex::vector<int> indices(ctx, amount);
vex::slicer<1> slice(vex::extents[n]);
indices = slice[vex::range(0, amount)](pos);
std::cout << "indices: " << indices << std::endl;
}
This gives the following output:
indices: {
0: 2 4 5 9
}
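For cross-checking the device result on small inputs, a plain host-side reference can help. The sketch below uses only the C++ standard library, not VexCL, and the name indices_at_least is hypothetical.

```cpp
#include <cassert>   // only needed for the sanity check mentioned below
#include <cstddef>
#include <vector>

// Host-side reference: indices of all elements >= minimum, in ascending order.
std::vector<std::size_t> indices_at_least(const std::vector<int>& values, int minimum)
{
    std::vector<std::size_t> indices;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] >= minimum)
            indices.push_back(i);
    }
    return indices;  // indices.size() is the count of matching elements
}
```

For the input {1, 3, 5, 2, 6, 8, 0, 2, 4, 7} and a minimum of 5, it returns 2, 4, 5, 9, matching the output above. The order of equal keys after a device-side sort may not be guaranteed, so it is safer to sort both index lists before comparing them.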
@ddemidov
Thanks for your help, it is working. However, it is much slower than my original code which copies the device vector to the host and sorts using Boost. Below is the sample code with some timings:
#include <iostream>
#include <cstdio>
#include <vexcl/vexcl.hpp>
#include <vector>
#include <boost/range/algorithm.hpp>
int main()
{
clock_t start, end;
// initialize vector with random numbers
std::vector<int> hostVector(1000000);
for (int i = 0; i < hostVector.size(); ++i)
{
hostVector[i] = rand() % 20 + 1;
}
// copy to device
vex::Context cpu(vex::Filter::Type(CL_DEVICE_TYPE_CPU) && vex::Filter::Any);
vex::Context gpu(vex::Filter::Type(CL_DEVICE_TYPE_GPU) && vex::Filter::Any);
vex::vector<int> vectorCPU(cpu, 1000000);
vex::vector<int> vectorGPU(gpu, 1000000);
copy(hostVector, vectorCPU);
copy(hostVector, vectorGPU);
// sort results on CPU
start = clock();
boost::sort(hostVector);
end = clock();
std::cout << "C++: " << (end - start) / (CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
// sort results on OpenCL
start = clock();
vex::sort(vectorCPU, vex::greater<int>());
end = clock();
std::cout << "vexcl CPU: " << (end - start) / (CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
start = clock();
vex::sort(vectorGPU, vex::greater<int>());
end = clock();
std::cout << "vexcl GPU: " << (end - start) / (CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
return 0;
}
which results in:
C++: 17 ms
vexcl CPU: 737 ms
vexcl GPU: 1670 ms
using an i7 3770 CPU and a (slow) HD4650 graphics card. As I've read, OpenCL should be able to sort large vectors quickly. Do you have any advice on how to perform a fast sort using OpenCL and VexCL?
I recently became very interested in prime numbers and tried making programs to calculate them. I was able to make a Sieve of Sundaram program that could calculate a million prime numbers in a couple of seconds. I believe that's pretty fast, but I wanted better. I went on to try a Sieve of Atkin and slapped together working C++ code in 20 minutes after copying the pseudocode from Wikipedia.
I knew that it wouldn't be perfect because, after all, it's pseudocode. I was expecting at least better times than my Sundaram sieve, though, but I was so wrong. It's very, very slow. I have looked it over many times but I cannot find any significant changes that could be made. When looking at my code, remember: I know it's inefficient, I know I used system commands, I know it's all over the place, but this isn't a project or anything important, it's for me.
#include <iostream>
#include <fstream>
#include <time.h>
#include <Windows.h>
#include <vector>
using namespace std;
int main(){
float limit;
float slimit;
long int n;
int counter = 0;
int squarenum;
int starttime;
int endtime;
vector <bool> primes;
ofstream save;
save.open("primes.txt");
save.clear();
cout << "Find all primes up to: " << endl;
cin >> limit;
slimit = sqrt(limit);
primes.resize(limit);
starttime = time(0);
// sets all values to false
for (int i = 0; i < limit; i++){
primes[i] = false;
}
//puts in possible primes
for (int x = 1; x <= slimit; x++){
for (int y = 1; y <= slimit; y++){
n = (4*x*x) + (y*y);
if (n <= limit && (n%12 == 1 || n%12 == 5)){
primes[n] = !primes[n];
}
n = (3*x*x) + (y*y);
if (n <= limit && n% 12 == 7){
primes[n] = !primes[n];
}
n = (3*x*x) - (y*y);
if ( x > y && n <= limit && n%12 == 11){
primes[n] = !primes[n];
}
}
}
//square number mark all multiples not prime
for (float i = 5; i < slimit; i++){
if (primes[i] == true){
for (long int k = i*i; k < limit; k = k + (i*i)){
primes[k] = false;
}
}
}
endtime = time(0);
cout << endl << "Calculations complete, saving in text document" << endl;
// loads to document
for (int i = 0 ; i < limit ; i++){
if (primes[i] == true){
save << counter << ") " << i << endl;
counter++;
}
}
save << "Found in " << endtime - starttime << " seconds" << endl;
save.close();
system("primes.txt");
system ("Pause");
return 0;
}
This isn't exactly an answer (IMO, you've already gotten an answer in the comments), but a quick standard for comparison. A sieve of Eratosthenes should find a million primes in well under a second on a reasonably modern machine.
#include <vector>
#include <iostream>
#include <locale>
#include <time.h>
unsigned long primes = 0;
int main() {
// empirically derived limit to get 1,000,000 primes
int number = 15485865;
clock_t start = clock();
std::vector<bool> sieve(number,false);
sieve[0] = sieve[1] = true;
for(int i = 2; i<number; i++) {
if(!sieve[i]) {
++primes;
for (int temp = 2*i; temp<number; temp += i)
sieve[temp] = true;
}
}
clock_t stop = clock();
std::cout.imbue(std::locale(""));
std::cout << "Total primes: " << primes << "\n";
std::cout << "Time: " << double(stop - start) / CLOCKS_PER_SEC << " seconds\n";
return 0;
}
Running this on my laptop, I get a result of:
Total primes: 1000000
Time: 0.106 seconds
Obviously, speed will vary somewhat with processor, clock speed, etc., but with anything reasonably modern, I'd still expect a time of less than a second. Of course, if you decide to write the primes out to a file, you can expect that to add some time, but even with that I'd expect a total time under a second--with my laptop's relatively slow hard drive, writing out the numbers only gets the total up to about 0.6 seconds.
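For comparison with the question's code, here is a sketch of the same Wikipedia-style Sieve of Atkin written with integer arithmetic only. It is a plain transcription of the textbook pseudocode under a few assumptions, not a tuned implementation: the names atkin_sieve and count_primes_atkin are hypothetical, the flag array has limit + 1 entries so that index limit is valid, and 2 and 3 are added explicitly.

```cpp
#include <cassert>   // only needed for the sanity check mentioned below
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns flags where flags[n] != 0 means n is prime, for 0 <= n <= limit.
std::vector<char> atkin_sieve(std::uint64_t limit)
{
    std::vector<char> is_prime(static_cast<std::size_t>(limit) + 1, 0);
    // Toggle candidates according to the three quadratic forms.
    for (std::uint64_t x = 1; x * x <= limit; ++x) {
        for (std::uint64_t y = 1; y * y <= limit; ++y) {
            std::uint64_t n = 4 * x * x + y * y;
            if (n <= limit && (n % 12 == 1 || n % 12 == 5)) is_prime[n] ^= 1;
            n = 3 * x * x + y * y;
            if (n <= limit && n % 12 == 7) is_prime[n] ^= 1;
            if (x > y) {  // keeps 3x^2 - y^2 positive for unsigned arithmetic
                n = 3 * x * x - y * y;
                if (n <= limit && n % 12 == 11) is_prime[n] ^= 1;
            }
        }
    }
    // Remove numbers that are divisible by the square of a prime.
    for (std::uint64_t r = 5; r * r <= limit; ++r) {
        if (is_prime[r]) {
            for (std::uint64_t k = r * r; k <= limit; k += r * r)
                is_prime[k] = 0;
        }
    }
    if (limit >= 2) is_prime[2] = 1;  // 2 and 3 are not produced by the forms above
    if (limit >= 3) is_prime[3] = 1;
    return is_prime;
}

// Counts the primes in [0, limit] using the sieve above.
std::size_t count_primes_atkin(std::uint64_t limit)
{
    std::size_t count = 0;
    for (char flag : atkin_sieve(limit))
        if (flag) ++count;
    return count;
}
```

Integer loop variables avoid the float-to-index conversions in the question's code. Even so, this straightforward form is not expected to beat a simple sieve of Eratosthenes like the one above; it mainly serves as a correct baseline. A quick sanity check is assert(count_primes_atkin(100) == 25).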
vector<bool> is a bitset. It is expensive to update bitset values that are not in cache. Try vector<char>, it is much cheaper to write to.
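One way to try this suggestion without duplicating code is to make the flag container a template parameter. The sketch below is an assumption about how such a comparison could be set up; the name count_primes_with is hypothetical.

```cpp
#include <cassert>   // only needed for the sanity check mentioned below
#include <cstddef>
#include <vector>

// Counts primes strictly below `limit` using the given flag container type,
// for example std::vector<bool> (bit-packed) or std::vector<char> (one byte each).
template <typename Flags>
std::size_t count_primes_with(std::size_t limit)
{
    if (limit < 3) return 0;           // there are no primes below 2
    Flags composite(limit, false);     // index i stands for the number i
    std::size_t count = 0;
    for (std::size_t i = 2; i < limit; ++i) {
        if (composite[i]) continue;
        ++count;
        for (std::size_t j = 2 * i; j < limit; j += i)
            composite[j] = true;       // a read-modify-write on a bit for vector<bool>
    }
    return count;
}
```

Timing count_primes_with<std::vector<bool>>(n) against count_primes_with<std::vector<char>>(n) on the target machine shows which one wins there. The outcome is not guaranteed either way: vector<char> writes are simpler, but vector<bool> uses an eighth of the memory, which can matter once the flags exceed the cache size.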