Least run time with threads - c++

I have some functions that check for prime numbers in a given range and output the results to a text file. The function also requires the user to input how many threads the function will use.
#include "threads.h"
#include <thread>
#include <vector>
#include <ctime>
bool isPrime(int n)
{
    if (n < 2)
        return false;
    if (n == 2)
        return true;
    int i = 2;
    while (i < n)
        if (n % i++ == 0)
            return false;
    return true;
}
void writePrimesToFile(int begin, int end, ofstream& file)
{
    // only odd candidates are tested
    for (int i = (begin % 2 == 0 ? begin + 1 : begin); i <= end; i += 2)
        if (isPrime(i))
            file << i << endl;
}
void callWritePrimesMultipleThreads(int begin, int end, string filePath, int N)
{
    ofstream file(filePath);
    if (file.is_open())
    {
        clock_t time = clock();
        vector<thread> threads;
        for (int i = 0; i < N; i++)
        {
            // bounds are offset by begin and passed to the thread by value
            int _begin = begin + ((end - begin) / N) * i,
                _end = begin + ((end - begin) / N) * (i + 1);
            threads.emplace_back(writePrimesToFile, _begin, _end, ref(file));
        }
        // every thread must be joined before it is destroyed
        for (thread& t : threads)
            t.join();
        cout << "Time elapsed: " << (double(clock() - time) / CLOCKS_PER_SEC) << endl;
    }
    else
        cout << "Can't open file!" << endl;
}
Here's the main:
#include "threads.h"
int main()
{
    callWritePrimesMultipleThreads(0, 1000, "primes.txt", 10);
    return 0;
}
I want to know which value of N (the number of threads) makes the program run fastest.

A good solution to this problem is to start N threads, each running the same function, which atomically increments a shared counter (for thread safety, e.g. with InterlockedIncrement on Windows or std::atomic in portable C++) and checks whether the value it obtained is prime.
If you do this and set N to the number of cores in your machine, you can keep every core busy (close to 100% CPU utilization) with very few context switches.
Regarding the output, notice that you can't safely write to one output file from several threads without locking (which you want to avoid). Instead, I would collect the prime numbers in vectors (one vector per thread), then merge them into one vector, sort it, and write it to the file.
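A minimal sketch of the approach described above, assuming std::atomic in place of InterlockedIncrement so that it is not tied to Windows. The names findPrimes and isPrime are made up for illustration, and the trial-division test is only a placeholder:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Trial division; adequate for a sketch, not tuned for speed.
bool isPrime(int n)
{
    if (n < 2) return false;
    for (int i = 2; static_cast<long long>(i) * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

// Hypothetical helper: returns all primes in [begin, end], sorted.
std::vector<int> findPrimes(int begin, int end, unsigned threadCount)
{
    assert(threadCount > 0);
    std::atomic<int> next{begin};                      // shared counter
    std::vector<std::vector<int>> found(threadCount);  // one bucket per thread
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < threadCount; ++t)
        workers.emplace_back([&, t] {
            // fetch_add hands each candidate to exactly one thread
            for (int n = next.fetch_add(1); n <= end; n = next.fetch_add(1))
                if (isPrime(n)) found[t].push_back(n);
        });
    for (std::thread& w : workers) w.join();

    // Merge the per-thread buckets and sort once at the end.
    std::vector<int> primes;
    for (const std::vector<int>& bucket : found)
        primes.insert(primes.end(), bucket.begin(), bucket.end());
    std::sort(primes.begin(), primes.end());
    return primes;
}
```

A call such as findPrimes(0, 1000, 4) returns the sorted primes, which can then be written to the file from a single thread. std::thread::hardware_concurrency() is a reasonable starting value for the thread count, but it may return 0 when the core count is unknown, so a fallback is needed.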


Trying to create a multithreaded program to find the total primes from 0-100000000

Hello, I am trying to write a multithreaded C++ program using the POSIX thread library to find the number of prime numbers between 1 and 10,000,000 (10 million) and to measure how many microseconds it takes.
Creating and running my threads works fine, but I suspect there is an error in my Prime function when it determines whether a number is prime.
I keep receiving 78496 as my output, but I expect 664579. Below is my code. Any hints or pointers would be greatly appreciated.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <iostream>
#include <sys/time.h> //measure the execution time of the computations
using namespace std;
//The number of threads to be generated
#define NUMBER_OF_THREADS 4
void * Prime(void* index);
long numbers[4] = {250000, 500000, 750000, 1000000};
long start_numbers[4] = {1, 250001, 500001, 750001};
int thread_numbers[4] = {0, 1, 2, 3};
int main(){
    pthread_t tid[NUMBER_OF_THREADS];
    int tn;
    long sum = 0;
    timeval start_time, end_time;
    double start_time_microseconds, end_time_microseconds;
    gettimeofday(&start_time, NULL);
    start_time_microseconds = start_time.tv_sec * 1000000 + start_time.tv_usec;
    for(tn = 0; tn < NUMBER_OF_THREADS; tn++){
        if (pthread_create(&tid[tn], NULL, Prime, (void *) &thread_numbers[tn]) == -1 ) {
            perror("thread fail");
        }
    }
    long value[4];
    for(int i = 0; i < NUMBER_OF_THREADS; i++){
        if(pthread_join(tid[i],(void **) &value[i]) == 0){
            sum = sum + value[i]; //add four sums together
        }
        else{
            perror("Thread join failed");
        }
    }
    //get the end time in microseconds
    gettimeofday(&end_time, NULL);
    end_time_microseconds = end_time.tv_sec * 1000000 + end_time.tv_usec;
    //calculate the time passed
    double time_passed = end_time_microseconds - start_time_microseconds;
    cout << "Sum is: " << sum << endl;
    cout << "Running time is: " << time_passed << " microseconds" << endl;
}
//Prime function
void* Prime(void* index){
    int temp_index;
    temp_index = *((int*)index);
    long sum_t = 0;
    for(long i = start_numbers[temp_index]; i <= numbers[temp_index]; i++){
        for (int j=2; j*j <= i; j++){
            if (i % j == 0)
                break;
            else if (j+1 > sqrt(i)) {
                sum_t++;
            }
        }
    }
    cout << "Thread " << temp_index << " terminates" << endl;
    pthread_exit( (void*) sum_t);
}
This is because you used 10^6 instead of 10^7 as the upper limit.
I also added a corner case for the numbers 1, 2 and 3:
//Prime function
void* Prime(void* index){
    int temp_index;
    temp_index = *((int*)index);
    long sum_t = 0;
    for(long i = start_numbers[temp_index]; i <= numbers[temp_index]; i++){
        // Corner cases
        if (i <= 3){
            if (i > 1)
                sum_t++; // 2 and 3 are prime, 1 is not
            continue;
        }
        for (int j=2; j*j <= i; j++){
            if ((i % j == 0) || (i %( j+2))==0 )
                break;
            else if (j+1 > sqrt(i)) {
                sum_t++;
            }
        }
    }
    cout << "Thread " << temp_index << " terminates" << endl;
    pthread_exit( (void*) sum_t);
}
I tested your code with the correct limit and got the correct number of primes as output:
Thread 0 terminates
Thread 1 terminates
Thread 2 terminates
Thread 3 terminates
Sum is: 664579
Running time is: 4.69242e+07 microseconds
Thanks to #chux - Reinstate Monica for pointing this out
Besides dividing 10^7 rather than 10^6 among the threads, there are a few other small errors, and a number of optimizations could be made.
First of all, the start numbers could begin at 2:
long start_numbers[4] = {2, 2500001, 5000001, 7500001};
The sum_t++ in your code may not work on edge cases. It is better to use the following algorithm in the Prime function:
bool flag = false;
for(long i = start_numbers[temp_index]; i <= numbers[temp_index]; i++){
    flag = false;
    for (long j=2; j*j <= i; j++){
        if (i % j == 0 ){
            flag = true; // a divisor was found, so i is not prime
            break;
        }
    }
    if (!flag)
        sum_t++;
}
After these two changes I get the following result:
Thread 0 terminates
Thread 1 terminates
Thread 2 terminates
Thread 3 terminates
Sum is: 664579
Running time is: 6.62618e+06 microseconds
(Note: here j is declared as long, but int would work as well in this example, since the compiler I tested with uses a 32-bit int.)
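Since the original bug came from hard-coded range tables that did not match the intended limit, here is a sketch that derives each thread's bounds from the limit instead. It assumes std::thread rather than raw pthreads, and countPrimes and isPrime are hypothetical names, not part of the code above:

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

bool isPrime(long n)
{
    if (n < 2) return false;          // 0 and 1 are not prime
    for (long j = 2; j * j <= n; ++j)
        if (n % j == 0) return false;
    return true;                      // also covers 2 and 3
}

// Hypothetical helper: counts the primes in [1, limit] using `parts` threads.
long countPrimes(long limit, int parts)
{
    assert(parts > 0);
    std::vector<long> counts(parts, 0);   // one result slot per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < parts; ++t) {
        // Both bounds are computed from limit, so the chunks always cover
        // [1, limit] exactly once.
        long first = limit * t / parts + 1;
        long last = limit * (t + 1) / parts;
        workers.emplace_back([&counts, t, first, last] {
            long local = 0;
            for (long i = first; i <= last; ++i)
                if (isPrime(i)) ++local;
            counts[t] = local;            // single write per thread
        });
    }
    for (std::thread& w : workers) w.join();
    return std::accumulate(counts.begin(), counts.end(), 0L);
}
```

With this layout, countPrimes(10000000, 4) should report the 664579 primes mentioned in the question. Note that equal-width chunks are not equal work: the chunk with the largest numbers takes longest, because trial division gets more expensive as the numbers grow.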

What are various methods to store very large integer value in a variable with less compilation time in C++ when doing operation on that variable

What should I do with this variable to make it store the following large number without downloading any new libraries? I am thinking of some manipulation such as hashing or arrays, or something else I don't know about.
For fun I've written something that works only on strings. By the way, that number you gave is awfully large number, it's something like a quintillion times the mass of our solar system in kg.
There are two methods. The first one adds one to the number and checks whether it's a palindrome. This is the slow version, but it still works for numbers of up to about 16 digits in a reasonable time.
The second method is the better way: it basically copies the left side of the number to the right side, and it's pretty much instant. As the code is now, you can run the number through both to cross-reference the results.
I can't say it's fool-proof and I'm sure there's errors in it, but it seems to work, and I did have fun writing it. Also, if you're not allowed to use ANY libraries whatsoever, it's rather easy to refactor, just use raw strings and pass the size in the function.
#include <iostream>
#include <string>
#include <chrono>
#include <stdexcept>
#include <cstring>
using namespace std::chrono;
using namespace std;
auto startT = high_resolution_clock::now();
auto endT = high_resolution_clock::now();
double timeTaken;
#define STARTCLOCK startT = high_resolution_clock::now();
#define STOPCLOCK endT = high_resolution_clock::now();
#define PRINT_ELAPSED_TIME timeTaken = duration_cast<milliseconds>(endT - startT).count() / 1000.0; \
cout << "Process took " << timeTaken << " seconds\n\n";
void addOneTo(std::string& value)
{
    int64_t idx = value.size();
    do {
        --idx;
        if (idx < 0) {
            // every digit was 9: reset to zeros and prepend a 1
            memset(&value[0], '0', value.size());
            value.insert(value.begin(), '1');
            return;
        }
        value[idx] += char(1);
        if (value[idx] > '9') { value[idx] = '0'; }
    } while (value[idx] == '0');
}
bool isPalindrome(const std::string& number)
{
    const char* start = &number[0];
    const char* end = &number[number.size() - 1];
    while (start <= end) {
        if (*start != *end) return false;
        ++start;
        --end;
    }
    return true;
}
std::string getSmallestPalindromeByBruteForceBiggerThan(std::string num)
{
    if (num.empty()) throw std::runtime_error("Empty string");
    while (true) {
        addOneTo(num);
        if (isPalindrome(num)) return num;
    }
}
std::string getSmallestPalindromeOptimisedWayBiggerThan(std::string num)
if (num.empty()) throw std::runtime_error("Empty string");
if (num.size() == 1) return num;
int64_t left;
int64_t right;
left = num.size() / 2 - 1;
if (num.size() % 2 == 0) right = num.size() / 2;
else right = num.size() / 2 + 1;
if (num[left] < num[right])
num[right] = num[left];
for (; left >= 0 && right < num.size(); --left, ++right)
num[right] = num[left];
return num;
int main()
string number = "60819750046451377";
string palindrome = getSmallestPalindromeByBruteForceBiggerThan(number);
cout << "____BRUTE FORCE____\n";
cout << "Smallest palindrome = \n" << palindrome << '\n';
palindrome = getSmallestPalindromeOptimisedWayBiggerThan(number);
cout << "____OPTIMISED____\n";
cout << "Smallest palindrome = \n" << palindrome << '\n';
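For comparison, here is a self-contained sketch of the mirroring idea ("copy the left side to the right side"), written independently of the listing above. nextPalindrome is a hypothetical name, and the sketch assumes the input is a non-empty string of decimal digits without leading zeros:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: smallest palindrome strictly greater than num.
std::string nextPalindrome(const std::string& num)
{
    assert(!num.empty());
    const std::size_t n = num.size();
    const std::size_t half = n / 2;

    // Step 1: mirror the left half onto the right half.
    std::string candidate = num;
    for (std::size_t i = 0; i < half; ++i)
        candidate[n - 1 - i] = candidate[i];
    // Digit strings of equal length compare numerically with operator>.
    if (candidate > num) return candidate;

    // Step 2: the mirror was not larger, so add one to the left half
    // (including the middle digit when the length is odd).
    std::size_t pos = (n - 1) / 2 + 1;
    while (pos > 0) {
        --pos;
        if (candidate[pos] != '9') {
            ++candidate[pos];
            break;
        }
        candidate[pos] = '0';               // carry into the next digit
        if (pos == 0)                       // all nines, e.g. 999 -> 1001
            return "1" + std::string(n - 1, '0') + "1";
    }

    // Step 3: mirror again after the increment.
    for (std::size_t i = 0; i < half; ++i)
        candidate[n - 1 - i] = candidate[i];
    return candidate;
}
```

Like the brute-force version, this returns a palindrome strictly greater than the input, so the two can be cross-checked on small numbers.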
If you don't need to perform any operations on that variable and can't use any libraries, including the C++ standard library, then use
const char* x = "1119191991900234245239919234772376189636415308431";
else the next best thing to use is a
std::string x = "1119191991900234245239919234772376189636415308431";
Even elementary arithmetic can be performed on such an encoding, the digit value at position n in the string is x[n] - '0'.
But all this is really rather silly. Suggest you look at the big number library that's part of the Boost distribution. See www.boost.org.
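To illustrate the remark about elementary arithmetic on a digit string, here is a minimal sketch of schoolbook addition using x[n] - '0' for the digit values. addDecimal is a hypothetical name, and the inputs are assumed to be non-empty strings containing only the characters '0' to '9':

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: adds two non-negative decimal numbers stored as strings.
std::string addDecimal(const std::string& a, const std::string& b)
{
    assert(!a.empty() && !b.empty());
    std::string sum;
    int carry = 0;
    std::size_t i = a.size(), j = b.size();
    // Walk both strings from the least significant digit.
    while (i > 0 || j > 0 || carry > 0) {
        int digit = carry;
        if (i > 0) digit += a[--i] - '0';
        if (j > 0) digit += b[--j] - '0';
        sum.push_back(static_cast<char>('0' + digit % 10));
        carry = digit / 10;
    }
    // Digits were produced in reverse order.
    std::reverse(sum.begin(), sum.end());
    return sum;
}
```

Subtraction and multiplication can be built the same way, but a tested big-number library remains the more practical choice for anything beyond an exercise.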

Why my Shell sorting is so slow

I am trying to implement the Shell sort algorithm myself. I wrote my own code without looking at any code samples; I only watched a video describing the algorithm.
My sort works but is very slow (bubble sort, 100 items: 0.007 s; Shell sort, 100 items: 4.83 s). How can I improve it?
void print(vector<float>vec)
for (float i : vec)
cout << i << " ";
cout << "\n\n";
void Shell_sorting(vector<float>&values)
int swapping = 0;
int step = values.size();
clock_t start;
double duration;
start = clock();
while (step/2 >= 1)
step /= 2;
for (int i = 0; i < values.size()-step; i++)
if ((i + step < values.size()))
if ((values[i + step] < values[i]))
swap(values[i], values[i + step]);
int c = i;
while (c - step > 0)
if (values[c] < values[c - step])
swap(values[c], values[c - step]);
c -= step;
duration = (clock() - start) / (double)CLOCKS_PER_SEC;
cout << swapping << " " << duration;
A better implementation could be:
#include <iostream>
#include <vector>
int main()
{
    std::vector<int> vec = { /* ... */ };
    std::vector<int> gaps = {5, 2, 1};
    int j;
    for (int gap : gaps) {
        for (int i = gap; i < vec.size(); i++) {
            j = i-gap;
            while (j >= 0) {
                if (vec[j+gap] < vec[j]) {
                    // swap the pair and keep moving towards the front
                    int temp = vec[j+gap];
                    vec[j+gap] = vec[j];
                    vec[j] = temp;
                    j = j-gap;
                }
                else break;
            }
        }
    }
    for (int item : vec) std::cout << item << " " << std::endl;
    return 0;
}
I prefer to use a vector to store the gaps so that you do not need to compute the division (which is an expensive operation). Besides, this choice gives your code more flexibility.
The outer loop cycles over the gap values. Once the gap is chosen, you iterate over your vector, starting from vec[gap], and check whether there are elements smaller than it according to the logic of Shell sort.
So, you start by setting j=i-gap and test the if condition. If it is true, swap the items and then repeat the while loop after decrementing j by gap. Note: vec[j+gap] is the element that was swapped in the previous loop cycle. If the condition is false, there's no reason to continue in the loop, so you can exit from it with a break.
On my machine, it took 0.002 s, measured using the time shell command (the time includes the process of printing the numbers).
P.S. To generate all those numbers and write them in the array, since I'm too lazy to write a random function, I used this link and then edited the output in the shell with:
sed -e 's/[[:space:]]/,/g' num | sed -e 's/$/,/'
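For reference, a compact sketch of Shell sort as a reusable function. It assumes the simple halving gap sequence (n/2, n/4, ..., 1) rather than a fixed gap table, and it shifts elements instead of swapping them. shellSort is a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Shell sort with the halving gap sequence n/2, n/4, ..., 1.
void shellSort(std::vector<int>& values)
{
    for (std::size_t gap = values.size() / 2; gap > 0; gap /= 2) {
        // Gapped insertion sort: each element is moved left in steps of gap.
        for (std::size_t i = gap; i < values.size(); ++i) {
            int current = values[i];
            std::size_t j = i;
            while (j >= gap && values[j - gap] > current) {
                values[j] = values[j - gap];   // shift the larger element right
                j -= gap;
            }
            values[j] = current;
        }
    }
    // Debug-build sanity check only.
    assert(std::is_sorted(values.begin(), values.end()));
}
```

The inner while loop stops as soon as an element is in place, which is what keeps the later passes cheap. Nothing is printed inside the loops, so timing this function measures only the sort itself.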

Optimizing bubble sort - What am I missing?

I'm trying to understand possible optimization methods for the bubble sort algorithm. I know there are better sorting methods, but I'm just curious.
To test the efficiency I'm using std::chrono. The program sorts an int array of 10000 numbers 30 times and prints the average sorting time. The numbers are picked randomly in every iteration. Here is the code, with no optimization:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <chrono>
#include <utility>
using namespace std;
int main() {
    //bubble sort
    chrono::time_point<chrono::steady_clock> start, end;
    const int n = 10000;
    int i,j, last, tests = 30,arr[n];
    long long total = 0;
    bool out;
    while (tests-->0) {
        for (i = 0; i < n; i++) {
            arr[i] = rand() % 1000;
        }
        j = n;
        start = chrono::steady_clock::now();
        while (1) {
            out = 0;
            for (i = 0; i < j - 1; i++) {
                if (arr[i + 1] < arr[i]) {
                    swap(arr[i + 1], arr[i]);
                    out = 1;
                }
            }
            if (!out) {
                break;
            }
            //j--;
        }
        end = chrono::steady_clock::now();
        total += chrono::duration_cast<chrono::nanoseconds>(end - start).count();
        cout << "Remaining :"<<tests << endl;
    }
    cout << "Average :" << total / static_cast<double>(30)/1000000000<<" seconds"; // tests(30) + nanosec -> sec
    return 0;
}
I get an average sorting time of 0.17 seconds.
If I uncomment the j--; statement at the end of each pass, to avoid comparing numbers that are already sorted, I get an average of 0.12 seconds, which is understandable.
If I remember the last position where a swap took place, I know that after that index, elements are sorted, and can thus sort up to that position in further iterations. It's better explained in the second part of this post: https://stackoverflow.com/a/16196115/1967496.
This is the code that implements the new possible optimization:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <chrono>
#include <utility>
using namespace std;
int main() {
    //bubble sort
    chrono::time_point<chrono::steady_clock> start, end;
    const int n = 10000;
    int i,j, last, tests = 30,arr[n];
    long long total = 0;
    bool out;
    while (tests-->0) {
        for (i = 0; i < n; i++) {
            arr[i] = rand() % 1000;
        }
        j = n;
        start = chrono::steady_clock::now();
        while (1) {
            out = 0;
            for (i = 0; i < j - 1; i++) {
                if (arr[i + 1] < arr[i]) {
                    swap(arr[i + 1], arr[i]);
                    out = 1;
                    last = i; // index of the most recent swap
                }
            }
            if (!out) {
                break;
            }
            j = last + 1; // everything after the last swap is sorted
        }
        end = chrono::steady_clock::now();
        total += chrono::duration_cast<chrono::nanoseconds>(end - start).count();
        cout << "Remaining :"<<tests << endl;
    }
    cout << "Average :" << total / static_cast<double>(30)/1000000000<<" seconds"; // tests(30) + nanosec -> sec
    return 0;
}
Note the last = i; and j = last + 1; statements. And here comes the problem: the average time is now again around 0.17 seconds.
Is there a problem in my code, or am I missing something ?
I repeated the test with 10 times more numbers and now get the following results:
No optimization: 19.3 seconds
First optimization(j--): 14.5 seconds
Second (supposed) optimization(j=last+1): 17.4 seconds;
From my understanding, the second method should be in any case better than the first, but the numbers tell something else.
Well... The problem is that there might not be a right or wrong answer to this question.
First of all, when you're sorting only 10000 elements, you cannot really call it an efficiency test. Try a much higher number of elements, maybe 500000 (although you will probably need to allocate the array dynamically for that).
Second of all, it might be the compiler. Compilers often try to optimize things so that the program runs faster.
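For a like-for-like comparison, the "remember the last swap" variant can be isolated in a function of its own, away from the timing code. This is only a sketch of the idea described in the question; bubbleSortLastSwap is a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Bubble sort that shrinks the scanned range to the position of the last swap.
void bubbleSortLastSwap(std::vector<int>& arr)
{
    std::size_t bound = arr.size();   // elements at index >= bound are final
    while (bound > 1) {
        std::size_t lastSwap = 0;
        for (std::size_t i = 0; i + 1 < bound; ++i) {
            if (arr[i + 1] < arr[i]) {
                std::swap(arr[i], arr[i + 1]);
                lastSwap = i + 1;     // nothing beyond this index moved later
            }
        }
        bound = lastSwap;             // 0 when no swap happened, ending the loop
    }
    // Debug-build sanity check only.
    assert(std::is_sorted(arr.begin(), arr.end()));
}
```

On uniformly random input the last swap usually lands close to the end of the scanned range, so the bound tends to shrink by little more than one element per pass. That may be part of why the measured gain over plain j-- is small or absent, although it does not by itself explain a slowdown.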

Sieve Of Atkin is surprisingly slow

I recently became very interested in prime numbers and tried making programs to calculate them. I was able to make a sieve of Sundaram program that was able to calculate a million prime numbers in a couple seconds. I believe that's pretty fast, but I wanted better. I went on to try to make a Sieve of Atkin, I slapped together working C++ code in 20 minutes after copying the pseudocode from Wikipedia.
I knew that it wouldn't be perfect because, after all, it's pseudocode. I was expecting at least better times than my Sundaram sieve, though, but I was so wrong. It's very, very slow. I have looked it over many times but I cannot find any significant changes that could be made. When looking at my code, remember: I know it's inefficient, I know I used system commands, I know it's all over the place, but this isn't a project or anything important, it's for me.
#include <iostream>
#include <fstream>
#include <time.h>
#include <Windows.h>
#include <vector>
using namespace std;
int main(){
float limit;
float slimit;
long int n;
int counter = 0;
int squarenum;
int starttime;
int endtime;
vector <bool> primes;
ofstream save;
cout << "Find all primes up to: " << endl;
cin >> limit;
slimit = sqrt(limit);
starttime = time(0);
// sets all values to false
for (int i = 0; i < limit; i++){
primes[i] = false;
//puts in possible primes
for (int x = 1; x <= slimit; x++){
for (int y = 1; y <= slimit; y++){
n = (4*x*x) + (y*y);
if (n <= limit && (n%12 == 1 || n%12 == 5)){
primes[n] = !primes[n];
n = (3*x*x) + (y*y);
if (n <= limit && n% 12 == 7){
primes[n] = !primes[n];
n = (3*x*x) - (y*y);
if ( x > y && n <= limit && n%12 == 11){
primes[n] = !primes[n];
//square number mark all multiples not prime
for (float i = 5; i < slimit; i++){
if (primes[i] == true){
for (long int k = i*i; k < limit; k = k + (i*i)){
primes[k] = false;
endtime = time(0);
cout << endl << "Calculations complete, saving in text document" << endl;
// loads to document
for (int i = 0 ; i < limit ; i++){
if (primes[i] == true){
save << counter << ") " << i << endl;
save << "Found in " << endtime - starttime << " seconds" << endl;
system ("Pause");
return 0;
This isn't exactly an answer (IMO, you've already gotten an answer in the comments), but a quick standard for comparison. A sieve of Eratosthenes should find a million primes in well under a second on a reasonably modern machine.
#include <vector>
#include <iostream>
#include <time.h>
unsigned long primes = 0;
int main() {
    // empirically derived limit to get 1,000,000 primes
    int number = 15485865;
    clock_t start = clock();
    std::vector<bool> sieve(number,false);
    sieve[0] = sieve[1] = true;
    for(int i = 2; i<number; i++) {
        if(!sieve[i]) {
            ++primes; // i was never marked, so it is prime
            for (int temp = 2*i; temp<number; temp += i)
                sieve[temp] = true;
        }
    }
    clock_t stop = clock();
    std::cout << "Total primes: " << primes << "\n";
    std::cout << "Time: " << double(stop - start) / CLOCKS_PER_SEC << " seconds\n";
    return 0;
}
Running this on my laptop, I get a result of:
Total primes: 1000000
Time: 0.106 seconds
Obviously, speed will vary somewhat with processor, clock speed, etc., but with anything reasonably modern, I'd still expect a time of less than a second. Of course, if you decide to write the primes out to a file, you can expect that to add some time, but even with that I'd expect a total time under a second--with my laptop's relatively slow hard drive, writing out the numbers only gets the total up to about 0.6 seconds.
vector<bool> is a bitset. It is expensive to update bitset values that are not in cache. Try vector<char>; it is much cheaper to write to.
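A sketch of that suggestion applied to a plain sieve of Eratosthenes, with one byte per flag instead of one bit. countPrimesBelow is a hypothetical name, and whether the byte version is actually faster depends on the machine and the limit, so it is worth measuring both:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts the primes strictly below limit.
std::size_t countPrimesBelow(long long limit)
{
    assert(limit >= 0);
    if (limit < 3) return 0;                    // there is no prime below 2
    const std::size_t n = static_cast<std::size_t>(limit);
    std::vector<char> composite(n, 0);          // one byte per number
    std::size_t count = 0;
    for (std::size_t i = 2; i < n; ++i) {
        if (composite[i]) continue;
        ++count;
        // Start marking at i * i; the guard avoids overflow in i * i.
        if (i <= n / i)
            for (std::size_t m = i * i; m < n; m += i)
                composite[m] = 1;
    }
    return count;
}
```

The trade-off is memory: the byte vector is eight times larger than the packed vector<bool>, so for very large limits the bitset may win again simply because more of it fits in cache.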