Cilk Plus code result depends on number of workers

Cilk Plus code result depends on number of workers - c++

I have a small piece of code that I would like to parallelize as I upscale. I've been using cilk_for from Cilk Plus to run the multithreading. The trouble is that I get a different result depending on the number of workers.
I've read that this might be due to a race condition, but I'm not sure what specifically about the code causes that or how to ameliorate it. Also, I realize that long and __float128 are overkill for this problem, but might be necessary in the upscaling.
Code:
#include <assert.h>
#include "cilk/cilk.h"
#include <cstring>
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <string>
#include <vector>
using namespace std;
__float128 direct(const vector<double>& Rpct, const vector<unsigned>& values, double Rbase, double toWin) {
unsigned count = Rpct.size();
__float128 sumProb = 0.0;
__float128 rProb = 0.0;
long nCombo = static_cast<long>(pow(2, count));
// for (long j = 0; j < nCombo; ++j) { //over every combination
cilk_for (long j = 0; j < nCombo; ++j) { //over every combination
vector<unsigned> binary;
__float128 prob = 1.0;
unsigned point = Rbase;
for (unsigned i = 0; i < count; ++i) { //over all the individual events
long exp = static_cast<long>(pow(2, count-i-1));
bool odd = (j/exp) % 2;
if (odd) {
binary.push_back(1);
point += values[i];
prob *= static_cast<__float128>(Rpct[i]);
} else {
binary.push_back(0);
prob *= static_cast<__float128>(1.0 - Rpct[i]);
}
}
sumProb += prob;
if (point >= toWin) rProb += prob;
assert(sumProb >= rProb);
}
//print sumProb
cout << " sumProb = " << (double)sumProb << endl;
assert( fabs(1.0 - sumProb) < 0.01);
return rProb;
}
int main(int argc, char *argv[]) {
vector<double> Rpct;
vector<unsigned> value;
value.assign(20,1);
Rpct.assign(20,0.25);
unsigned Rbase = 22;
unsigned win = 30;
__float128 rProb = direct(Rpct, value, Rbase, win);
cout << (double)rProb << endl;
return 0;
}
Sample output for export CILK_NWORKERS=1 && ./code.exe:
sumProb = 1
0.101812
Sample output for export CILK_NWORKERS=4 && ./code.exe:
sumProb = 0.948159
Assertion failed: (fabs(1.0 - sumProb) < 0.01), function direct, file code.c, line 61.
Abort trap: 6

It is because of a race condition. cilk_for is implementation of parallel for algorithm. If you want to use parallel for you must use independent iteration (independent data). It`is very important. You have to use cilk reducers for your case: https://www.cilkplus.org/tutorial-cilk-plus-reducers

To clarify, there is at least one race on sumProb. Each of the parallel workers will do a read/modify/write on that location. As sribin mentioned above, solving problems like this is what reducers are for.
It's entirely possible that there's more than one race in your program. The only way to be sure is to run it under a race detector, since finding races is one of the things that computers are much better at than humans. A free possibility is the Cilkscreen race detector, available from the cilkplus.org website. Unfortunately it doesn't support gcc/g++.

Related

Random generation with TRNG

For the following code which generates random numbers for Monte Carlo simulation, I need to receive the exact sum for each run, but this will not happen, although I have fixed the seed. I would appreciate it if anyone could point out the problem with this code
#include <cmath>
#include <random>
#include <iostream>
#include <chrono>
#include <cfloat>
#include <iomanip>
#include <cstdlib>
#include <omp.h>
#include <trng/yarn2.hpp>
#include <trng/mt19937_64.hpp>
#include <trng/uniform01_dist.hpp>
using namespace std;
using namespace chrono;
const double landa = 1;
const double exact_solution = landa / (pow(landa, 2) + 1);
double function(double x) {
return cos(x) / landa;
}
int main() {
int rank;
const int N = 1000000;
double sum = 0.0;
trng::yarn2 r[6];
for (int i = 0; i <6; i++)
{
r[i].seed(0);
}
for (int i = 0; i < 6; i++)
{
r[i].split(6,i);
}
trng::uniform01_dist<double> u;
auto start = high_resolution_clock::now();
#pragma omp parallel num_threads(6)
{
rank=omp_get_thread_num();
#pragma omp for reduction (+: sum)
for (int i = 0; i<N; ++i) {
//double x = distribution(g);
double x= u(r[rank]);
x = (-1.0 / landa) * log(1.0 - x);
sum = sum+function(x);
}
}
double app = sum / static_cast<double> (N);
auto end = high_resolution_clock::now();
auto diff=duration_cast<milliseconds>(end-start);
cout << "Approximation is: " <<setprecision(17) << app << "\t"<<"Time: "<< setprecision(17) << diff.count()<<" Error: "<<(app-exact_solution)<< endl;
return 0;
}

TL;DR The problem is two-fold:
Floating point addition is not associative;
You are generating different random number for each thread.
I need to receive the exact sum for each run, but this will not
happen, although I have fixed the seed. I would appreciate it if
anyone could point out the problem with this code
First, you have a race-condition on rank=omp_get_thread_num();, the variable rank is shared among all threads, to fix that you can declared the variable rank inside the parallel region, hence, making it private to each thread.
#pragma omp parallel num_threads(6)
{
int rank=omp_get_thread_num();
...
}
In your code, you should not expect that the value of the sum will be the same for different number of threads. Why ?
because you are adding doubles in parallel
double sum = 0.0;
...
#pragma omp for reduction (+: sum)
for (int i = 0; i<N; ++i) {
//double x = distribution(g);
double x= u(r[rank]);
x = (-1.0 / landa) * log(1.0 - x);
sum = sum+function(x);
}
and from What Every Computer Scientist Should Know about Floating
Point Arithmetic one can read:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the
expression (x+y)+z has a totally different answer than x+(y+z) when
x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the
latter).
Hence, from that you conclude that floating point addition is not
associative, and the reason why for a different number of threads you might have different sum values.
You are generating different random values per thread:
for (int i = 0; i < 6; i++)
{
r[i].split(6,i);
}
Consequently, for different number of threads, the variable sum
gets different results as well.
As kindly point out by jérôme-richard in the comments:
Note that more precise algorithm like the Kahan summation can
significantly reduces the rounding issue while being still relatively
fast.

Performance of pow(x,3.0f) vs xxx?

The following program...
int main() {
float t = 0;
for (int i = 0; i < 1'000'000'000; i++) {
const float x = i;
t += x*x*x;
}
return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <cmath>
int main() {
float t = 0;
for (int i = 0; i < 1'000'000'000; i++) {
const float x = i;
t += std::pow(x,3.0f);
}
return t;
}
...takes about 6600ms to complete.
I'm kind of suprised that the optimizer doesn't inline the std::pow function so that the two programs produce the same code and have identical performance.
Any insights? How do you account for the 5x performance difference?
For reference I'm using gcc -O3 on Linux x86
Update: (C Version)
int main() {
float t = 0;
for (int i = 0; i < 1000000000; i++) {
const float x = i;
t += x*x*x;
}
return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <math.h>
int main() {
float t = 0;
for (int i = 0; i < 1000000000; i++) {
const float x = i;
t += powf(x,3.0f);
}
return t;
}
...takes about 6600ms to complete.
Update 2
The following program:
#include <math.h>
int main() {
float t = 0;
for (int i = 0; i < 1000000000; i++) {
const float x = i;
t += __builtin_powif(x,3.0f);
}
return t;
}
runs in 900ms like the first program.
Why isn't pow being inlined to __builtin_powif ?
Update 3:
With -ffast-math the following program:
#include <math.h>
#include <iostream>
int main() {
float t = 0;
for (int i = 0; i < 1'000'000'000; i++) {
const float x = i;
t += powf(x, 3.0f);
}
std::cout << t;
}
runs in 227ms (as does the x*x*x version). That's 200 picoseconds per iteration. Using -fopt-info it says optimized: loop vectorized using 16 byte vectors and optimized: loop with 2 iterations completely unrolled so I guess that means its doing iterations in batches of 4 for SSE and doing 2 iterations at once pipelining (for a total of 8 iterations at once), or something like that?

The doc page about gcc builtins is explicit (emphasize mine):
Built-in Function: double __builtin_powi (double, int)
Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.
Built-in Function: float __builtin_powif (float, int)
Similar to __builtin_powi, except the argument and return types are float.
As __builtin_powif has equivalent performances to a a mere product, it means that the additional time is used to the controls required by pow for its guarantees about precision and rounding.

% Assuming your compiler chose to just call pow in the shared library like https://godbolt.org/z/re3baK (without -ffast-math)
I did not take a look at how pow(float, float) is implemented, but I see some points.
x*x*x is inlined while pow can't be as it is in a shared library - function call overhead difference
Whether the exponent 3.0 is constant? If compiler know something is constant, it is likely to generate more efficient code
x*x*x : Just generates assembly for float value multiplication twice
pow : This must have considered all the exponent values so probably it has general code(less efficient, may include loops)

Time measurement of single arithmetic operation with gettimeofday()

I am trying to measure the time of atomics operations like bitwise for example.
The problem I had is that I can't just compute 0&1, because the IDE doing optimisation and ignoring this command, so I had to use assignment
num = 0&1.
So to get the accurate time of the operation without the assignment I was checking the time it takes to do an only assignment, I did that with x=0;
and return at the end something like this
return assign_and_comp - assign_only;
The problem is that I'm getting negative results pretty frequently.
Is it possible that num=0&1 cost less then x=0 ?
I cant use any time measuring function except gettimeofday() unfortunately
I've saw This soution , first im forced to use gettimeofday() but the most importent thing is that im mesuaring in the same way, geting the time before and after the operationg, and returning the diff.
BUT, i'm trying to isolate the assigment from the operationg, and this is not what they are doing in the soultion.
This is my full code.
#include <iostream>
#include <sys/time.h>
#include "osm.h"
#define NUM_ITERATIONS 1000000
#define SECOND_TO_NANO 1000000000.0
#define MICRO_TO_NANO 1000.0
using namespace std;
//globals variabels
struct timeval tvalBefore, tvalAfter;
double assign_only = 0.0;
int main() {
osm_init();
cout << osm_operation_time(50000) << endl;
return 0;
}
int osm_init(){
int x=0;
gettimeofday(&tvalBefore,NULL);
for (int i=0; i<NUM_ITERATIONS; i++){
x = 0;
}
gettimeofday(&tvalAfter,NULL);
assign_only = ((tvalAfter.tv_sec-tvalBefore.tv_sec)*SECOND_TO_NANO+
(tvalAfter.tv_usec-tvalBefore.tv_usec)*MICRO_TO_NANO)/NUM_ITERATIONS;
return 0;
}
double osm_operation_time(unsigned int iterations){
volatile int num=0;
gettimeofday(&tvalBefore,NULL);
for (int i=0; i<iterations; i++){
num = 0&1;
}
gettimeofday(&tvalAfter,NULL);
double assign_and_comp = ((tvalAfter.tv_sec-tvalBefore.tv_sec)*SECOND_TO_NANO+
(tvalAfter.tv_usec-tvalBefore.tv_usec)*MICRO_TO_NANO)/iterations;
return assign_and_comp-assign_only;
}

Rounding errors giving incorrect tesults in DFT?

I have been beating my head against the wall on this DFT. It should print out: 8,0,0,0,0,0,0,0 but instead I get 8 and then very very tiny numbers. Are these rounding errors? Is there anything I can do? My Radix2 FFT gives correct results, it seems silly a DFT could not also work.
I started with complex numbers so I know there is a good bit missing, I tried to strip it down to illustrate the problem.
#include <cstdlib>
#include <math.h>
#include <iostream>
#include <complex>
#include <cassert>
#define SIZE 8
#define M_PI 3.14159265358979323846
void fft(const double src[], double dst[], const unsigned int n)
{
for(int i=0; i < SIZE; i++)
{
const double ph = -(2*M_PI) / n;
const int gid = i;
double res = 0.0f;
for (int k = 0; k < n; k++) {
double t = src[k];
const double val = ph * k * gid;
double cs = cos(val);
double sn = sin(val);
res += ((t * cs) - (t * sn));
int a = 1;
}
dst[i] = res;
std::cout << dst[i] << std::endl;
}
}
int main(void)
{
double array1[SIZE];
double array2[SIZE];
for(int i=0; i < SIZE; i++){
array1[i] = 1;
array2[i] = 0;
}
fft(array1, array2, SIZE);
return 666;
}

An FFT can actually produce more accurate results than a straight DFT calculation, as the fewer arithmetic ops usually allow fewer opportunities for arithmetic quantization errors to accumulate. There's a paper by one of the FFTW authors on this topic.
Since the DFT/FFT deal with a transcendental basis function, the results will never (except perhaps in a few special cases, or by lucky accident) be exactly correct using any non-symbolic and finite computer number format. So values very close (within a few LSB) to zero should simply be ignored as noise, or considered to be the same as zero.

Is using a vector of boolean values slower than a dynamic bitset?

Is using a vector of boolean values slower than a dynamic bitset?
I just heard about boost's dynamic bitset, and I was wondering is it worth
the trouble. Can I just use vector of boolean values instead?

A great deal here depends on how many Boolean values you're working with.
Both bitset and vector<bool> normally use a packed representation where a Boolean is stored as only a single bit.
On one hand, that imposes some overhead in the form of bit manipulation to access a single value.
On the other hand, that also means many more of your Booleans will fit in your cache.
If you're using a lot of Booleans (e.g., implementing a sieve of Eratosthenes) fitting more of them in the cache will almost always end up a net gain. The reduction in memory use will gain you a lot more than the bit manipulation loses.
Most of the arguments against std::vector<bool> come back to the fact that it is not a standard container (i.e., it does not meet the requirements for a container). IMO, this is mostly a question of expectations -- since it says vector, many people expect it to be a container (other types of vectors are), and they often react negatively to the fact that vector<bool> isn't a container.
If you're using the vector in a way that really requires it to be a container, then you probably want to use some other combination -- either deque<bool> or vector<char> can work fine. Think before you do that though -- there's a lot of (lousy, IMO) advice that vector<bool> should be avoided in general, with little or no explanation of why it should be avoided at all, or under what circumstances it makes a real difference to you.
Yes, there are situations where something else will work better. If you're in one of those situations, using something else is clearly a good idea. But, be sure you're really in one of those situations first. Anybody who tells you (for example) that "Herb says you should use vector<char>" without a lot of explanation about the tradeoffs involved should not be trusted.
Let's give a real example. Since it was mentioned in the comments, let's consider the Sieve of Eratosthenes:
#include <vector>
#include <iostream>
#include <iterator>
#include <chrono>
unsigned long primes = 0;
template <class bool_t>
unsigned long sieve(unsigned max) {
std::vector<bool_t> sieve(max, false);
sieve[0] = sieve[1] = true;
for (int i = 2; i < max; i++) {
if (!sieve[i]) {
++primes;
for (int temp = 2 * i; temp < max; temp += i)
sieve[temp] = true;
}
}
return primes;
}
// Warning: auto return type will fail with older compilers
// Fine with g++ 5.1 and VC++ 2015 though.
//
template <class F>
auto timer(F f, int max) {
auto start = std::chrono::high_resolution_clock::now();
primes += f(max);
auto stop = std::chrono::high_resolution_clock::now();
return stop - start;
}
int main() {
using namespace std::chrono;
unsigned number = 100000000;
auto using_bool = timer(sieve<bool>, number);
auto using_char = timer(sieve<char>, number);
std::cout << "ignore: " << primes << "\n";
std::cout << "Time using bool: " << duration_cast<milliseconds>(using_bool).count() << "\n";
std::cout << "Time using char: " << duration_cast<milliseconds>(using_char).count() << "\n";
}
We've used a large enough array that we can expect a large portion of it to occupy main memory. I've also gone to a little pain to ensure that the only thing that changes between one invocation and the other is the use of a vector<char> vs. vector<bool>. Here are some results. First with VC++ 2015:
ignore: 34568730
Time using bool: 2623
Time using char: 3108
...then the time using g++ 5.1:
ignore: 34568730
Time using bool: 2359
Time using char: 3116
Obviously, the vector<bool> wins in both cases--by around 15% with VC++, and over 30% with gcc. Also note that in this case, I've chosen the size to show vector<char> in quite favorable light. If, for example, I reduce number from 100000000 to 10000000, the time differential becomes much larger:
ignore: 3987474
Time using bool: 72
Time using char: 249
Although I haven't done a lot of work to confirm, I'd guess that in this case, the version using vector<bool> is saving enough space that the array fits entirely in the cache, while the vector<char> is large enough to overflow the cache, and involve a great deal of main memory access.

You should usually avoid std::vector<bool> because it is not a standard container. It's a packed version, so it breaks some valuable guarantees usually given by a vector. A valid alternative would be to use std::vector<char> which is what Herb Sutter recommends.
You can read more about it in his GotW on the subject.
Update:
As has been pointed out, vector<bool> can be used to good effect, as a packed representation improves locality on large data sets. It may very well be the fastest alternative depending on circumstances. However, I would still not recommend it by default since it breaks many of the promises established by std::vector and the packing is a speed/memory tradeoff which may be beneficial in both speed and memory.
If you choose to use it, I would do so after measuring it against vector<char> for your application. Even then, I'd recommend using a typedef to refer to it via a name which does not seem to make the guarantees which it does not hold.

#include "boost/dynamic_bitset.hpp"
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
int main(int, char*[])
{
auto gen = std::bind(std::uniform_int_distribution<>(0, 1), std::default_random_engine());
std::vector<char> randomValues(1000000);
for (char & randomValue : randomValues)
{
randomValue = static_cast<char>(gen());
}
// many accesses, few initializations
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
std::vector<bool> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Time taken1: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< " milliseconds" << std::endl;
auto start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
boost::dynamic_bitset<> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end2 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken2: " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count()
<< " milliseconds" << std::endl;
auto start3 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 500; ++i)
{
std::vector<char> test(1000000, false);
for (int j = 0; j < test.size(); ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end3 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken3: " << std::chrono::duration_cast<std::chrono::milliseconds>(end3 - start3).count()
<< " milliseconds" << std::endl;
// few accesses, many initializations
auto start4 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
std::vector<bool> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end4 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken4: " << std::chrono::duration_cast<std::chrono::milliseconds>(end4 - start4).count()
<< " milliseconds" << std::endl;
auto start5 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
boost::dynamic_bitset<> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end5 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken5: " << std::chrono::duration_cast<std::chrono::milliseconds>(end5 - start5).count()
<< " milliseconds" << std::endl;
auto start6 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; ++i)
{
std::vector<char> test(1000000, false);
for (int j = 0; j < 500; ++j)
{
test[j] = static_cast<bool>(randomValues[j]);
}
}
auto end6 = std::chrono::high_resolution_clock::now();
std::cout << "Time taken6: " << std::chrono::duration_cast<std::chrono::milliseconds>(end6 - start6).count()
<< " milliseconds" << std::endl;
return EXIT_SUCCESS;
}
Time taken1: 1821 milliseconds
Time taken2: 1722 milliseconds
Time taken3: 25 milliseconds
Time taken4: 1987 milliseconds
Time taken5: 1993 milliseconds
Time taken6: 10970 milliseconds
dynamic_bitset = std::vector<bool>
if you allocate many times but you only access the array that you created few times, go for std::vector<bool> because it has lower allocation/initialization time.
if you allocate once and access many times, go for std::vector<char>, because of faster access
Also keep in mind that std::vector<bool> is NOT safe to be used is in multithreading because you might write to different bits but it might be the same byte.

It appears that the size of a dynamic bitset cannot be changed:
"The dynamic_bitset class is nearly identical to the std::bitset class. The difference is that the size of the dynamic_bitset (the number of bits) is specified at run-time during the construction of a dynamic_bitset object, whereas the size of a std::bitset is specified at compile-time through an integer template parameter." (from http://www.boost.org/doc/libs/1_36_0/libs/dynamic_bitset/dynamic_bitset.html)
As such, it should be slightly faster since it will have slightly less overhead than a vector, but you lose the ability to insert elements.

UPDATE: I just realize that OP was asking about vector<bool> vs bitset, and my answer does not answer the question, but I think I should leave it, if you search for c++ vector bool slow, you end up here.
vector<bool> is terribly slow. At least on my Arch Linux system (you can probably get a better implementation or something... but I was really surprised). If anybody has any suggestions why this is so slow, I'm all ears! (Sorry for the blunt beginning, here's the more professional part.)
I've written two implementations of the SOE, and the 'close to metal' C implementation is 10 times faster. sievec.c is the C implementation, and sievestl.cpp is the C++ implementation. I just compiled with make (implicit rules only, no makefile): and the results were 1.4 sec for the C version, and 12 sec for the C++/STL version:
sievecmp % make -B sievec && time ./sievec 27
cc sievec.c -o sievec
aa 1056282
./sievec 27 1.44s user 0.01s system 100% cpu 1.455 total
and
sievecmp % make -B sievestl && time ./sievestl 27
g++ sievestl.cpp -o sievestl
1056282./sievestl 27 12.12s user 0.01s system 100% cpu 12.114 total
sievec.c is as follows:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long prime_t;
typedef unsigned long word_t;
#define LOG_WORD_SIZE 6
#define INDEX(i) ((i)>>(LOG_WORD_SIZE))
#define MASK(i) ((word_t)(1) << ((i)&(((word_t)(1)<<LOG_WORD_SIZE)-1)))
#define GET(p,i) (p[INDEX(i)]&MASK(i))
#define SET(p,i) (p[INDEX(i)]|=MASK(i))
#define RESET(p,i) (p[INDEX(i)]&=~MASK(i))
#define p2i(p) ((p)>>1) // (((p-2)>>1))
#define i2p(i) (((i)<<1)+1) // ((i)*2+3)
unsigned long find_next_zero(unsigned long from,
unsigned long *v,
size_t N){
size_t i;
for (i = from+1; i < N; i++) {
if(GET(v,i)==0) return i;
}
return -1;
}
int main(int argc, char *argv[])
{
size_t N = atoi(argv[1]);
N = 1lu<<N;
// printf("%u\n",N);
unsigned long *v = malloc(N/8);
for(size_t i = 0; i < N/64; i++) v[i]=0;
unsigned long p = 3;
unsigned long pp = p2i(p * p);
while( pp <= N){
for(unsigned long q = pp; q < N; q += p ){
SET(v,q);
}
p = p2i(p);
p = find_next_zero(p,v,N);
p = i2p(p);
pp = p2i(p * p);
}
unsigned long sum = 0;
for(unsigned long i = 0; i+2 < N; i++)
if(GET(v,i)==0 && GET(v,i+1)==0) {
unsigned long p = i2p(i);
// cout << p << ", " << p+2 << endl;
sum++;
}
printf("aa %lu\n",sum);
// free(v);
return 0;
}
sievestl.cpp is as follows:
#include <iostream>
#include <vector>
#include <sstream>
using namespace std;
inline unsigned long i2p(unsigned long i){return (i<<1)+1; }
inline unsigned long p2i(unsigned long p){return (p>>1); }
inline unsigned long find_next_zero(unsigned long from, vector<bool> v){
size_t N = v.size();
for (size_t i = from+1; i < N; i++) {
if(v[i]==0) return i;
}
return -1;
}
int main(int argc, char *argv[])
{
stringstream ss;
ss << argv[1];
size_t N;
ss >> N;
N = 1lu<<N;
// cout << N << endl;
vector<bool> v(N);
unsigned long p = 3;
unsigned long pp = p2i(p * p);
while( pp <= N){
for(unsigned long q = pp; q < N; q += p ){
v[q] = 1;
}
p = p2i(p);
p = find_next_zero(p,v);
p = i2p(p);
pp = p2i(p * p);
}
unsigned sum = 0;
for(unsigned long i = 0; i+2 < N; i++)
if(v[i]==0 and v[i+1]==0) {
unsigned long p = i2p(i);
// cout << p << ", " << p+2 << endl;
sum++;
}
cout << sum;
return 0;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Cilk Plus code result depends on number of workers - c++

It is because of a race condition. cilk_for is implementation of parallel for algorithm. If you want to use parallel for you must use independent iteration (independent data). It`is very important. You have to use cilk reducers for your case: https://www.cilkplus.org/tutorial-cilk-plus-reducers

Related

Random generation with TRNG

Performance of pow(x,3.0f) vs xxx?

Time measurement of single arithmetic operation with gettimeofday()

Rounding errors giving incorrect tesults in DFT?

Is using a vector of boolean values slower than a dynamic bitset?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Cilk Plus code result depends on number of workers - c++

It is because of a race condition. cilk_for is implementation of parallel for algorithm. If you want to use parallel for you must use independent iteration (independent data). It`is very important. You have to use cilk reducers for your case: https://www.cilkplus.org/tutorial-cilk-plus-reducers

Related

Random generation with TRNG

Performance of pow(x,3.0f) vs x*x*x?

Time measurement of single arithmetic operation with gettimeofday()

Rounding errors giving incorrect tesults in DFT?

Is using a vector of boolean values slower than a dynamic bitset?

Categories

Resources

Performance of pow(x,3.0f) vs xxx?