Why does rand from stdlib not follow law of large numbers? - c++

In the following code I simulate a die that is rolled billions of times, and I expect the average outcome to converge to 3.5. The average does, but the percentage of throws for which the running average lies above 3.5 is sometimes around 5 percent and other times (with a different seed, of course) around 95 percent. Even after as many as 6040M throws, it never ends up near 50% above and 50% below 3.5. Obviously there's a little bias in rand()...
I know that 'real random' doesn't exist, but is it really this obvious?
Typical outputs are:
Average: 3.50003 counter: 3427000000 Percentage above: 83.2554 Perc abs above counter: 50.0011
Average: 3.49999 counter: 1093000000 Percentage above: 92.6983 Perc abs above counter: 50.0003
#include <stdio.h> /* printf, scanf, puts, NULL */
#include <stdlib.h> /* srand, rand */
#include <time.h> /* time */
#include <unistd.h>
#include <iostream>
using namespace std;
int main ()
{
long long int this_nr;
long long int counter = 0;
long long int above_counter = 0;
long long int below_counter = 0;
long long int above_counter_this = 0;
long long int below_counter_this = 0;
long long int interval_counter = 0;
double avg = 0.0;
srand (time(NULL));
cout.precision(6);
while(1) {
this_nr = rand() % 6 + 1; // 1,2,3,4,5 or 6
avg = ((double) this_nr + ((double)counter * (double) avg))
/ ((double) counter+1.0);
if (this_nr <= 3) below_counter_this++;
if (this_nr >= 4) above_counter_this++;
if (avg < 3.5) below_counter++;
if (avg > 3.5) above_counter++;
if (interval_counter >= 1000000) {
cout << "Average: " << avg << " counter: " << counter << " Percentage above: "
<< (double) above_counter / (double) counter * 100.0
<< " Perc abs above counter: " << 100.0 * above_counter_this / counter
<< " \r";
interval_counter = 0;
}
//usleep(1);
counter++;
interval_counter++;
}
}

rand() is well known to be a poor generator in many implementations, and it's often particularly bad in the low bits. Performing % 6 leans on those low bits (the parity of the result comes straight from the lowest bit). There's also a chance that you're experiencing some modulo bias, but I'd expect that effect to be relatively minor.
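As a point of comparison, here is a minimal sketch of the same die roll using the C++11 `<random>` facilities instead of `rand() % 6`. The function name `average_die_roll` and its parameters are made up for illustration. It only shows how to draw from 1..6 without the modulo step, and it does not by itself settle what causes the percentages seen in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper: average of `rolls` die throws drawn from std::mt19937_64.
// std::uniform_int_distribution maps the engine output onto 1..6 directly,
// so no "% 6" reduction of the raw generator output is involved.
double average_die_roll(std::uint64_t seed, std::uint64_t rolls) {
    assert(rolls > 0);
    std::mt19937_64 engine(seed);
    std::uniform_int_distribution<int> die(1, 6);
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < rolls; ++i) {
        total += static_cast<std::uint64_t>(die(engine));
    }
    return static_cast<double>(total) / static_cast<double>(rolls);
}
```

Calling `average_die_roll(seed, 1000000)` with any seed should give a value close to 3.5. The exact digits depend on the standard library implementation.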

Related

My program for calculating pi using Chudnovsky in C++ precision problem

My code:
#include <iostream>
#include <iomanip>
#include <cmath>
long double fac(long double num) {
long double result = 1.0;
for (long double i=2.0; i<num; i++)
result *= i;
return result;
}
int main() {
using namespace std;
long double pi=0.0;
for (long double k = 0.0; k < 10.0; k++) {
pi += (pow(-1.0,k) * fac(6.0 * k) * (13591409.0 + (545140134.0 * k)))
/ (fac(3.0 * k) * pow(fac(k), 3.0) * pow(640320.0, 3.0 * k + 3.0/2.0));
}
pi *= 12.0;
cout << setprecision(100) << 1.0 / pi << endl;
return 0;
}
My output:
3.1415926535897637228433865175247774459421634674072265625
The problem with this output is that it printed 56 digits instead of 100. How do I fix that?
First of all, your factorial is wrong: the loop should be for (long double i=2.0; i<=num; i++) instead of i<num.
As mentioned in the comments, double can hold only about 16 decimal digits, so 100 digits is not doable with this method. There are two ways to remedy this:
use a high-precision datatype
There are libraries for this, or you can implement it on your own, since you need just a few basic operations. Note that to represent 100 digits you need at least
ceil(100 digits/log10(2)) = 333 bits
of mantissa (or fixed-point integer), while double has only 53:
53*log10(2) = 15.954589770191003346328161420398 digits
use a different method of computing PI
For arbitrary precision I recommend BBP. However, if you want just 100 digits you can use a simple Taylor-series-based program like this one, which works on an integer array (no need for any high-precision datatype nor FPU):
//The following 160 character C program, written by Dik T. Winter at CWI, computes pi to 800 decimal digits.
int a=10000,b=0,c=2800,d=0,e=0,f[2801],g=0;main(){for(;b-c;)f[b++]=a/5;
for(;d=0,g=c*2;c-=14,printf("%.4d",e+d/a),e=d%a)for(b=c;d+=f[b]*a,f[b]=d%--g,d/=g--,--b;d*=b);}
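The bit counts quoted in this answer can be checked with a few lines of code. This is only a sketch, and the helper names `bits_for_decimal_digits` and `decimal_digits_for_bits` are made up for illustration. `std::numeric_limits<double>::digits` reports the mantissa width of `double` on the platform at hand, which is 53 on the usual IEEE 754 systems.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Mantissa bits needed to hold `decimal_digits` decimal digits.
int bits_for_decimal_digits(int decimal_digits) {
    assert(decimal_digits > 0);
    return static_cast<int>(std::ceil(decimal_digits / std::log10(2.0)));
}

// Decimal digits (possibly fractional) carried by a mantissa of `bits` bits.
double decimal_digits_for_bits(int bits) {
    assert(bits > 0);
    return bits * std::log10(2.0);
}
```

For example, `bits_for_decimal_digits(100)` gives 333, and `decimal_digits_for_bits(std::numeric_limits<double>::digits)` gives just under 16 where `double` has a 53-bit mantissa.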
Aside from the obvious precision limits, your implementation is poor in both performance and precision, which is why you lose precision sooner than necessary: you hit the limits of double after very few iterations of k. You can rewrite the iteration so that the subresults stay as small as possible (in terms of mantissa bits) and unnecessary computation is avoided. Here are a few hints:
Why are you computing the same factorials again and again?
You have k! in a loop where k is incrementing, so why not just multiply a variable holding the current factorial by k instead? For example:
//for ( k=0;k<10;k++){ ... fac(k) ... }
for (f=1,k=0;k<10;k++){ if (k) f*=k; ... f ... }
Why are you dividing by factorials again and again?
If you think about it a bit, for a>b you can compute this instead:
a! / b! = (1*2*3*4*...*b*...*a) / (1*2*3*4*...*b)
a! / b! = (b+1)*(b+2)*...*(a)
I would not use pow at all for this
pow is a "very complex" function that causes further precision and performance losses. For example, pow(-1.0,k) can be done like this:
//for ( k=0;k<10;k++){ ... pow(-1.0,k) ... }
for (s=+1,k=0;k<10;k++,s=-s){ ... s ... }
Also, pow(640320.0, 3.0 * k + 3.0/2.0) can be computed incrementally in the same way as the factorial, and instead of pow(fac(k), 3.0) you can multiply by the variable holding fac(k) three times.
The term pow(640320.0, 3.0 * k + 3.0/2.0) outgrows even (6k)!
so you can divide one by the other to keep the subresults smaller.
These few simple tweaks improve the precision a lot: the subresults are much smaller than the naive ones (factorials grow really fast), so you overflow the range of double much later.
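As a small illustration of the factorial-ratio hint, here is a sketch that computes a!/b! as the product (b+1)*(b+2)*...*a without ever forming either factorial. The name `factorial_ratio` is made up for this example and is not part of the code in this answer.

```cpp
#include <cassert>

// a! / b! for a >= b, computed as (b+1)*(b+2)*...*a.
// Neither a! nor b! is formed, so the intermediate values stay much smaller.
long double factorial_ratio(unsigned a, unsigned b) {
    assert(a >= b);
    long double ratio = 1.0L;
    for (unsigned i = b + 1; i <= a; ++i) {
        ratio *= static_cast<long double>(i);
    }
    return ratio;
}
```

In the Chudnovsky term, the quotient (6k)!/(3k)! would then be `factorial_ratio(6 * k, 3 * k)` for an integer `k`.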
Putting all together leads to this:
double pi_Chudnovsky() // no pow,fac lower subresult
{ // https://en.wikipedia.org/wiki/Chudnovsky_algorithm
double pi,s,f,f3,k,k3,k6,p,dp,q,r;
for (pi=0.0,s=1.0,f=f3=1,k=k3=k6=0.0,p=640320.0,dp=p*p*p,p*=sqrt(p),r=13591409.0;k<27.0;k++,s=-s)
{
if (k) // f=k!, f3=(3k)!, p=pow(640320.0,3k+1.5)*(3k)!/(6k)!, r=13591409.0+(545140134.0*k)
{
p*=dp; r+=545140134.0;
f*=k; k3++; f3*=k3; k6++; p/=k6; p*=k3;
k3++; f3*=k3; k6++; p/=k6; p*=k3;
k3++; f3*=k3; k6++; p/=k6; p*=k3;
k6++; p/=k6;
k6++; p/=k6;
k6++; p/=k6;
}
q=s*r; q/=f; q/=f; q/=f; q/=p; pi+=q;
}
return 1.0/(pi*12.0);
}
As you can see, k goes up to 27, while your naive method can go only up to 18 on 64-bit doubles before overflow. However, the result is the same, as the double mantissa is saturated after 2 iterations.
I am feeling happy due to following code :)
/*
I have compiled using cygwin
change "iostream...using namespace std" OR iostream.h based on your compiler at related OS.
*/
#include <iostream>
#include <iomanip>
#include <cmath>
using namespace std;
long double fac(long double num)
{
long double result = 1.0;
for (long double i=2.0; num > i; ++i)
{
result *= i;
}
return result;
}
int main()
{
long double pi=0.0;
for (long double k = 0.0; 10.0 > k; ++k)
{
pi += (pow(-1.0,k) * fac(6.0 * k) * (13591409.0 + (545140134.0 * k)))
/ (fac(3.0 * k) * pow(fac(k), 3.0) * pow(640320.0, 3.0 * k + 3.0/2.0));
}
pi *= 12.0;
cout << "BEFORE USING setprecision VALUE OF DEFAULT PRECISION " << cout.precision() << "\n";
cout << setprecision(100) << 1.0 / pi << endl;
cout << "AFTER USING setprecision VALUE OF CURRENT PRECISION WITHOUT USING fixed " << cout.precision() << "\n";
cout << fixed;
cout << "AFTER USING setprecision VALUE OF CURRENT PRECISION USING fixed " << cout.precision() << "\n";
cout << "USING fixed PREVENT THE EARTH'S ROUNDING OFF INSIDE OUR UNIVERSE :)\n";
cout << setprecision(100) << 1.0 / pi << endl;
return 0;
}
/*
$ # Sample output:
$ g++ 73256565.cpp -o ./a.out;./a.out
$ ./a.out
BEFORE USING setprecision VALUE OF DEFAULT PRECISION 6
3.14159265358976372457810999350158454035408794879913330078125
AFTER USING setprecision VALUE OF CURRENT PRECISION WITHOUT USING fixed 100
AFTER USING setprecision VALUE OF CURRENT PRECISION USING fixed 100
USING fixed PREVENT THE EARTH'S ROUNDING OFF INSIDE OUR UNIVERSE :)
3.1415926535897637245781099935015845403540879487991333007812500000000000000000000000000000000000000000
*/

My result is not truly random; How can I fix this?

float genData(int low, int high);
int main(){
srand(time(0));
float num = genData(40, 100);
cout << fixed << left << setprecision(2) << num << endl;
return 0;
}
float genData(int low, int high) {
low *= 100;
high *= 100 + 1;
int rnd = rand() % (high - low) + low;
float newRand;
newRand = (float) rnd / 100;
return newRand;
}
I'm expecting a random number between 40 and 100 inclusively with two decimal places.
eg: 69.69, 42.00
What I get is the same number with different decimal values, slowly increasing every time I run the program.
Use the <random> header for that:
#include <iostream>
#include <random>
float getData(int const low, int const high) {
thread_local auto engine = std::mt19937{ std::random_device{}() };
auto dist = std::uniform_int_distribution<int>{ low * 100, high * 100 };
return dist(engine) / 100.0f;
}
int main() {
for (auto i = 0; i != 5; ++i) {
std::cout << getData(40, 100) << '\n';
}
}
Wrong range
int rnd = rand() % (high - low) + low; does not generate the right range.
float genData(int low, int high) {
low *= 100;
// high *= 100 + 1;
high = high*100 + 1;
expecting a random number between 40 and 100 inclusively with two decimal places. eg: 69.69, 42.00
That is [40.00 ... 100.00] or 10000-4000+1 different values
int rnd = rand() % (100*(high - low) + 1) + low*100;
float frnd = rnd/100.0f;
rand() is weak here when RAND_MAX is small
With RAND_MAX == 32767, int rnd = rand() % (high - low) + low; is like [0... 32767] % 6001 + 4000. [0... 32767] % 6001 does not provide a very uniform distribution: some values come up 6 times and others only 5.
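The size of this non-uniformity can be worked out directly by counting how many of the inputs 0..RAND_MAX land on each residue. The helper `residue_count_range` below is a hypothetical name used only for this sketch.

```cpp
#include <cassert>
#include <utility>

// For the inputs 0..max_value reduced modulo `modulus`, return how often the
// least frequent and the most frequent residues occur (in that order).
std::pair<long, long> residue_count_range(long max_value, long modulus) {
    assert(max_value >= 0 && modulus > 0);
    const long total = max_value + 1;     // number of distinct inputs
    const long base = total / modulus;    // every residue occurs at least this often
    const long extra = total % modulus;   // residues 0..extra-1 occur once more
    return {base, extra != 0 ? base + 1 : base};
}
```

With `max_value = 32767` and `modulus = 6001` this returns the pair (5, 6): 32768 inputs spread over 6001 residues leave the first 2763 residues with one extra hit each.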
If you are using C++11 you can use better random number generators, which produce higher-quality sequences than rand() and are often faster as well.
Quick Example:
#include <iostream> // For std::cout.
#include <random> // The header for the generators.
#include <ctime> // To seed the generator.
// Generating random numbers with C++11's random requires an engine and a distribution.
std::mt19937_64 rng(std::time(nullptr));
// For instance, we will create a uniform distribution for integers in the closed [low, high] range:
std::uniform_int_distribution<int> unii(low, high);
// Show the random number
std::cout << unii(rng) << ' ';
You can follow this article for more explanation from here.

How can I get a more accurate result when dividing numbers in C++

I am trying to estimate PI using C++ as a fun math project. I've run into an issue where I can only get it as precise as 6 decimal places.
I have tried using a float instead of a double but found the same result.
My code works by summing all the results of 1/n^2 where n=1 through to a defined limit. It then multiplies this result by 6 and takes the square root.
Here is a link to an image written out in mathematical notation
Here is my main function. PREC is the predefined limit. It will populate the array with the results of these fractions and get the sum. My guess is that the sqrt function is causing the issue where I cannot get more precise than 6 digits.
int main(int argc, char *argv[]) {
nthsums = new float[PREC];
for (int i = 1; i < PREC + 1; i += 1) {
nthsums[i] = nth_fraction(i);
}
float array_sum = sum_array(nthsums);
array_sum *= 6.000000D;
float result = sqrt(array_sum);
std::string resultString = std::to_string(result);
cout << resultString << "\n";
}
Just for the sake of it, I'll also include my sum function as I suspect that there could be something wrong with that, too.
float sum_array(float *array) {
float returnSum = 0;
for (int itter = 0; itter < PREC + 1; itter += 1) {
if (array[itter] >= 0) {
returnSum += array[itter];
}
}
return returnSum;
}
I would like to get at least as precise as 10 digits. Is there any way to do this in C++?
So even with long double as the floating point type used for this, there's some subtlety required because adding two long doubles of substantially different order of magnitudes can cause precision loss. See here for a discussion in Java but I believe it to be basically the same behavior in C++.
Code I used:
#include <iostream>
#include <cmath>
#include <numbers>
long double pSeriesApprox(unsigned long long t_terms)
{
long double pi_squared = 0.L;
for (unsigned long long i = t_terms; i >= 1; --i)
{
pi_squared += 6.L * (1.L / i) * (1.L / i);
}
return std::sqrtl(pi_squared);
}
int main() {
const long double pi = std::numbers::pi_v<long double>;
const unsigned long long num_terms = 10'000'000'000;
std::cout.precision(30);
std::cout << "Pi == " << pi << "\n\n";
std::cout << "Pi ~= " << pSeriesApprox(num_terms) << " after " << num_terms << " terms\n";
return 0;
}
Output:
Pi == 3.14159265358979311599796346854
Pi ~= 3.14159265349430016911469465413 after 10000000000 terms
9 decimal digits of accuracy, which is about what we'd expect from a series converging at this rate.
But if all I do is reverse the direction of the loop in pSeriesApprox, adding the exact same terms but from largest to smallest instead of smallest to largest:
long double pSeriesApprox(unsigned long long t_terms)
{
long double pi_squared = 0.L;
for (unsigned long long i = 1; i <= t_terms; ++i)
{
pi_squared += 6.L * (1.L / i) * (1.L / i);
}
return std::sqrtl(pi_squared);
}
Output:
Pi == 3.14159265358979311599796346854
Pi ~= 3.14159264365071688729358356795 after 10000000000 terms
Suddenly we're down to 7 digits of accuracy, even though we used 10 billion terms. In fact, after 100 million terms or so, the approximation to pi stabilizes at this specific value. So while using sufficiently large data types to store these computations is important, some additional care is still needed when trying to perform this kind of sum.
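One common form of that additional care is compensated (Kahan) summation, which the answer above does not use; it is shown here only as a sketch of an alternative to reordering the loop. The function names are made up for illustration. The templates make it possible to try `float`, where the effect shows up after far fewer terms than with `long double`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Plain sum of 1/i^2 for i = 1..terms, largest terms first.
template <typename T>
T basel_sum_naive(std::uint64_t terms) {
    T sum = 0;
    for (std::uint64_t i = 1; i <= terms; ++i) {
        const T x = static_cast<T>(i);
        sum += T(1) / (x * x);
    }
    return sum;
}

// Same order, but with Kahan compensation: `carry` holds the low-order part
// that was lost when the previous term was added to the much larger sum.
template <typename T>
T basel_sum_kahan(std::uint64_t terms) {
    assert(terms > 0);
    T sum = 0;
    T carry = 0;
    for (std::uint64_t i = 1; i <= terms; ++i) {
        const T x = static_cast<T>(i);
        const T term = T(1) / (x * x) - carry;
        const T next = sum + term;
        carry = (next - sum) - term;
        sum = next;
    }
    return sum;
}

// pi estimate from a partial sum of 1/i^2: pi ~= sqrt(6 * sum).
template <typename T>
T pi_from_basel(T sum) {
    return std::sqrt(T(6) * sum);
}
```

In `float`, `basel_sum_naive<float>(1000000)` stops growing once the terms fall below half a unit in the last place of the sum, while `basel_sum_kahan<float>(1000000)` stays close to the `double` result. Note that aggressive floating-point flags such as `-ffast-math` may optimize the compensation away.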

Counting iterations of the Leibniz summation for π in C++

My task is to ask the user how many decimal places of accuracy they want, and then iterate the summation until it matches the actual value of pi to that many places. So 2 decimal places would stop when the loop reaches 3.14. I have a complete program, but I am unsure if it actually works as intended. I have checked 0 and 1 decimal places with a calculator and they seem to work, but I don't want to assume it works for all of them. Also, my code may be a little clumsy since we are still learning the basics. We only just learned loops and nested loops. If there are any obvious mistakes or parts that could be cleaned up, I would appreciate any input.
Edit: I only needed to have this work for up to five decimal places. That is why my value of pi was not precise. Sorry for the misunderstanding.
#include <iostream>
#include <cmath>
using namespace std;
int main() {
const double PI = 3.141592;
int n, sign = 1;
double sum = 0,test,m;
cout << "This program determines how many iterations of the infinite series for\n"
"pi is needed to get with 'n' decimal places of the true value of pi.\n"
"How many decimal places of accuracy should there be?" << endl;
cin >> n;
double p = PI * pow(10.0, n);
p = static_cast<double>(static_cast<int>(p) / pow(10, n));
int counter = 0;
bool stop = false;
for (double i = 1;!stop;i = i+2) {
sum = sum + (1.0/ i) * sign;
sign = -sign;
counter++;
test = (4 * sum) * pow(10.0,n);
test = static_cast<double>(static_cast<int>(test) / pow(10, n));
if (test == p)
stop = true;
}
cout << "The series was iterated " << counter<< " times and reached the value of pi\nwithin "<< n << " decimal places." << endl;
return 0;
}
One of the problems of the Leibniz summation is that it has an extremely low convergence rate, as it exhibits sublinear convergence. In your program you also compare a calculated estimate of π with a given value (a 6-digit approximation), while the point of the summation should be to find the correct digits itself.
You can slightly modify your code to make it terminate the calculation if the wanted digit doesn't change between iterations (I also added a max number of iterations check). Remember that you are using doubles not unlimited precision numbers and sooner or later rounding errors will affect the calculation. As a matter of fact, the real limitation of this code is the number of iterations it takes (2,428,700,925 to obtain 3.141592653).
#include <iostream>
#include <cmath>
#include <iomanip>
using std::cout;
// this will take a long long time...
const unsigned long long int MAX_ITER = 100000000000;
int main() {
int n;
cout << "This program determines how many iterations of the infinite series for\n"
"pi is needed to get with 'n' decimal places of the true value of pi.\n"
"How many decimal places of accuracy should there be?\n";
std::cin >> n;
// precalculate some values
double factor = pow(10.0,n);
double inv_factor = 1.0 / factor;
double quad_factor = 4.0 * factor;
long long int test = 0, old_test = 0, sign = 1;
unsigned long long int count = 0;
double sum = 0;
for ( long long int i = 1; count < MAX_ITER; i += 2 ) {
sum += 1.0 / (i * sign);
sign = -sign;
old_test = test;
test = static_cast<long long int>(sum * quad_factor);
++count;
// perform the test on integer values
if ( test == old_test ) {
cout << "Reached the value of Pi within "<< n << " decimal places.\n";
break;
}
}
double pi_leibniz = static_cast<double>(inv_factor * test);
cout << "Pi = " << std::setprecision(n+1) << pi_leibniz << '\n';
cout << "The series was iterated " << count << " times\n";
return 0;
}
I have summarized the results of several runs in this table:
digits Pi iterations
---------------------------------------
0 3 8
1 3.1 26
2 3.14 628
3 3.141 2,455
4 3.1415 136,121
5 3.14159 376,848
6 3.141592 2,886,751
7 3.1415926 21,547,007
8 3.14159265 278,609,764
9 3.141592653 2,428,700,925
10 3.1415926535 87,312,058,383
Your program will never terminate, because test==p will never be true. This is a comparison between two double-precision numbers that are calculated differently. Due to round-off errors, they will not be identical, even if you run an infinite number of iterations, and your math is correct (and right now it isn't, because the value of PI in your program is not accurate).
To help you figure out what's going on, print the value of test in each iteration, as well as the distance between test and pi, as follows:
#include <cmath>
#include <iostream>
using namespace std;
int main() {
double pi = atan(1.0) * 4; // Make sure you have a precise value of PI
double sign = 1.0, sum = 0.0;
for (int i = 1; i < 1000; i += 2) {
sum = sum + (1.0 / i) * sign;
sign = -sign;
double test = 4 * sum;
cout << test << " " << fabs(test - pi) << "\n";
}
}
After you make sure the program works well, change the stopping condition eventually to be based on the distance between test and pi.
for (int i=1; fabs(test-pi)>epsilon; i+=2)
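Put together, a distance-based stopping condition might look like the following sketch. The function name `leibniz_terms_until` and the parameter `epsilon` are assumptions made for this example. The loop only terminates for tolerances the series can actually reach in double precision, so very small values of `epsilon` are best avoided.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Number of Leibniz terms summed until |4*sum - pi| drops below `epsilon`.
std::uint64_t leibniz_terms_until(double epsilon) {
    assert(epsilon > 0.0);
    const double pi = 4.0 * std::atan(1.0); // reference value of pi
    double sum = 0.0;
    double sign = 1.0;
    std::uint64_t count = 0;
    for (std::uint64_t i = 1;; i += 2) {
        sum += sign / static_cast<double>(i);
        sign = -sign;
        ++count;
        if (std::fabs(4.0 * sum - pi) < epsilon) {
            break;
        }
    }
    return count;
}
```

Because the error after N terms is roughly 1/N, each extra decimal place of tolerance costs about ten times as many iterations.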

Self numbers in c++

Hey, my friends and I are trying to beat each other's runtimes for generating "Self Numbers" between 1 and a million. I've written mine in c++ and I'm still trying to shave off precious time.
Here's what I have so far,
#include <iostream>
using namespace std;
bool v[1000000 + 54]; // room for i + digit sum, which can reach 999999 + 54
int main(void) {
long non_self = 0;
for(long i = 1; i < 1000000; ++i) {
if(!(v[i])) std::cout << i << '\n';
non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
v[non_self] = 1;
}
std::cout << "1000000" << '\n';
return 0;
}
The code works fine now, I just want to optimize it.
Any tips? Thanks.
I built an alternate C solution that doesn't require any modulo or division operations:
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
int v[1100000];
int j1, j2, j3, j4, j5, j6, s, n=0;
memset(v, 0, sizeof(v));
for (j6=0; j6<10; j6++) {
for (j5=0; j5<10; j5++) {
for (j4=0; j4<10; j4++) {
for (j3=0; j3<10; j3++) {
for (j2=0; j2<10; j2++) {
for (j1=0; j1<10; j1++) {
s = j6 + j5 + j4 + j3 + j2 + j1;
v[n + s] = 1;
n++;
}
}
}
}
}
}
for (n=1; n<=1000000; n++) {
if (!v[n]) printf("%6d\n", n);
}
}
It generates 97786 self numbers including 1 and 1000000.
With output, it takes
real 0m1.419s
user 0m0.060s
sys 0m0.152s
When I redirect output to /dev/null, it takes
real 0m0.030s
user 0m0.024s
sys 0m0.004s
on my 3 Ghz quad core rig.
For comparison, your version produces the same number of numbers, so I assume we're either both correct or equally wrong; but your version chews up
real 0m0.064s
user 0m0.060s
sys 0m0.000s
under the same conditions, or about 2x as much.
That, or the fact that you're using longs, which is unnecessary on my machine. Here, int goes up to 2 billion. Maybe you should check INT_MAX on yours?
Update
I had a hunch that it may be better to calculate the sum piecewise. Here's my new code:
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
char v[1100000];
int j1, j2, j3, j4, j5, j6, s, n=0;
int s1, s2, s3, s4, s5;
memset(v, 0, sizeof(v));
for (j6=0; j6<10; j6++) {
for (j5=0; j5<10; j5++) {
s5 = j6 + j5;
for (j4=0; j4<10; j4++) {
s4 = s5 + j4;
for (j3=0; j3<10; j3++) {
s3 = s4 + j3;
for (j2=0; j2<10; j2++) {
s2 = s3 + j2;
for (j1=0; j1<10; j1++) {
v[s2 + j1 + n++] = 1;
}
}
}
}
}
}
for (n=1; n<=1000000; n++) {
if (!v[n]) printf("%d\n", n);
}
}
...and what do you know, that brought down the time for the top loop from 12 ms to 4 ms. Or maybe 8, my clock seems to be getting a bit jittery way down there.
State of affairs, Summary
The actual finding of self numbers up to 1M is now taking roughly 4 ms, and I'm having trouble measuring any further improvements. On the other hand, as long as output is to the console, it will continue to take about 1.4 seconds, my best efforts to leverage buffering notwithstanding. The I/O time so drastically dwarfs computation time that any further optimization would be essentially futile. Thus, although inspired by further comments, I've decided to leave well enough alone.
All times cited are on my (pretty fast) machine and are for comparison purposes with each other only. Your mileage may vary.
Generate the numbers once, copy the output into your code as a gigantic string. Print the string.
Those mods (%) look expensive. If you are allowed to move to base 16 (or even base 2), then you can probably code this a lot faster. If you have to stay in decimal, try creating an array of digits for each place (units, tens, hundreds) and build some rollover code. That will make summing the digits far easier.
Alternatively, you could recognise the behaviour of the core self function (let's call it s):
s = n + f(b,n)
where f(b,n) is the sum of the digits of the number n in base b.
For base 10, as the ones (least significant) digit moves through 0,1,2,...,9, n and f(b,n) advance in lockstep when you go from n to n+1. Only the 10% of the time that a 9 rolls over to 0 do they not, so:
f(b,n+1) = f(b,n) + 1 // 90% of the time
thus the core self function s advances as
n+1 + f(b,n+1) = n + 1 + f(b,n) + 1 = n + f(b,n) + 2
s(n+1) = s(n) + 2 // again, 90% of the time
In the remaining (and easily identifiable) 10% of the time, the 9 rolls back to zero and adds one to the next digit, which in the simplest case subtracts 9-1 = 8 from the digit sum f(b,n). The carry might cascade through a series of k trailing 9s, in which case the digit sum drops by 9k-1 (17 for 99 to 100, 26 for 999 to 1000, etc.).
So the first optimisation can remove most of the work from 90% of your cycles!
if ((n % 10) != 0)
{
n + f(b,n) = n-1 + f(b,n-1) + 2;
}
or
if ((n % 10) != 0)
{
s = old_s + 2;
}
That should be enough to substantially increase your performance without really changing your algorithm.
If you want more, then work out a simple algorithm for the change between iterations for the remaining 10%.
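The fast path described above can be sketched as follows. This is an illustration under assumptions rather than a tuned implementation: the names `digit_sum`, `mark_generated` and `count_self_numbers` are made up, and the rollover case simply falls back to recomputing the digit sum instead of using a cleverer update.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of the decimal digits of n; only used on the slow path.
unsigned digit_sum(unsigned n) {
    unsigned sum = 0;
    for (; n > 0; n /= 10) {
        sum += n % 10;
    }
    return sum;
}

// generated[m] is true when m == n + digit_sum(n) for some 1 <= n <= limit.
std::vector<bool> mark_generated(unsigned limit) {
    assert(limit >= 1 && limit <= 1000000000u);
    // n + digit_sum(n) is at most limit + 9 * 10 for a 10-digit unsigned value.
    std::vector<bool> generated(static_cast<std::size_t>(limit) + 91, false);
    unsigned s = 0; // s(0) = 0 + digit_sum(0)
    for (unsigned n = 1; n <= limit; ++n) {
        if (n % 10 != 0) {
            s += 2;               // fast path: n and its last digit both grew by one
        } else {
            s = n + digit_sum(n); // slow path: a 9 rolled over to 0
        }
        generated[s] = true;
    }
    return generated;
}

// Count the self numbers in 1..limit (numbers that no n generates).
std::size_t count_self_numbers(unsigned limit) {
    const std::vector<bool> generated = mark_generated(limit);
    std::size_t count = 0;
    for (unsigned m = 1; m <= limit; ++m) {
        if (!generated[m]) {
            ++count;
        }
    }
    return count;
}
```

For instance, `count_self_numbers(100)` counts the thirteen self numbers 1, 3, 5, 7, 9, 20, 31, 42, 53, 64, 75, 86 and 97.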
If you want your output to be fast, it may be worth investigating replacing iostream output with plain old printf() - depends on the rules for winning the competition whether this is important.
Multithread (use different arrays/ranges for every thread). Also, don't use more threads than your number of CPU cores =)
cout or printf within a loop will be slow. If you can remove any prints from a loop you will see significant performance increase.
Since the range is limited (1 to 1000000) the maximum sum of the digits does not exceed 9*6 = 54. This means that to implement the sieve a circular buffer of 54 elements should be perfectly sufficient (and the size of the sieve grows very slowly as the range increases).
You already have a sieve-based solution, but it is based on pre-building the full-length buffer (sieve of 1000000 elements), which is rather inelegant (if not completely unacceptable). The performance of your solution also suffers from non-locality of memory access.
For example, this is a possible very simple implementation
#include <stdio.h>
#define N 1000000U
void print_self_numbers(void)
{
#define NMARKS 64U /* make it 64 just in case (and to make division work faster :) */
unsigned char marks[NMARKS] = { 0 };
unsigned i, imark;
for (i = 1, imark = i; i <= N; ++i, imark = (imark + 1) % NMARKS)
{
unsigned digits, sum;
if (!marks[imark])
printf("%u ", i);
else
marks[imark] = 0;
sum = i;
for (digits = i; digits > 0; digits /= 10)
sum += digits % 10;
marks[sum % NMARKS] = 1;
}
}
(I'm not going for the best possible performance in terms of CPU clocks here, just illustrating the key idea with the circular buffer.)
Of course, the range can easily be turned into a parameter of the function, while the size of the circular buffer can easily be calculated at run time from the range.
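One way to compute such a buffer size at run time is sketched below. The helper name `marks_needed` is made up for this example, and the formula is a deliberately conservative upper bound (nine per decimal digit of the limit, plus one) rather than the tightest possible size.

```cpp
#include <cassert>

// Conservative number of circular-buffer slots for sieving 1..limit:
// the largest digit sum any number with as many digits as `limit` can have,
// plus one slot for the current position.
unsigned marks_needed(unsigned limit) {
    assert(limit >= 1);
    unsigned digits = 0;
    for (unsigned n = limit; n > 0; n /= 10) {
        ++digits;
    }
    return 9u * digits + 1u;
}
```

For a limit of 1000000 this gives 64 slots, and for 999999 it gives 55.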
As for "optimizations"... There's no point in trying to optimize the code that contains I/O operations. You won't achieve anything by such optimizations. If you want to analyze the performance of the algorithm itself, you'll have to put the generated numbers into an output array and print them later.
For such simple task, the best option would be to think of alternative algorithms to produce the same result. %10 is not usually considered a fast operation.
Why not use the recurrence relation given on the wikipedia page instead?
That should be blazingly fast.
EDIT: Ignore this. The recurrence relation generates some but not all of the self numbers.
In fact only very few of them. That's not particularly clear from the Wikipedia page though :(
This may help speed up C++ iostreams output:
cin.tie(0);
ios::sync_with_stdio(false);
Put them in main before you start writing to cout.
I created a CUDA-based solution based on Carl Smotricz's second algorithm. The code to identify Self Numbers itself is extremely fast: on my machine it executes in ~45 microseconds, which is about 150 x faster than Carl Smotricz's algorithm, which ran in 7 milliseconds on my machine.
There is a bottleneck, however, and that seems to be the PCIe interface. It took my code a whopping 43 milliseconds to move the computed data from the graphics card back to RAM. This might be optimizable, and I will look into this.
Still, 45 microseconds is pretty darn fast. Scary fast, actually, and I added code to my program which runs Carl Smotricz's algorithm and compares the results for accuracy. The results are accurate. Here is the program output (compiled in VS2008 64-bit, Windows7):
UPDATE
I recompiled this code in release mode with full optimization and using static runtime libraries, with significant results. The optimizer seems to have done very well with Carl's algorithm, reducing the runtime from 7 ms to 1 ms. The CUDA implementation sped up as well, from 35 us to 20 us. The memory copy from video card to RAM was unaffected.
Program Output:
Running on device: 'Quadro NVS 295'
Reference Implementation Ran In 15603 ticks (7 ms)
Kernel Executed in 40 ms -- Breakdown:
[kernel] : 35 us (0.09%)
[memcpy] : 40 ms (99.91%)
CUDA Implementation Ran In 111889 ticks (51 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The code is as follows:
file : main.h
#pragma once
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <utility>
typedef std::pair<int*, size_t> sized_ptr;
static sized_ptr make_sized_ptr(int* ptr, size_t size)
{
return std::make_pair(ptr, size);
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMemory, unsigned const blocks, unsigned const threads);
inline std::string format_elapsed(double d)
{
char buf[256] = {0};
if( d < 0.00000001 )
{
// show in ps with 4 digits
sprintf(buf, "%0.4f ps", d * 1000000000000.0);
}
else if( d < 0.00001 )
{
// show in ns
sprintf(buf, "%0.0f ns", d * 1000000000.0);
}
else if( d < 0.001 )
{
// show in us
sprintf(buf, "%0.0f us", d * 1000000.0);
}
else if( d < 0.1 )
{
// show in ms
sprintf(buf, "%0.0f ms", d * 1000.0);
}
else if( d <= 60.0 )
{
// show in seconds
sprintf(buf, "%0.2f s", d);
}
else if( d < 3600.0 )
{
// show in min:sec
sprintf(buf, "%01.0f:%02.2f", floor(d/60.0), fmod(d,60.0));
}
// show in h:min:sec
else
sprintf(buf, "%01.0f:%02.0f:%02.2f", floor(d/3600.0), floor(fmod(d,3600.0)/60.0), fmod(d,60.0));
return buf;
}
inline std::string format_pct(double d)
{
char buf[256] = {0};
sprintf(buf, "%.2f", 100.0 * d);
return buf;
}
file: main.cpp
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include "C:\CUDA\include\cuda_runtime.h"
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
#include <cmath>
#include <map>
#include <algorithm>
#include <list>
#include "main.h"
int main()
{
unsigned numVals = 1000000;
int* gold = new int[numVals];
memset(gold, 0, sizeof(int)*numVals);
LARGE_INTEGER li = {0}, li2 = {0};
QueryPerformanceFrequency(&li);
__int64 freq = li.QuadPart;
// get cuda properties...
cudaDeviceProp cdp = {0};
cudaError_t err = cudaGetDeviceProperties(&cdp, 0);
cout << "Running on device: '" << cdp.name << "'" << endl;
// first run the reference implementation
QueryPerformanceCounter(&li);
for( int j6=0, n = 0; j6<10; j6++ )
{
for( int j5=0; j5<10; j5++ )
{
for( int j4=0; j4<10; j4++ )
{
for( int j3=0; j3<10; j3++ )
{
for( int j2=0; j2<10; j2++ )
{
for( int j1=0; j1<10; j1++ )
{
int s = j6 + j5 + j4 + j3 + j2 + j1;
gold[n + s] = 1;
n++;
}
}
}
}
}
}
QueryPerformanceCounter(&li2);
__int64 ticks = li2.QuadPart-li.QuadPart;
cout << "Reference Implementation Ran In " << ticks << " ticks" << " (" << format_elapsed((double)ticks/(double)freq) << ")" << endl;
// now run the cuda version...
unsigned threads = cdp.maxThreadsPerBlock;
unsigned blocks = numVals/threads;
if( numVals%threads ) ++blocks;
unsigned computeSlots = blocks * threads; // this may be != the number of vals since we want 32-thread warps
// allocate device memory for test
int* deviceTest = 0;
err = cudaMalloc(&deviceTest, sizeof(int)*computeSlots);
err = cudaMemset(deviceTest, 0, sizeof(int)*computeSlots);
int* hostTest = new int[numVals]; // the repository for the resulting data on the host
memset(hostTest, 0, sizeof(int)*numVals);
// run the CUDA code...
LARGE_INTEGER li3 = {0}, li4={0};
QueryPerformanceCounter(&li3);
ComputeSelfNumbers(make_sized_ptr(hostTest, numVals), make_sized_ptr(deviceTest, computeSlots), blocks, threads);
QueryPerformanceCounter(&li4);
__int64 ticksCuda = li4.QuadPart-li3.QuadPart;
cout << "CUDA Implementation Ran In " << ticksCuda << " ticks" << " (" << format_elapsed((double)ticksCuda/(double)freq) << ")" << endl;
cout << "Compute Slots: " << computeSlots << " (" << blocks << " blocks X " << threads << " threads)" << endl;
unsigned errorCount = 0;
for( size_t i = 0; i < numVals; ++i )
{
if( gold[i] != hostTest[i] )
{
++errorCount;
}
}
cout << "Number of Errors: " << errorCount << endl;
return 0;
}
file: self.cu
#pragma warning( disable : 4231)
#include <windows.h>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
#include "main.h"
__global__ void SelfNum(int * slots)
{
__shared__ int N;
N = (blockIdx.x * blockDim.x) + threadIdx.x;
const int numDigits = 10;
__shared__ int digits[numDigits];
for( int i = 0, temp = N; i < numDigits; ++i, temp /= 10 )
{
digits[numDigits-i-1] = temp - 10 * (temp/10) /*temp % 10*/;
}
__shared__ int s;
s = 0;
for( int i = 0; i < numDigits; ++i )
s += digits[i];
slots[N+s] = 1;
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMem, const unsigned blocks, const unsigned threads)
{
LARGE_INTEGER li = {0};
QueryPerformanceFrequency(&li);
double freq = (double)li.QuadPart;
LARGE_INTEGER liStart = {0};
QueryPerformanceCounter(&liStart);
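// run the kernel
// NOTE: the launch is asynchronous, so the timestamp taken right after it mostly
// measures launch overhead; the kernel's run time is folded into the memcpy figure,
// because cudaMemcpy waits for the kernel to finish before copying
SelfNum<<<blocks, threads>>>(deviceMem.first);
LARGE_INTEGER liKernel = {0};
QueryPerformanceCounter(&liKernel);
cudaMemcpy(hostMem.first, deviceMem.first, hostMem.second*sizeof(int), cudaMemcpyDeviceToHost); // don't copy the overflow - just throw it away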
LARGE_INTEGER liMemcpy = {0};
QueryPerformanceCounter(&liMemcpy);
// display performance stats
double e = double(liMemcpy.QuadPart - liStart.QuadPart)/freq,
eKernel = double(liKernel.QuadPart - liStart.QuadPart)/freq,
eMemcpy = double(liMemcpy.QuadPart - liKernel.QuadPart)/freq;
double pKernel = eKernel/e,
pMemcpy = eMemcpy/e;
cout << "Kernel Executed in " << format_elapsed(e) << " -- Breakdown: " << endl
<< " [kernel] : " << format_elapsed(eKernel) << " (" << format_pct(pKernel) << "%)" << endl
<< " [memcpy] : " << format_elapsed(eMemcpy) << " (" << format_pct(pMemcpy) << "%)" << endl;
}
UPDATE2:
I refactored my CUDA implementation to try to speed it up a bit. I did this by unrolling loops manually, fixing some questionable use of __shared__ memory which might have been an error, and getting rid of some redundancy.
The output of my new kernel is:
Reference Implementation Ran In 69610 ticks (5 ms)
Kernel Executed in 2 ms -- Breakdown:
[kernel] : 39 us (1.57%)
[memcpy] : 2 ms (98.43%)
CUDA Implementation Ran In 62970 ticks (4 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The only code I changed is the kernel itself, so that's all I will post here:
__global__ void SelfNum(int * slots)
{
int N = (blockIdx.x * blockDim.x) + threadIdx.x;
int s = 0;
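int temp = N;
// Sum the ten decimal digits of N (loop unrolled by hand). The divide is its own
// statement: reading temp and applying temp /= 10 inside a single expression is
// unsequenced, so the result would depend on the compiler.
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/; temp /= 10;
s += temp - 10 * (temp/10) /*temp % 10*/;
// NOTE: for the highest N, N+s lands past a buffer of blocks*threads ints;
// pad the allocation or pass the size in and guard this write
slots[N+s] = 1;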
}
I wonder if multi-threading would help. This algorithm looks like it would lend itself well to multi-threading. (Poor-man's test of this: create two copies of the program and run them at the same time. If the pair finishes in less than twice the time of a single run, multi-threading may help.)
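To make the multi-threading idea concrete, here is a minimal sketch using std::thread. It is not taken from any of the posted programs: the names mark_generated and count_self_numbers are made up for illustration, and it assumes base 10 and a positive limit. Each worker marks generated numbers in its own private buffer, and the buffers are merged at the end, so no two threads ever write to the same memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sum of the base-10 digits of n.
static int digit_sum(int n) {
    int s = 0;
    for (; n > 0; n /= 10) s += n % 10;
    return s;
}

// Hypothetical helper: flags[v] == 1 when some n < limit satisfies
// v == n + digit_sum(n). A zero flag (for 1 <= v < limit) means v is a self number.
std::vector<char> mark_generated(int limit, unsigned num_threads) {
    assert(limit > 0);
    num_threads = std::max(1u, num_threads);
    const std::size_t size = static_cast<std::size_t>(limit);

    // One private buffer per worker, so there are no concurrent writes.
    std::vector<std::vector<char>> local(num_threads, std::vector<char>(size));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&local, limit, num_threads, t] {
            // Worker t handles n = t, t + num_threads, t + 2 * num_threads, ...
            const int step = static_cast<int>(num_threads);
            for (int n = static_cast<int>(t); n < limit; n += step) {
                const int v = n + digit_sum(n);
                if (v < limit) local[t][static_cast<std::size_t>(v)] = 1;
            }
        });
    }
    for (auto& w : workers) w.join();

    // Merge the per-thread buffers into one result.
    std::vector<char> flags(size);
    for (const auto& buf : local)
        for (std::size_t v = 0; v < size; ++v)
            flags[v] |= buf[v];
    return flags;
}

// Hypothetical helper: number of self numbers v with 1 <= v < limit.
int count_self_numbers(int limit, unsigned num_threads) {
    const std::vector<char> flags = mark_generated(limit, num_threads);
    int count = 0;
    for (int v = 1; v < limit; ++v)
        if (!flags[static_cast<std::size_t>(v)]) ++count;
    return count;
}
```

The private buffers cost num_threads times the memory of a single sieve, and the merge is serial, so whether this beats the single-threaded loop for a range as small as 1,000,000 is something to measure rather than assume.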
I was actually surprised that the code below was faster than any other posted here. I probably measured it wrong, but maybe it helps; or at least is interesting.
#include <algorithm> // std::for_each
#include <cstring> // memset
#include <iostream>
#include <boost/progress.hpp>
class SelfCalc
{
private:
bool array[1000000];
int non_self;
public:
SelfCalc()
{
memset(&array, 0, sizeof(array));
}
void operator()(const int i)
{
if (!(array[i]))
std::cout << i << '\n';
non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
array[non_self] = true;
}
};
class IntIterator
{
private:
int value;
public:
IntIterator(const int _value):value(_value){}
int operator*(){ return value; }
bool operator!=(const IntIterator &v){ return value != v.value; }
int operator++(){ return ++value; }
};
int main()
{
boost::progress_timer t;
SelfCalc selfCalc;
IntIterator i(1), end(100000);
std::for_each(i, end, selfCalc);
std::cout << 100000 << std::endl;
return 0;
}
Fun problem. The problem as stated does not specify what base it must be in. I fiddled around with it some and wrote a base-2 version. It generates an extra few thousand entries because the termination point of 1,000,000 is not as natural with base-2. This pre-counts the number of bits in a byte for a table lookup. The generation of the result set (without the I/O) took 2.4 ms.
One interesting thing (assuming I wrote it correctly) is that the base-2 version has about 250,000 "self numbers" up to 1,000,000 while there are just under 100,000 base-10 self numbers in that range.
#include <windows.h>
#include <stdio.h>
#include <string.h>
void StartTimer( _int64 *pt1 )
{
QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}
double StopTimer( _int64 t1 )
{
_int64 t2, ldFreq;
QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&ldFreq );
return ((double)( t2 - t1 ) / (double)ldFreq) * 1000.0;
}
#define RANGE 1000000
char sn[0x100000 + 32];
int bitCount[256];
// precompute bitcounts for each byte
void PreCountBits()
{
int i;
// generate count of bits in each byte
memset( bitCount, 0, sizeof( bitCount ));
for ( i = 0; i < 256; i++ )
{
int tmp = i;
while ( tmp )
{
if ( tmp & 0x01 )
bitCount[i]++;
tmp >>= 1;
}
}
}
void GenBase2( )
{
int i;
int *b1, *b2, *b3;
int b1sum, b2sum, b3sum;
i = 0;
for ( b1 = bitCount; b1 < bitCount + 256; b1++ )
{
b1sum = *b1;
for ( b2 = bitCount; b2 < bitCount + 256; b2++ )
{
b2sum = b1sum + *b2;
for ( b3 = bitCount; b3 < bitCount + 256; b3++ )
{
sn[i++ + *b3 + b2sum] = 1;
}
}
// 1000000 does not provide a great termination number for base 2. So check
// here. Overshoots the target some but avoids repeated checks
if ( i > RANGE )
return;
}
}
int main( int argc, char* argv[] )
{
int i = 0;
__int64 t1;
memset( sn, 0, sizeof( sn ));
StartTimer( &t1 );
PreCountBits();
GenBase2();
printf( "Generation time = %.3f\n", StopTimer( t1 ));
#if 1
for ( i = 1; i <= RANGE; i++ )
if ( !sn[i] ) printf( "%d\n", i );
#endif
return 0;
}
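For anyone who wants to compare bases without the byte-table trick, here is a small sketch of the same sieve with the base as a parameter. The names digit_sum_in_base and self_numbers are made up for this example, and it makes no attempt to match the speed of the table-driven version above; it is only meant as a way to cross-check results for small ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of the digits of n written in the given base (base >= 2).
static int digit_sum_in_base(int n, int base) {
    int s = 0;
    for (; n > 0; n /= base) s += n % base;
    return s;
}

// Hypothetical helper: every self number v with 1 <= v <= limit in the given
// base, i.e. every v that is not n + digit_sum_in_base(n, base) for any n >= 1.
std::vector<int> self_numbers(int limit, int base) {
    assert(limit >= 1 && base >= 2);
    std::vector<char> generated(static_cast<std::size_t>(limit) + 1);
    for (int n = 1; n <= limit; ++n) {
        const int v = n + digit_sum_in_base(n, base);
        if (v <= limit) generated[static_cast<std::size_t>(v)] = 1;
    }
    std::vector<int> result;
    for (int v = 1; v <= limit; ++v)
        if (!generated[static_cast<std::size_t>(v)]) result.push_back(v);
    return result;
}
```

Calling self_numbers(20, 2) gives 1, 4, 6, 13, 15, 18, and self_numbers(100, 10) gives the 13 base-10 self numbers up to 100, which is a quick sanity check before comparing counts at 1,000,000.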
Maybe try just computing the recurrence relation defined below?
http://en.wikipedia.org/wiki/Self_number
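A sketch of what that could look like, assuming the recurrence meant here is the base-10 one from that page, C_1 = 9 and C_k = 8 * 10^(k-1) + C_(k-1) + 8. Note that it only produces a sparse sequence of self numbers (9, 97, 905, 8913, ...), not all of them, so it cannot replace the sieve for listing every self number below 1,000,000. The function names are made up for this example, and the brute-force checker is there so the recurrence output is verified rather than taken on trust.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: the first `count` terms of
// C_1 = 9, C_k = 8 * 10^(k-1) + C_(k-1) + 8.
std::vector<long long> recurrence_terms(int count) {
    assert(count >= 0 && count <= 18); // keeps 8 * 10^(k-1) inside long long
    std::vector<long long> terms;
    long long c = 9;
    long long pow10 = 1; // holds 10^(k-1) for the current k
    for (int k = 1; k <= count; ++k) {
        if (k > 1) {
            pow10 *= 10;
            c = 8 * pow10 + c + 8;
        }
        terms.push_back(c);
    }
    return terms;
}

// Brute-force check: v is a self number when no n < v has n + digitsum(n) == v.
// A long long has at most 19 digits, so digitsum(n) <= 9 * 19 and only the
// last 171 candidates below v need to be scanned.
bool is_self_number(long long v) {
    const long long window = 9 * 19;
    for (long long n = std::max(0LL, v - window); n < v; ++n) {
        long long s = 0;
        for (long long t = n; t > 0; t /= 10) s += t % 10;
        if (n + s == v) return false;
    }
    return true;
}
```

Running is_self_number over recurrence_terms(6) is a cheap way to confirm the terms before relying on them for anything.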