Is there a limitation for recursion? - c++

I am testing recursion; however, when I have an array with more than 150000 elements, a segmentation fault occurs. What can be the problem?
#include <iostream>
using namespace std;
void init ( float a[] , long int n );
float standard ( float a[] , long int i , long int n );
int main()
{
long int n = 1000000;
float *a = new float[n];
init ( a , n );
cout.precision ( 30 );
cout << "I got here." << endl;
cout << "Standard sum= " << standard ( a , 0 , n - 1 ) << endl;
delete [] a;
return 0;
}
void init ( float a[] , long int n )
{
for (long int i = 0 ; i < n ; i++ )
{
a[i] = 1. / ( i + 1. );
}
}
float standard ( float a[] , long int i , long int n )
{
if ( i <= n )
return a[i] + standard ( a , i + 1 , n );
return 0;
}

As an expansion of MicroVirus' correct answer, here is an example of a tail-recursive version of your algorithm:
float standard_recursion(float* a, long i, long n, float result) {
if(i > n)
return result;
return standard_recursion(a, i + 1, n, result + a[i]);
}
float standard(float* a, long i, long n ) {
return standard_recursion(a, i, n, 0);
}
This should run if the compiler performs tail-call optimization (I tested with g++ -O2). However, since this depends on a compiler optimization, I would recommend avoiding deep recursion entirely and opting for an iterative solution.
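For comparison, here is a minimal sketch of the iterative alternative. The name standard_iterative is made up for this illustration; it takes the same inclusive index range as standard and needs only constant stack space, whatever the array size.

```cpp
#include <cassert>

// Hypothetical iterative counterpart of standard(): sums a[first..last]
// (both bounds inclusive) in a loop, so stack usage does not grow with
// the number of elements.
float standard_iterative(const float* a, long first, long last)
{
    assert(a != nullptr);
    float result = 0.0f;
    for (long i = first; i <= last; ++i)
        result += a[i];
    return result;
}
```

Calling standard_iterative(a, 0, n - 1) in place of the recursive call in main should handle the 1000000-element array without depending on optimization flags.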

You are most likely running out of stack space in your recursive function standard, which recurses with a depth of n, and tail-call optimisation is probably not enabled here.
So, to answer the question in your title: Yes, there is a limit to recursion, and usually it's the available stack space.

You are probably out of memory on the heap. Also, if you have a 16-bit int, there could be a problem with the iteration count. It is better to use int32_t i instead of int i. The same goes for n.

Related

Can we add an integer to an array in c++

#include <bits/stdc++.h>
using namespace std;
/*Prototype for utility functions */
void printArray(int arr[], int size);
void swap(int arr[], int fi, int si, int d);
void leftRotate(int arr[], int d, int n)
{
/* Return If number of elements to be rotated
is zero or equal to array size */
if(d == 0 || d == n)
return;
/*If number of elements to be rotated
is exactly half of array size */
if(n - d == d)
{
swap(arr, 0, n - d, d);
return;
}
/* If A is shorter*/
if(d < n - d)
{
swap(arr, 0, n - d, d);
leftRotate(arr, d, n - d);
}
else /* If B is shorter*/
{
swap(arr, 0, d, n - d);
leftRotate(arr + n - d, 2 * d - n, d); /*This is tricky*/
}
}
/*UTILITY FUNCTIONS*/
/* function to print an array */
void printArray(int arr[], int size)
{
int i;
for(i = 0; i < size; i++)
cout << arr[i] << " ";
cout << endl;
}
/*This function swaps d elements starting at index fi
with d elements starting at index si */
void swap(int arr[], int fi, int si, int d)
{
int i, temp;
for(i = 0; i < d; i++)
{
temp = arr[fi + i];
arr[fi + i] = arr[si + i];
arr[si + i] = temp;
}
}
// Driver Code
int main()
{
int arr[] = {1, 2, 3, 4, 5, 6, 7};
leftRotate(arr, 2, 7);
printArray(arr, 7);
return 0;
}
// This code is contributed by Rath Bhupendra
I found this code on the GeeksforGeeks website. The code rotates the elements of an array. The website calls it the block swap algorithm. My questions are:
Can we add integers to an array in c++ as given in the else part of the left rotate function while passing the arguments (arr+n-d)?
How can we add integers to an array?
I tried adding an integer to an array in an online compiler and it didn't work. But the above code works perfectly, giving the desired output 3 4 5 6 7 1 2.
The link to the website is https://www.geeksforgeeks.org/block-swap-algorithm-for-array-rotation/.
Can we add integers to an array in c++ as given in the else part of the left rotate function while passing the arguments (arr+n-d)?
How can we add integers to an array?
The answer is you can't, and that's not what's happening here.
int arr[] argument decays to a pointer to the first element of the array. It's the same as having int* arr so what you are doing in arr + n - d is simple pointer arithmetic.
The pointer will be moved n - d positions relative to the position it's at before the expression is evaluated.
Suppose n - d evaluates to 4 and arr points to the beginning of the array passed as an argument, that is, to &arr[0] (in array notation) or arr + 0 (in pointer notation), which is its initial state. Then arr + n - d is arr + 4, or &arr[4]: the address of index 4 (the 5th element of the array). To access the value at that address you'd use *(arr + 4) or arr[4].
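To make the pointer arithmetic concrete, here is a small sketch. The helper sum_from is hypothetical (it is not part of the question's code); it only shows that passing arr + 4 hands the callee a pointer to the element at index 4, exactly like the arr + n - d argument in leftRotate.

```cpp
#include <cassert>

// Hypothetical helper: sums `count` elements starting at the element
// that p points to. Inside the function, p[i] means *(p + i).
int sum_from(const int* p, int count)
{
    assert(p != nullptr && count >= 0);
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += p[i];
    return total;
}
```

With int arr[] = {1, 2, 3, 4, 5, 6, 7}, the call sum_from(arr + 4, 3) adds up arr[4], arr[5] and arr[6]. Nothing is added to the array itself; only the starting address moves.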
On a side note, I wouldn't advise using GeeksforGeeks to learn C++, or any other language for that matter; this is better done by reading a good book.
A function parameter having an array type is adjusted by the compiler to a pointer to the array element type. That is, these two function declarations are equivalent and declare the same function.
void leftRotate(int arr[], int d, int n);
and
void leftRotate(int *arr, int d, int n);
You may even write, for example,
void leftRotate(int arr[100], int d, int n);
void leftRotate(int arr[10], int d, int n);
void leftRotate(int arr[1], int d, int n);
Again these declarations declare the function
void leftRotate(int *arr, int d, int n);
So within the function this expression
arr + n - d
uses pointer arithmetic applied to the pointer arr.
For example, the expression arr + 0 is equivalent to arr and points to the first element of the array. The expression arr + n points to the element at index n, that is, n positions past the first element.
Here is a demonstrative program that uses pointer arithmetic to output the elements of an array in a loop.
#include <iostream>
int main()
{
int a[] = { 1, 2, 3, 4, 5 };
for ( size_t i = 0; i < sizeof( a ) / sizeof( *a ); i++ )
{
std::cout << *( a + i ) << ' ';
}
std::cout << '\n';
return 0;
}
The program output is
1 2 3 4 5
In the expression *( a + i ) the array designator a is implicitly converted to pointer to its first element.
Here is one more demonstrative program that shows that a function parameter having an array type is adjusted by the compiler to pointer to the array element type.
#include <iostream>
#include <iomanip>
#include <type_traits>
const size_t N = 100;
void f( int a[N] )
{
std::cout << "\nin function\n";
std::cout << "sizeof( a ) = " << sizeof( a ) << '\n';
std::cout << "a is a pointer " << std::boolalpha <<std:: is_same<decltype( a ), int *>::value << '\n';
}
int main()
{
int a[N];
std::cout << "In main\n";
std::cout << "sizeof( a ) = " << sizeof( a ) << '\n';
std::cout << "a is an array " << std::boolalpha <<std:: is_same<decltype( a ), int [N]>::value << '\n';
f( a );
return 0;
}
The program output is
In main
sizeof( a ) = 400
a is an array true
in function
sizeof( a ) = 8
a is a pointer true

Converting an array of 2 digit numbers into an integer (C++)

Is it possible to take an array filled with 2 digit numbers e.g.
[10,11,12,13,...]
and multiply each element in the list by 100^(position in the array) and sum the result so that:
mysteryFunction[10,11,12] //The function performs 10*100^0 + 11*100^1 + 12*100^2
= 121110
and also
mysteryFunction[10,11,12,13]
= 13121110
when I do not know the number of elements in the array?
(yes, the reverse of order is intended but not 100% necessary, and just in case you missed it the first time the numbers will always be 2 digits)
Just for a bit of background to the problem: this is to try to improve my attempt at an RSA encryption program, at the moment I am multiplying each member of the array by 100^(the position of the number) written out each time which means that each word which I use to encrypt must be a certain length.
For example to encrypt "ab" I have converted it to an array [10,11] but need to convert it to 1110 before I can put it through the RSA algorithm. I would need to adjust my code for if I then wanted to use a three letter word, again for a four letter word etc. which I'm sure you will agree is not ideal. My code is nothing like industry standard but I am happy to upload it should anyone want to see it (I have also already managed this in Haskell if anyone would like to see that). I thought that the background information was necessary just so that I don't get hundreds of downvotes from people thinking that I'm trying to trick them into doing homework for me. Thank you very much for any help, I really do appreciate it!
EDIT: Thank you for all of the answers! They perfectly answer the question that I asked but I am having problems incorporating them into my current program, if I post my code so far would you be able to help? When I tried to include the answer provided I got an error message (I can't vote up because I don't have enough reputation, sorry that I haven't accepted any answers yet).
#include <iostream>
#include <string>
#include <math.h>
int returnVal (char x)
{
return (int) x;
}
unsigned long long modExp(unsigned long long b, unsigned long long e, unsigned long long m)
{
unsigned long long remainder;
int x = 1;
while (e != 0)
{
remainder = e % 2;
e= e/2;
if (remainder == 1)
x = (x * b) % m;
b= (b * b) % m;
}
return x;
}
int main()
{
unsigned long long p = 80001;
unsigned long long q = 70021;
int e = 7;
unsigned long long n = p * q;
std::string foo = "ab";
for (int i = 0; i < foo.length(); i++);
{
std::cout << modExp (returnVal((foo[0]) - 87) + returnVal (foo[1] -87) * 100, e, n);
}
}
If you want to use plain C-style arrays, you will have to separately know the number of entries. With this approach, your mysterious function might be defined like this:
unsigned mysteryFunction(unsigned numbers[], size_t n)
{
unsigned result = 0;
unsigned factor = 1;
for (size_t i = 0; i < n; ++i)
{
result += factor * numbers[i];
factor *= 100;
}
return result;
}
You can test this code with the following:
#include <iostream>
int main()
{
unsigned ar[] = {10, 11, 12, 13};
std::cout << mysteryFunction(ar, 4) << "\n";
return 0;
}
On the other hand, if you want to utilize the STL's vector class, you won't separately need the size. The code itself won't need too many changes.
Also note that the built-in integer types cannot handle very large numbers, so you might want to look into an arbitrary precision number library, like GMP.
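To illustrate the size limit, here is a sketch of an overflow-checked variant. The name pack_pairs and the bool-plus-output-parameter interface are choices made for this example, not something from the question. It reports failure instead of silently wrapping around once the value no longer fits in an unsigned long long.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical overflow-checked version: packs two-digit values into one
// integer, least significant pair first. Returns false (leaving `out`
// untouched) if the result would not fit in an unsigned long long.
bool pack_pairs(const unsigned* numbers, std::size_t n, unsigned long long& out)
{
    const unsigned long long max = std::numeric_limits<unsigned long long>::max();
    unsigned long long result = 0;
    unsigned long long factor = 1;
    for (std::size_t i = 0; i < n; ++i) {
        assert(numbers[i] < 100);
        // result + factor * numbers[i] must not exceed max.
        if (numbers[i] != 0 && factor > (max - result) / numbers[i])
            return false;
        result += factor * numbers[i];
        if (i + 1 < n) {
            if (factor > max / 100)
                return false; // the next factor would overflow
            factor *= 100;
        }
    }
    out = result;
    return true;
}
```

With a 64-bit unsigned long long, this accepts about ten two-digit entries before it starts returning false, which is why an arbitrary precision library becomes necessary for longer words.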
EDIT: Here's a version of the function which accepts a std::string and uses the characters' ASCII values minus 87 as the numbers:
unsigned mysteryFunction(const std::string& input)
{
unsigned result = 0;
unsigned factor = 1;
for (size_t i = 0; i < input.size(); ++i)
{
result += factor * (input[i] - 87);
factor *= 100;
}
return result;
}
The test code becomes:
#include <iostream>
#include <string>
int main()
{
std::string myString = "abcde";
std::cout << mysteryFunction(myString) << "\n";
return 0;
}
The program prints: 1413121110
As benedek mentioned, here's an implementation using dynamic arrays via std::vector.
#include <iostream>
#include <vector>
// Takes the vector by const reference to avoid copying it.
unsigned mystery(const std::vector<unsigned>& vect)
{
unsigned result = 0;
unsigned factor = 1;
for (unsigned item : vect)
{
result += factor * item;
factor *= 100;
}
return result;
}
int main() // main must return int in standard C++
{
std::vector<unsigned> ar;
ar.push_back(10);
ar.push_back(11);
ar.push_back(12);
ar.push_back(13);
std::cout << mystery(ar) << "\n";
}
I would like to suggest the following solutions.
You could use the standard algorithm std::accumulate, declared in the header <numeric>.
For example:
#include <iostream>
#include <iterator>
#include <numeric>
int main()
{
unsigned int a[] = { 10, 11, 12, 13 };
unsigned long long i = 1;
unsigned long long s =
std::accumulate( std::begin( a ), std::end( a ), 0ull,
[&]( unsigned long long acc, unsigned int x )
{
return ( acc += x * i, i *= 100, acc );
} );
std::cout << "s = " << s << std::endl;
return 0;
}
The output is
s = 13121110
The same can be done using a range-based for statement:
#include <iostream>
#include <numeric>
int main()
{
unsigned int a[] = { 10, 11, 12, 13 };
unsigned long long i = 1;
unsigned long long s = 0;
for ( unsigned int x : a )
{
s += x * i; i *= 100;
}
std::cout << "s = " << s << std::endl;
return 0;
}
You could also write a separate function
unsigned long long mysteryFunction( const unsigned int a[], size_t n )
{
unsigned long long s = 0;
unsigned long long i = 1;
for ( size_t k = 0; k < n; k++ )
{
s += a[k] * i; i *= 100;
}
return s;
}
Also think about using std::string instead of integral numbers to keep an encrypted result.
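As a rough sketch of the std::string idea: since every entry has exactly two digits, the decimal representation of the result is just the entries written out in reverse order. The function name pack_pairs_as_string is invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical string-based variant: concatenates the two-digit entries
// from last to first, which yields the decimal digits of
// a[0]*100^0 + a[1]*100^1 + ... without any integer overflow.
std::string pack_pairs_as_string(const unsigned* numbers, std::size_t n)
{
    std::string digits;
    for (std::size_t i = n; i-- > 0; ) {
        assert(numbers[i] >= 10 && numbers[i] < 100); // exactly two digits
        digits += std::to_string(numbers[i]);
    }
    return digits;
}
```

The string can grow to any length, but it still has to be converted to a big-integer type before the modular arithmetic of RSA can be applied to it.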

Counting number of digits in an integer through recursion

My code is following:
// counting number of digits in an integer
#include <iostream>
using namespace std;
int countNum(int n,int d){
if(n==0)
return d;
else
return (n/10,d++);
}
int main(){
int n;
int d;
cout<<"Enter number"<<endl;
cin>>n;
int x=countNum();
cout<<x;
return 0;
}
I cannot figure out the error. It says:
: too few arguments to function `int countNum(int, int)'
What is the issue?
Because you declared the function to take two arguments:
int countNum(int n,int d){
and you are passing none in:
int x = countNum();
You probably meant to call it like this, instead:
int x = countNum(n, d);
Also this:
return (n/10,d++);
should probably be this:
return countNum(n/10, d+1); // d+1, not d++: post-increment would pass the old value of d
Also you are not initializing your n and d variables:
int n;
int d;
Finally you don't need the d argument at all. Here's a better version:
int countNum(int n){
return (n >= 10)
? 1 + countNum(n/10)
: 1;
}
int x=countNum(); The caller must pass actual arguments to the function it calls. You have defined countNum(int, int), which means it receives two ints as arguments, so the caller has to supply them; they are missing in your case. That's the reason for the "too few arguments" error.
Your code here:
int x=countNum();
countNum needs to be called with two integers. eg
int x=countNum(n, d);
Because you haven't passed parameters to the countNum function. Use it like int x=countNum(n,d);
Assuming this is not for an assignment, there are better ways to do this (just a couple of examples):
Convert to string
unsigned int count_digits(unsigned int n)
{
std::string sValue = std::to_string(n);
return sValue.length();
}
Loop
unsigned int count_digits(unsigned int n)
{
unsigned int cnt = 1;
if (n > 0)
{
for (n = n/10; n > 0; n /= 10, ++cnt);
}
return cnt;
}
Tail End Recursion
unsigned int count_digits(unsigned int n, unsigned int cnt = 1)
{
if (n < 10)
return cnt;
else
return count_digits(n / 10, cnt + 1);
}
Note: With tail-end recursion optimizations turned on, your compiler will transform this into a loop for you - preventing the unnecessary flooding of the call stack.
Change it to:
int x=countNum(n,0);
You don't need to pass d in, you can just pass 0 as the seed.
Also change countNum to this:
int countNum(int n,int d){
if(n==0)
return d;
else
return countNum(n/10,d+1); // NOTE: This is the recursive bit!
}
#include <iostream>
using namespace std;
int countNum(int n,int d){
if(n<10)
return d;
else
return countNum(n/10, d+1);
}
int main(){
int n;
cout<<"Enter number"<<endl;
cin>>n;
int x=countNum(n, 1);
cout<<x;
return 0;
}
Your function is written incorrectly. For example it is not clear why it has two parameters or where it calls recursively itself.
I would write it the following way
int countNum( int n )
{
return 1 + ( ( n /= 10 ) ? countNum( n ) : 0 );
}
Or, even better, define it as
constexpr int countNum( int n )
{
return 1 + ( ( n / 10 ) ? countNum( n/10 ) : 0 );
}
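One practical benefit of the constexpr form is that the function can be checked at compile time. The sketch below repeats the definition so that it compiles on its own; the static_assert lines are an addition for illustration.

```cpp
// Same definition as above, repeated so the snippet is self-contained.
constexpr int countNum(int n)
{
    return 1 + ((n / 10) ? countNum(n / 10) : 0);
}

// These are evaluated by the compiler; a wrong digit count would be a
// compile error rather than a runtime surprise.
static_assert(countNum(0) == 1, "0 has one digit");
static_assert(countNum(9) == 1, "9 has one digit");
static_assert(countNum(10) == 2, "10 has two digits");
static_assert(countNum(12345) == 5, "12345 has five digits");
```

Note that the sign is ignored: countNum(-12) also yields 2, because integer division truncates toward zero.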

Find n-th root of all numbers within an interval

I have a program which must print perfect square roots of all integer numbers within an interval. Now I want to do the same for the n-th root.
Here's what I've done, but I'm stuck at fmod.
#include <iostream>
#include <math.h>
using namespace std;
int nroot(int, int);
int main()
{
int p, min, max,i;
double q;
cout << "enter min and max of the interval \n";
cin >> min;
cin >> max;
cout << "\n Enter the n-th root \n";
cin >> p;
i = min;
while (i <= max)
{
if (fmod((nroot(i, p)), 1.0) == 0)
{
cout << nroot(i, p);
}
i++;
}
return 0;
}
int nroot (int i, int p){
float q;
q = (pow(i, (1.0 / p)));
return q;
}
You may want to tackle this in the opposite direction. Rather than taking the nth root of every value in the interval to see if the nth root is an integer, instead take the nth root of the bounds of the interval, and step in terms of the roots:
// Assume 'min' and 'max' set as above in your original program.
// Assume 'p' holds which root we're taking (ie. p = 3 means cube root)
int min_root = int( floor( pow( min, 1. / p ) ) );
int max_root = int( ceil ( pow( max, 1. / p ) ) );
for (int root = min_root; root <= max_root; root++)
{
int raised = int( pow( root, p ) );
if (raised >= min && raised <= max)
cout << root << endl;
}
The additional test inside the for loop is to handle cases where min or max land directly on a root, or just to the side of a root.
You can remove the test and computation from the loop by recognizing that raised is only needed at the boundaries of the loop. This version, while slightly more complex looking, implements that observation:
// Assume 'min' and 'max' set as above in your original program.
// Assume 'p' holds which root we're taking (ie. p = 3 means cube root)
int min_root = int( floor( pow( min, 1. / p ) ) );
int max_root = int( ceil ( pow( max, 1. / p ) ) );
if ( int( pow( min_root, p ) ) < min )
min_root++;
if ( int( pow( max_root, p ) ) > max )
max_root--;
for (int root = min_root; root <= max_root; root++)
cout << root << endl;
If you're really concerned about performance (which I suspect you are not in this case), you can replace int( pow( ..., p ) ) with code that computes the nth power entirely with integer arithmetic. That seems like overkill, though.
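As a sketch of the integer-only variant mentioned above (the names ipow_capped and roots_in_interval are invented here, and non-negative bounds are assumed), the power can be computed by repeated multiplication with an early exit as soon as it exceeds the upper bound. This also sidesteps any rounding concerns with pow.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical helper: base^exp by repeated multiplication. Returns
// cap + 1 as soon as the power would exceed cap, so it never overflows.
long long ipow_capped(long long base, int exp, long long cap)
{
    assert(base >= 0 && exp >= 1 && cap >= 0 && cap < LLONG_MAX);
    long long result = 1;
    for (int k = 0; k < exp; ++k) {
        if (base != 0 && result > cap / base)
            return cap + 1; // result * base would exceed cap
        result *= base;
    }
    return result;
}

// Returns every integer r >= 0 whose p-th power lies in [min, max],
// stepping through candidate roots instead of through the interval.
std::vector<long long> roots_in_interval(long long min, long long max, int p)
{
    assert(0 <= min && min <= max && p >= 1);
    std::vector<long long> roots;
    for (long long r = 0; ; ++r) {
        const long long raised = ipow_capped(r, p, max);
        if (raised > max)
            break; // all larger roots are out of range too
        if (raised >= min)
            roots.push_back(r);
    }
    return roots;
}
```

For example, roots_in_interval(8, 1000, 3) yields the cube roots 2 through 10. The loop starts from 0 for simplicity; starting from the floor estimate used above would save iterations for large values of min.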
An exact equality test for floating-point numbers might not work as you expect. It's better to compare the difference against some small number:
float t = nroot(i, p);
if (fabs(t - rintf(t)) <= 0.00000001)
{
cout << t << endl;
}
Even in this case you aren't guaranteed to get correct results for all values of min, max and p. It all depends on this small number and on the precision with which you represent the numbers. You might consider wider floating-point types like double and long double.
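One way to avoid choosing a tolerance at all is to use the floating-point root only as a guess and then confirm it with integer arithmetic. The following is a sketch under the assumption that the values are non-negative and small enough to be represented exactly in a double; the function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true if base^p == value, computed with integers
// only and without overflow (it stops once the power would pass value).
bool power_equals(long long base, int p, long long value)
{
    long long result = 1;
    for (int k = 0; k < p; ++k) {
        if (base != 0 && result > value / base)
            return false; // result * base would exceed value
        result *= base;
    }
    return result == value;
}

// Returns r if r^p == value exactly, otherwise -1. The rounded pow()
// result is only a guess; its neighbours are checked as well in case
// the floating-point root is slightly off.
long long exact_nth_root(long long value, int p)
{
    assert(value >= 0 && value <= (1LL << 53) && p >= 1);
    const long long guess =
        std::llround(std::pow(static_cast<double>(value), 1.0 / p));
    for (long long r = (guess > 0 ? guess - 1 : 0); r <= guess + 1; ++r)
        if (power_equals(r, p, value))
            return r;
    return -1;
}
```

In the loop from the question, the test would then become exact_nth_root(i, p) != -1, with no epsilon involved.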

Self numbers in c++

Hey, my friends and I are trying to beat each other's runtimes for generating "Self Numbers" between 1 and a million. I've written mine in c++ and I'm still trying to shave off precious time.
Here's what I have so far,
#include <iostream>
using namespace std;
bool v[1000000 + 54]; // i + digit sum can reach 999999 + 54, so leave room for it
int main(void) {
long non_self = 0;
for(long i = 1; i < 1000000; ++i) {
if(!(v[i])) std::cout << i << '\n';
non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
v[non_self] = 1;
}
std::cout << "1000000" << '\n';
return 0;
}
The code works fine now, I just want to optimize it.
Any tips? Thanks.
I built an alternate C solution that doesn't require any modulo or division operations:
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
int v[1100000];
int j1, j2, j3, j4, j5, j6, s, n=0;
memset(v, 0, sizeof(v));
for (j6=0; j6<10; j6++) {
for (j5=0; j5<10; j5++) {
for (j4=0; j4<10; j4++) {
for (j3=0; j3<10; j3++) {
for (j2=0; j2<10; j2++) {
for (j1=0; j1<10; j1++) {
s = j6 + j5 + j4 + j3 + j2 + j1;
v[n + s] = 1;
n++;
}
}
}
}
}
}
for (n=1; n<=1000000; n++) {
if (!v[n]) printf("%6d\n", n);
}
}
It generates 97786 self numbers including 1 and 1000000.
With output, it takes
real 0m1.419s
user 0m0.060s
sys 0m0.152s
When I redirect output to /dev/null, it takes
real 0m0.030s
user 0m0.024s
sys 0m0.004s
on my 3 GHz quad-core rig.
For comparison, your version produces the same number of numbers, so I assume we're either both correct or equally wrong; but your version chews up
real 0m0.064s
user 0m0.060s
sys 0m0.000s
under the same conditions, or about 2x as much.
That, or the fact that you're using longs, which is unnecessary on my machine. Here, int goes up to 2 billion. Maybe you should check INT_MAX on yours?
Update
I had a hunch that it may be better to calculate the sum piecewise. Here's my new code:
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
char v[1100000];
int j1, j2, j3, j4, j5, j6, s, n=0;
int s1, s2, s3, s4, s5;
memset(v, 0, sizeof(v));
for (j6=0; j6<10; j6++) {
for (j5=0; j5<10; j5++) {
s5 = j6 + j5;
for (j4=0; j4<10; j4++) {
s4 = s5 + j4;
for (j3=0; j3<10; j3++) {
s3 = s4 + j3;
for (j2=0; j2<10; j2++) {
s2 = s3 + j2;
for (j1=0; j1<10; j1++) {
v[s2 + j1 + n++] = 1;
}
}
}
}
}
}
for (n=1; n<=1000000; n++) {
if (!v[n]) printf("%d\n", n);
}
}
...and what do you know, that brought down the time for the top loop from 12 ms to 4 ms. Or maybe 8, my clock seems to be getting a bit jittery way down there.
State of affairs, Summary
The actual finding of self numbers up to 1M is now taking roughly 4 ms, and I'm having trouble measuring any further improvements. On the other hand, as long as output is to the console, it will continue to take about 1.4 seconds, my best efforts to leverage buffering notwithstanding. The I/O time so drastically dwarfs computation time that any further optimization would be essentially futile. Thus, although inspired by further comments, I've decided to leave well enough alone.
All times cited are on my (pretty fast) machine and are for comparison purposes with each other only. Your mileage may vary.
Generate the numbers once, copy the output into your code as a gigantic string. Print the string.
Those mods (%) look expensive. If you are allowed to move to base 16 (or even base 2), then you can probably code this a lot faster. If you have to stay in decimal, try creating an array of digits for each place (units, tens, hundreds) and build some rollover code. That will make summing the digits far easier.
Alternatively, you could recognise the behaviour of the core self function (let's call it s):
s = n + f(b,n)
where f(b,n) is the sum of the digits of the number n in base b.
For base 10, as the ones (least significant) digit moves through 0, 1, 2, ..., 9, n and f(b,n) proceed in lockstep as you move from n to n+1. Only in the 10% of cases where the 9 rolls over to 0 do they not, so:
f(b,n+1) = f(b,n) + 1 // 90% of the time
thus the core self function s advances as
n+1 + f(b,n+1) = n + 1 + f(b,n) + 1 = n + f(b,n) + 2
s(n+1) = s(n) + 2 // again, 90% of the time
In the remaining (and easily identifiable) 10% of the time, the 9 rolls back to zero and adds one to the next digit, which in the simplest case subtracts (9-1) = 8 from the digit sum, but might cascade up through a series of 9s, subtracting 2*9-1 = 17, 3*9-1 = 26, etc.
So the first optimisation can remove most of the work from 90% of your cycles!
if ((n % 10) != 0)
{
n + f(b,n) = n-1 + f(b,n-1) + 2;
}
or
if ((n % 10) != 0)
{
s = old_s + 2;
}
That should be enough to substantially increase your performance without really changing your algorithm.
If you want more, then work out a simple algorithm for the change between iterations for the remaining 10%.
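Putting the two cases together, a possible sketch of the incremental update looks like this. The function name mark_generated and the slack of 100 extra slots are assumptions of this example; the rollover case counts the trailing zeros of n to find out how many 9s cascaded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: marks every value n + digitsum(n) for n in [1, limit], updating
// s = n + digitsum(n) incrementally instead of recomputing the digit sum.
std::vector<bool> mark_generated(long limit)
{
    assert(limit >= 1 && limit <= 1000000000L); // digit sum stays below 100
    std::vector<bool> generated(static_cast<std::size_t>(limit) + 100, false);
    long s = 0; // s for n = 0
    for (long n = 1; n <= limit; ++n) {
        if (n % 10 != 0) {
            s += 2; // no rollover: n and its digit sum each grew by 1
        } else {
            // k trailing 9s of n-1 rolled over to 0s: n grew by 1 and
            // the digit sum changed by 1 - 9*k.
            long k = 0;
            for (long t = n; t % 10 == 0; t /= 10)
                ++k;
            s += 2 - 9 * k;
        }
        generated[static_cast<std::size_t>(s)] = true;
    }
    return generated;
}
```

Every index from 1 to limit that is still false afterwards is a self number. The modulo is still executed once per iteration here; replacing it with a countdown counter would remove it as well.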
If you want your output to be fast, it may be worth investigating replacing iostream output with plain old printf() - depends on the rules for winning the competition whether this is important.
Multithread (use different arrays/ranges for every thread). Also, don't use more threads than your number of CPU cores =)
cout or printf within a loop will be slow. If you can remove any prints from a loop, you will see a significant performance increase.
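A minimal sketch of what removing the prints from the loop could look like: format everything into one std::string first, then write it with a single call. The helper names join_lines and write_all are invented for this example, and the reserve size is only a guess at the average line length.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: formats all values into one buffer, one per line.
std::string join_lines(const std::vector<unsigned>& values)
{
    std::string out;
    out.reserve(values.size() * 8); // rough estimate, avoids most regrowth
    for (unsigned v : values) {
        out += std::to_string(v);
        out += '\n';
    }
    return out;
}

// Writes the whole buffer to stdout with a single library call.
void write_all(const std::string& text)
{
    std::fwrite(text.data(), 1, text.size(), stdout);
}
```

Whether this helps much depends on where the output goes; a terminal will likely remain the slowest part regardless of how the text is produced.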
Since the range is limited (1 to 1000000) the maximum sum of the digits does not exceed 9*6 = 54. This means that to implement the sieve a circular buffer of 54 elements should be perfectly sufficient (and the size of the sieve grows very slowly as the range increases).
You already have a sieve-based solution, but it is based on pre-building the full-length buffer (sieve of 1000000 elements), which is rather inelegant (if not completely unacceptable). The performance of your solution also suffers from non-locality of memory access.
For example, this is a possible very simple implementation
#include <stdio.h> /* printf */
#define N 1000000U
void print_self_numbers(void)
{
#define NMARKS 64U /* make it 64 just in case (and to make division work faster :) */
unsigned char marks[NMARKS] = { 0 };
unsigned i, imark;
for (i = 1, imark = i; i <= N; ++i, imark = (imark + 1) % NMARKS)
{
unsigned digits, sum;
if (!marks[imark])
printf("%u ", i);
else
marks[imark] = 0;
sum = i;
for (digits = i; digits > 0; digits /= 10)
sum += digits % 10;
marks[sum % NMARKS] = 1;
}
}
(I'm not going for the best possible performance in terms of CPU clocks here, just illustrating the key idea with the circular buffer.)
Of course, the range can easily be turned into a parameter of the function, while the size of the circular buffer can easily be calculated at run-time from the range.
As for "optimizations"... There's no point in trying to optimize the code that contains I/O operations. You won't achieve anything by such optimizations. If you want to analyze the performance of the algorithm itself, you'll have to put the generated numbers into an output array and print them later.
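A sketch of that separation, using a plain full-size sieve for simplicity: the numbers are collected in a vector and only the computation is timed. The function names are made up for this example, and std::chrono::steady_clock is just one reasonable choice of timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Collects the self numbers in [1, limit] without doing any I/O.
std::vector<unsigned> self_numbers_up_to(unsigned limit)
{
    assert(limit <= 1000000000u); // keeps n + digit sum within limit + 100
    std::vector<bool> generated(static_cast<std::size_t>(limit) + 100, false);
    for (unsigned n = 1; n <= limit; ++n) {
        unsigned sum = n;
        for (unsigned d = n; d > 0; d /= 10)
            sum += d % 10;
        generated[sum] = true; // sum has a generator, so it is not a self number
    }
    std::vector<unsigned> result;
    for (unsigned n = 1; n <= limit; ++n)
        if (!generated[n])
            result.push_back(n);
    return result;
}

// Times only the computation and returns the elapsed milliseconds;
// printing `out` is left to the caller, outside the measurement.
double time_self_numbers_ms(unsigned limit, std::vector<unsigned>& out)
{
    const auto start = std::chrono::steady_clock::now();
    out = self_numbers_up_to(limit);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Measured this way, differences between algorithms are no longer hidden behind the cost of writing to the console.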
For such a simple task, the best option would be to think of alternative algorithms to produce the same result. %10 is not usually considered a fast operation.
Why not use the recurrence relation given on the Wikipedia page instead?
That should be blazingly fast.
EDIT: Ignore this. The recurrence relation generates some but not all of the self numbers.
In fact, only very few of them. That's not particularly clear from the Wikipedia page, though :(
This may help speed up C++ iostreams output:
cin.tie(0);
ios::sync_with_stdio(false);
Put them in main before you start writing to cout.
I created a CUDA-based solution based on Carl Smotricz's second algorithm. The code to identify Self Numbers itself is extremely fast: on my machine it executes in ~45 microseconds, which is about 150x faster than Carl Smotricz's algorithm, which ran in 7 milliseconds on my machine.
There is a bottleneck, however, and that seems to be the PCIe interface. It took my code a whopping 43 milliseconds to move the computed data from the graphics card back to RAM. This might be optimizable, and I will look into this.
Still, 45 microseconds is pretty darn fast. Scary fast, actually, and I added code to my program which runs Carl Smotricz's algorithm and compares the results for accuracy. The results are accurate. Here is the program output (compiled in VS2008 64-bit, Windows 7):
UPDATE
I recompiled this code in release mode with full optimization and using static runtime libraries, with significant results. The optimizer seems to have done very well with Carl's algorithm, reducing the runtime from 7 ms to 1 ms. The CUDA implementation sped up as well, from 35 us to 20 us. The memory copy from video card to RAM was unaffected.
Program Output:
Running on device: 'Quadro NVS 295'
Reference Implementation Ran In 15603 ticks (7 ms)
Kernel Executed in 40 ms -- Breakdown:
[kernel] : 35 us (0.09%)
[memcpy] : 40 ms (99.91%)
CUDA Implementation Ran In 111889 ticks (51 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The code is as follows:
file : main.h
#pragma once
#include <cmath> // floor, fmod
#include <cstdio> // sprintf
#include <cstdlib>
#include <string>
#include <utility> // std::pair, std::make_pair
typedef std::pair<int*, size_t> sized_ptr;
static sized_ptr make_sized_ptr(int* ptr, size_t size)
{
return std::make_pair(ptr, size);
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMemory, unsigned const blocks, unsigned const threads);
inline std::string format_elapsed(double d)
{
char buf[256] = {0};
if( d < 0.00000001 )
{
// show in ps with 4 digits
sprintf(buf, "%0.4f ps", d * 1000000000000.0);
}
else if( d < 0.00001 )
{
// show in ns
sprintf(buf, "%0.0f ns", d * 1000000000.0);
}
else if( d < 0.001 )
{
// show in us
sprintf(buf, "%0.0f us", d * 1000000.0);
}
else if( d < 0.1 )
{
// show in ms
sprintf(buf, "%0.0f ms", d * 1000.0);
}
else if( d <= 60.0 )
{
// show in seconds
sprintf(buf, "%0.2f s", d);
}
else if( d < 3600.0 )
{
// show in min:sec
sprintf(buf, "%01.0f:%02.2f", floor(d/60.0), fmod(d,60.0));
}
// show in h:min:sec
else
sprintf(buf, "%01.0f:%02.0f:%02.2f", floor(d/3600.0), floor(fmod(d,3600.0)/60.0), fmod(d,60.0));
return buf;
}
inline std::string format_pct(double d)
{
char buf[256] = {0};
sprintf(buf, "%.2f", 100.0 * d);
return buf;
}
file: main.cpp
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include "C:\CUDA\include\cuda_runtime.h"
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
#include <cmath>
#include <map>
#include <algorithm>
#include <list>
#include "main.h"
int main()
{
unsigned numVals = 1000000;
int* gold = new int[numVals + 54]; // n + digit sum can reach (numVals - 1) + 54
memset(gold, 0, sizeof(int)*(numVals + 54));
LARGE_INTEGER li = {0}, li2 = {0};
QueryPerformanceFrequency(&li);
__int64 freq = li.QuadPart;
// get cuda properties...
cudaDeviceProp cdp = {0};
cudaError_t err = cudaGetDeviceProperties(&cdp, 0);
cout << "Running on device: '" << cdp.name << "'" << endl;
// first run the reference implementation
QueryPerformanceCounter(&li);
for( int j6=0, n = 0; j6<10; j6++ )
{
for( int j5=0; j5<10; j5++ )
{
for( int j4=0; j4<10; j4++ )
{
for( int j3=0; j3<10; j3++ )
{
for( int j2=0; j2<10; j2++ )
{
for( int j1=0; j1<10; j1++ )
{
int s = j6 + j5 + j4 + j3 + j2 + j1;
gold[n + s] = 1;
n++;
}
}
}
}
}
}
QueryPerformanceCounter(&li2);
__int64 ticks = li2.QuadPart-li.QuadPart;
cout << "Reference Implementation Ran In " << ticks << " ticks" << " (" << format_elapsed((double)ticks/(double)freq) << ")" << endl;
// now run the cuda version...
unsigned threads = cdp.maxThreadsPerBlock;
unsigned blocks = numVals/threads;
if( numVals%threads ) ++blocks;
unsigned computeSlots = blocks * threads; // this may be != the number of vals since we want 32-thread warps
// allocate device memory for test
int* deviceTest = 0;
err = cudaMalloc(&deviceTest, sizeof(int)*computeSlots);
err = cudaMemset(deviceTest, 0, sizeof(int)*computeSlots);
int* hostTest = new int[numVals]; // the repository for the resulting data on the host
memset(hostTest, 0, sizeof(int)*numVals);
// run the CUDA code...
LARGE_INTEGER li3 = {0}, li4={0};
QueryPerformanceCounter(&li3);
ComputeSelfNumbers(make_sized_ptr(hostTest, numVals), make_sized_ptr(deviceTest, computeSlots), blocks, threads);
QueryPerformanceCounter(&li4);
__int64 ticksCuda = li4.QuadPart-li3.QuadPart;
cout << "CUDA Implementation Ran In " << ticksCuda << " ticks" << " (" << format_elapsed((double)ticksCuda/(double)freq) << ")" << endl;
cout << "Compute Slots: " << computeSlots << " (" << blocks << " blocks X " << threads << " threads)" << endl;
unsigned errorCount = 0;
for( size_t i = 0; i < numVals; ++i )
{
if( gold[i] != hostTest[i] )
{
++errorCount;
}
}
cout << "Number of Errors: " << errorCount << endl;
return 0;
}
file: self.cu
#pragma warning( disable : 4231)
#include <windows.h>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
#include "main.h"
__global__ void SelfNum(int * slots)
{
__shared__ int N;
N = (blockIdx.x * blockDim.x) + threadIdx.x;
const int numDigits = 10;
__shared__ int digits[numDigits];
for( int i = 0, temp = N; i < numDigits; ++i, temp /= 10 )
{
digits[numDigits-i-1] = temp - 10 * (temp/10) /*temp % 10*/;
}
__shared__ int s;
s = 0;
for( int i = 0; i < numDigits; ++i )
s += digits[i];
slots[N+s] = 1;
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMem, const unsigned blocks, const unsigned threads)
{
LARGE_INTEGER li = {0};
QueryPerformanceFrequency(&li);
double freq = (double)li.QuadPart;
LARGE_INTEGER liStart = {0};
QueryPerformanceCounter(&liStart);
// run the kernel
SelfNum<<<blocks, threads>>>(deviceMem.first);
LARGE_INTEGER liKernel = {0};
QueryPerformanceCounter(&liKernel);
cudaMemcpy(hostMem.first, deviceMem.first, hostMem.second*sizeof(int), cudaMemcpyDeviceToHost); // dont copy the overflow - just throw it away
LARGE_INTEGER liMemcpy = {0};
QueryPerformanceCounter(&liMemcpy);
// display performance stats
double e = double(liMemcpy.QuadPart - liStart.QuadPart)/freq,
eKernel = double(liKernel.QuadPart - liStart.QuadPart)/freq,
eMemcpy = double(liMemcpy.QuadPart - liKernel.QuadPart)/freq;
double pKernel = eKernel/e,
pMemcpy = eMemcpy/e;
cout << "Kernel Executed in " << format_elapsed(e) << " -- Breakdown: " << endl
<< " [kernel] : " << format_elapsed(eKernel) << " (" << format_pct(pKernel) << "%)" << endl
<< " [memcpy] : " << format_elapsed(eMemcpy) << " (" << format_pct(pMemcpy) << "%)" << endl;
}
UPDATE2:
I refactored my CUDA implementation to try to speed it up a bit. I did this by unrolling loops manually, fixing some questionable use of __shared__ memory which might have been an error, and getting rid of some redundancy.
The output of my new kernel is:
Reference Implementation Ran In 69610 ticks (5 ms)
Kernel Executed in 2 ms -- Breakdown:
[kernel] : 39 us (1.57%)
[memcpy] : 2 ms (98.43%)
CUDA Implementation Ran In 62970 ticks (4 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The only code I changed is the kernel itself, so that's all I will post here:
__global__ void SelfNum(int * slots)
{
int N = (blockIdx.x * blockDim.x) + threadIdx.x;
int s = 0;
int temp = N;
s += temp - 10 * (temp/10) /*temp % 10*/;
// Divide first, then extract the digit: reading and modifying temp in one expression is unsequenced.
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
temp /= 10; s += temp - 10 * (temp/10) /*temp % 10*/;
slots[N+s] = 1;
}
I wonder if multi-threading would help; this algorithm looks like it would lend itself well to it. (Poor man's test: start two copies of the program at the same time. If both finish in less than 200% of the time a single copy takes, multi-threading may help.)
I was actually surprised that the code below was faster than any other posted here. I probably measured it wrong, but maybe it helps, or at least is interesting.
#include <algorithm> // std::for_each
#include <cstring> // memset
#include <iostream>
#include <boost/progress.hpp>
class SelfCalc
{
private:
bool array[1000000];
int non_self;
public:
SelfCalc()
{
memset(&array, 0, sizeof(array));
}
void operator()(const int i)
{
if (!(array[i]))
std::cout << i << '\n';
non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
array[non_self] = true;
}
};
class IntIterator
{
private:
int value;
public:
IntIterator(const int _value):value(_value){}
int operator*(){ return value; }
bool operator!=(const IntIterator &v){ return value != v.value; }
int operator++(){ return ++value; }
};
int main()
{
boost::progress_timer t;
SelfCalc selfCalc;
IntIterator i(1), end(100000);
std::for_each(i, end, selfCalc);
std::cout << 100000 << std::endl;
return 0;
}
Fun problem. The problem as stated does not specify which base to use, so I fiddled around with it and wrote a base-2 version. It generates a few thousand extra entries because the termination point of 1,000,000 is not as natural in base 2. It precomputes the number of set bits for every byte value so that the digit sum becomes a table lookup. Generating the result set (without the I/O) took 2.4 ms.
One interesting thing (assuming I wrote it correctly) is that the base-2 version has about 250,000 "self numbers" up to 1,000,000 while there are just under 100,000 base-10 self numbers in that range.
#include <windows.h>
#include <stdio.h>
#include <string.h>
void StartTimer( __int64 *pt1 )
{
QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}
double StopTimer( __int64 t1 )
{
__int64 t2, ldFreq;
QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&ldFreq );
return ((double)( t2 - t1 ) / (double)ldFreq) * 1000.0;
}
#define RANGE 1000000
char sn[0x100000 + 32];
int bitCount[256];
// precompute bitcounts for each byte
void PreCountBits()
{
int i;
// generate count of bits in each byte
memset( bitCount, 0, sizeof( bitCount ));
for ( i = 0; i < 256; i++ )
{
int tmp = i;
while ( tmp )
{
if ( tmp & 0x01 )
bitCount[i]++;
tmp >>= 1;
}
}
}
void GenBase2( )
{
int i;
int *b1, *b2, *b3;
int b1sum, b2sum, b3sum;
i = 0;
for ( b1 = bitCount; b1 < bitCount + 256; b1++ )
{
b1sum = *b1;
for ( b2 = bitCount; b2 < bitCount + 256; b2++ )
{
b2sum = b1sum + *b2;
for ( b3 = bitCount; b3 < bitCount + 256; b3++ )
{
sn[i++ + *b3 + b2sum] = 1;
}
}
// 1000000 does not provide a great termination number for base 2. So check
// here. Overshoots the target some but avoids repeated checks
if ( i > RANGE )
return;
}
}
int main( int argc, char* argv[] )
{
int i = 0;
__int64 t1;
memset( sn, 0, sizeof( sn ));
StartTimer( &t1 );
PreCountBits();
GenBase2();
printf( "Generation time = %.3f\n", StopTimer( t1 ));
#if 1
for ( i = 1; i <= RANGE; i++ )
if ( !sn[i] ) printf( "%d\n", i );
#endif
return 0;
}
Maybe try just computing the recurrence relation defined below?
http://en.wikipedia.org/wiki/Self_number