Measuring execution time in C++

EDIT: the code was modified following the advice given to me.
The purpose of the following code is to measure the execution time of a simple operation (1+1) and of a call to a function that does nothing (foo).
The code compiles and seems to work properly, but the results I am getting are weird: the basic operation seems to require about the same time as the function call, and most of the time it even takes a little longer.
Another issue is that the execution time does not seem to be affected by the number of iterations; it can be 100K or 100M, but the times are basically the same. Also, if I pick a number over one billion, the execution time seems to decrease.
I could not find what I need on Google. I must time a simple operation and an empty function, both of which should be in the same file as measureTimes(), or at least each measuring function should be contained wholly in a single file (and besides, moving foo() to another file actually reduced the times so far).
Right now, this is the output of my program:
instructionTimeNanoSecond: 1.9
functionTimeNanoSecond: 1.627
#include <iostream>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>
#include <math.h>
#include "osm.h"

#define INVALID_ITERATIONS 0
#define DEFAULT_ITERATIONS 1000
#define HOST_NAME_LEN 100
#define TO_NANO 1000
#define TO_MICRO 1000000
#define ROLL 10

using namespace std;

// measureTimes, osm_operation_time and osm_function_time are assumed to be
// declared in osm.h; otherwise forward declarations would be needed before main().
int main()
{
    unsigned int iterations = (unsigned int) pow(10, 9);
    measureTimes(iterations, iterations);
    return 0;
}

void foo();

timeMeasurmentStructure measureTimes(unsigned int operation_iterations,
                                     unsigned int function_iterations)
{
    double functionTimeNanoSecond;
    functionTimeNanoSecond = osm_function_time(function_iterations);
    cout << "functionTimeNanoSecond: " << functionTimeNanoSecond << "\n";

    double instructionTimeNanoSecond;
    instructionTimeNanoSecond = osm_operation_time(operation_iterations);
    cout << "instructionTimeNanoSecond: " << instructionTimeNanoSecond << "\n";
    // note: the timeMeasurmentStructure result is never filled in or returned here
}

double osm_operation_time(unsigned int iterations)
{
    timeval start;
    gettimeofday(&start, NULL);
    int x = 0;
    // the body is unrolled ROLL (10) times, so the loop runs iterations/ROLL times
    for (unsigned int i = 0; i < iterations / ROLL; i++)
    {
        x = x + 1;
        x = x + 1;
        x = x + 1;
        x = x + 1;
        x = x + 1;
        x = x + 1;
        x = x + 1;
        x = x + 1;
        x = x + 1;
        x = x + 1;
    }
    timeval end;
    gettimeofday(&end, NULL);
    timeval diff;
    timersub(&end, &start, &diff);
    // total elapsed time in microseconds, averaged per iteration, converted to nanoseconds
    double ret = ((double) diff.tv_sec * TO_MICRO + diff.tv_usec) / ((double) iterations);
    return ret * TO_NANO;
}

double osm_function_time(unsigned int iterations)
{
    timeval start;
    gettimeofday(&start, NULL);
    // same unrolling scheme as osm_operation_time, but timing an empty function call
    for (unsigned int i = 0; i < iterations / ROLL; i++)
    {
        foo();
        foo();
        foo();
        foo();
        foo();
        foo();
        foo();
        foo();
        foo();
        foo();
    }
    timeval end;
    gettimeofday(&end, NULL);
    timeval diff;
    timersub(&end, &start, &diff);
    double ret = ((double) diff.tv_sec * TO_MICRO + diff.tv_usec) / ((double) iterations);
    return ret * TO_NANO;
}

void foo()
{
    return;
}

In this case the compiler is optimizing your code and probably doesn't even execute the for loop, because the result of the 1+1 inside it doesn't affect the rest of the program.
If you do something like:
int x = 0;
for (int i = 0; i < iterations; i++)
    x = x + 1;
cout << x;
you will get more realistic results.
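Another way to keep the loop from being optimized away (a sketch of my own, not part of the original answer) is to make the accumulator volatile, so the compiler must perform every store even if the result is never printed:
volatile int x = 0;                     // volatile: the stores cannot be elided
for (unsigned int i = 0; i < iterations; i++)
    x = x + 1;                          // no longer removable as dead code
Note that volatile forces a load and a store on every iteration, so this measures a memory round-trip rather than a bare addition; using (e.g. printing) the result, as above, is the lighter-weight option.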

You are always calculating the average, which should be more or less the same regardless of the number of iterations:
double ret = diff.tv_usec / ((double) iterations);

The magic word you're looking for is profiling. It usually is (and should be) supported by your compiler.

Related

How to measure wallclock time in C++ instead of cpu time?

I would like to measure wallclock time taken by my algorithm in C++. Many articles point to this code.
clock_t begin_time, end_time;
begin_time = clock();
Algorithm();
end_time = clock();
cout << ((double)(end_time - begin_time)/CLOCKS_PER_SEC) << endl;
But this measures only the CPU time taken by my algorithm.
Some other article pointed out this code.
double getUnixTime(void)
{
struct timespec tv;
if(clock_gettime(CLOCK_REALTIME, &tv) != 0) return 0;
return (tv.tv_sec + (tv.tv_nsec / 1000000000.0));
}
double begin_time, end_time;
begin_time = getUnixTime();
Algorithm();
end_time = getUnixTime();
cout << (double) (end_time - begin_time) << endl;
I thought it would print the wall-clock time taken by my algorithm. But surprisingly, the time printed by this code is much lower than the CPU time printed by the previous code. So I am confused. Please provide code for printing wall-clock time.
Those times are probably down in the noise. To get a reasonable time measurement, try executing your algorithm many times in a loop:
const int loops = 1000000;
double begin_time, end_time;
begin_time = getUnixTime();
for (int i = 0; i < loops; ++i)
Algorithm();
end_time = getUnixTime();
cout << (double) (end_time - begin_time) / loops << endl;
I'm getting approximately the same times in a single threaded program:
#include <time.h>
#include <stdio.h>
__attribute((noinline)) void nop(void){}
void loop(unsigned long Cnt) { for(unsigned long i=0; i<Cnt;i++) nop(); }
int main()
{
clock_t t0,t1;
struct timespec ts0,ts1;
t0=clock();
clock_gettime(CLOCK_REALTIME,&ts0);
loop(1000000000);
t1=clock();
clock_gettime(CLOCK_REALTIME,&ts1);
printf("clock-diff: %lu\n", (unsigned long)((t1 - t0)/CLOCKS_PER_SEC));
printf("clock_gettime-diff: %lu\n", (unsigned long)((ts1.tv_sec - ts0.tv_sec)));
}
//prints 2 and 3 or 2 and 2 on my system
But clock's manpage only describes it as returning an approximation; there's no indication that this approximation is comparable to what clock_gettime returns.
Where I get drastically different results is where I throw in multiple threads:
#include <time.h>
#include <stdio.h>
#include <pthread.h>
__attribute((noinline)) void nop(void){}
void loop(unsigned long Cnt) {
for(unsigned long i=0; i<Cnt;i++) nop();
}
void *busy(void *A){ (void)A; for(;;) nop(); }
int main()
{
pthread_t ptids[4];
for(size_t i=0; i<sizeof(ptids)/sizeof(ptids[0]); i++)
pthread_create(&ptids[i], 0, busy, 0);
clock_t t0,t1;
struct timespec ts0,ts1;
t0=clock();
clock_gettime(CLOCK_REALTIME,&ts0);
loop(1000000000);
t1=clock();
clock_gettime(CLOCK_REALTIME,&ts1);
printf("clock-diff: %lu\n", (unsigned long)((t1 - t0)/CLOCKS_PER_SEC));
printf("clock_gettime-diff: %lu\n", (unsigned long)((ts1.tv_sec - ts0.tv_sec)));
}
//prints 18 and 4 on my 4-core linux system
That's because both musl and glibc on Linux use clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) to implement clock() and the CLOCK_PROCESS_CPUTIME_ID nonstandard clock is described in the clock_gettime manpage as returning time for all process threads together.
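For completeness, here is a minimal sketch (my own, using std::chrono rather than the C APIs above) that measures wall-clock time with a monotonic clock; Algorithm() is just a stand-in for the code under test:
#include <chrono>
#include <iostream>

void Algorithm()                                      // stand-in for the measured code
{
    volatile long x = 0;
    for (long i = 0; i < 100000000; ++i) x = x + i;
}

int main()
{
    auto begin = std::chrono::steady_clock::now();    // monotonic wall clock
    Algorithm();
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - begin;
    std::cout << "wall-clock time: " << elapsed.count() << " s\n";
}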

Looking for a single timing method to test various algorithms excluding their inputs

Assume that I have several algorithms to test as follows.
void AlgoA(int x)
{
// critical operations go here
}
void AlgoB(int x, int y)
{
// critical operations go here
}
First Approach
I define Timer, which accepts a pointer to a parameterless function.
void Timer(void (*f)(), unsigned short N = 1)
{
vector<unsigned long long> results;
for (unsigned short i = 0; i < N; i++)
{
chrono::steady_clock::time_point begin = chrono::steady_clock::now();
f();
chrono::steady_clock::time_point end = chrono::steady_clock::now();
unsigned long long interval = chrono::duration_cast<chrono::microseconds>(end - begin).count();
results.push_back(interval);
cout << "Elapsed time: " << interval << std::endl;
}
unsigned long long sum = 0;
for (unsigned long long x : results)
sum += x;
cout << "Average: " << sum / results.size() << endl;
}
Wrappers are needed to prepare the inputs.
void DoA()
{
int x;
// preparing x goes here
AlgoA(x);
}
void DoB()
{
int x, y;
// preparing x and y goes here
AlgoB(x, y);
}
int main()
{
Timer(DoA);
Timer(DoB);
}
Cons: Timer also counts the time elapsed for preparing inputs.
Pros: General Timer for many algo tests.
Second Approach
I have to write 2 timers, each for an algorithm to test.
void TimerA(void (*f)(int), int x, unsigned short N = 1)
{
vector<unsigned long long> results;
for (unsigned short i = 0; i < N; i++)
{
chrono::steady_clock::time_point begin = chrono::steady_clock::now();
f(x);
chrono::steady_clock::time_point end = chrono::steady_clock::now();
unsigned long long interval = chrono::duration_cast<chrono::microseconds>(end - begin).count();
results.push_back(interval);
cout << "Elapsed time: " << interval << std::endl;
}
unsigned long long sum = 0;
for (unsigned long long x : results)
sum += x;
cout << "Average: " << sum / results.size() << endl;
}
void TimerB(void (*f)(int, int), int x, int y, unsigned short N = 1)
{
vector<unsigned long long> results;
for (unsigned short i = 0; i < N; i++)
{
chrono::steady_clock::time_point begin = chrono::steady_clock::now();
f(x, y);
chrono::steady_clock::time_point end = chrono::steady_clock::now();
unsigned long long interval = chrono::duration_cast<chrono::microseconds>(end - begin).count();
results.push_back(interval);
cout << "Elapsed time: " << interval << std::endl;
}
unsigned long long sum = 0;
for (unsigned long long x : results)
sum += x;
cout << "Average: " << sum / results.size() << endl;
}
int main()
{
int x;
// preparing x goes here.
TimerA(AlgoA, x);
int y;
// preparing y goes here.
TimerB(AlgoB, x, y);
}
Pros: Timers count only the critical operations.
Cons: Multiple timers, each for an algo to test.
Question
Is there any way to create just a single Timer that does not count the time needed to prepare the inputs?
Edit:
The inputs are actually not just int but structs, etc., that may be retrieved from time-dependent I/O.
I have to admit, I don't see the problem. Maybe the problem is that you over-specified what you want to do. If you want to pass a callable to a function and call it, that would be:
template <typename F>
void call_it(F f) {
// start timer
f();
// stop timer
}
Now you can pass almost anything to it, for example:
int x = some_expensive_precalculation();
call_it( [&]() { method_to_time(x); });
Note that you might experience a tiny overhead due to not calling the function directly. However, I expect this to be negligible compared to anything that is worth measuring, and with compiler optimizations there might be no overhead at all.
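To make that concrete, here is a sketch of call_it with the timer filled in using std::chrono (my own wording of the idea above; some_expensive_precalculation and method_to_time are the hypothetical names from the example):
#include <chrono>
#include <iostream>

template <typename F>
void call_it(F f)
{
    auto begin = std::chrono::steady_clock::now();
    f();                                              // only the callable itself is timed
    auto end = std::chrono::steady_clock::now();
    std::cout << "Elapsed time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
              << " us" << std::endl;
}

// usage: prepare the input outside, time only the call
// int x = some_expensive_precalculation();
// call_it([&]() { method_to_time(x); });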
If you want to always prepare your inputs outside of the timer, you can use std::function with std::bind
void timer(std::function<void()> algorithm, unsigned short N = 1) {
// your timer code here
}
void algoA(int x)
{
// critical operations go here
}
void algoB(int x, int y)
{
// critical operations go here
}
int main() {
int x, y; // prepare input
timer(std::bind(algoA, x));
timer(std::bind(algoB, x, y));
}
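A lambda works just as well as std::bind here, since std::function<void()> accepts any callable with a matching signature (my note, not part of the answer):
timer([&] { algoA(x); });
timer([&] { algoB(x, y); });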
You can use variadic templates and forward the arguments to your test subject.
#include <chrono>
#include <iostream>
#include<vector>
using namespace std;
template<typename ... Args>
void benchmark(void (*f)(Args...), unsigned short N, Args&&... args)
{
vector<unsigned long long> results;
for (unsigned short i = 0; i < N; i++)
{
chrono::steady_clock::time_point begin = chrono::steady_clock::now();
f(std::forward<Args>(args)...);
chrono::steady_clock::time_point end = chrono::steady_clock::now();
unsigned long long interval = chrono::duration_cast<chrono::microseconds>(end - begin).count();
results.push_back(interval);
cout << "Elapsed time: " << interval << std::endl;
}
unsigned long long sum = 0;
for (unsigned long long x : results)
sum += x;
cout << "Average: " << sum / results.size() << endl;
}
void fun(int a, float b, char c) {
}
int main() {
benchmark(fun, 500, 42, 3.1415f, 'A');
return 0;
}
I think there is no way to have the default argument for N with this approach. But maybe this is not so important for you.
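One possible workaround (a sketch of mine, not tested against every call pattern) is an extra overload that supplies N itself and delegates to the template above; it works at least for the rvalue arguments used in the example:
template<typename ... Args>
void benchmark(void (*f)(Args...), Args&&... args)
{
    // delegate to the main overload with a "default" of one repetition
    benchmark(f, static_cast<unsigned short>(1), std::forward<Args>(args)...);
}

// usage: benchmark(fun, 42, 3.1415f, 'A');   // runs once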

How can I display "Time Elapsed" information from a program in C++?

I have a C++ program running in VS2010 as follows:
#include <windows.h>
#include <stdio.h>
#include <time.h>
long factorial(int n)
{
long fact = 1;
for (int counter = 1; counter <= n; counter++)
{
fact = fact * counter;
}
Sleep(100);
return fact;
}
int main(void)
{
LARGE_INTEGER freq;
LARGE_INTEGER t0, tF, tDiff;
double elapsedTime;
double resolution;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);
// code to be timed goes HERE
{
factorial(10);
}
QueryPerformanceCounter(&tF);
tDiff.QuadPart = tF.QuadPart - t0.QuadPart;
float deltaseconds = ((float)tDiff.QuadPart)/((float) freq.QuadPart);
elapsedTime = tDiff.QuadPart / (double) freq.QuadPart;
printf("Code under test took %lf sec\n", elapsedTime);
return 0;
}
This program displays the time taken by factorial. Now when I build it, in the build window I see:
1>Build succeeded.
1>Time Elapsed 00:00:01.36
Now, I want the "Time Elapsed" information to be displayed from my program as well. So please suggest how I can fetch this information and display it from my program itself, or by using some other interface.
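A minimal sketch (my own, not from an answer): the elapsedTime value the program already computes can be printed in the same HH:MM:SS.ss form the build window uses:
#include <cstdio>

void printElapsed(double elapsedTime)              // elapsedTime in seconds
{
    int hours   = (int)(elapsedTime / 3600.0);
    int minutes = ((int)(elapsedTime / 60.0)) % 60;
    double secs = elapsedTime - 3600.0 * hours - 60.0 * minutes;
    printf("Time Elapsed %02d:%02d:%05.2f\n", hours, minutes, secs);
}

int main()
{
    printElapsed(1.36);   // prints: Time Elapsed 00:00:01.36
    return 0;
}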

Check how long it takes for conhost to print

I have an array of booleans, each representing a number. I am printing each one that is true with a for loop:
for (unsigned long long l = 0; l < numt; l++)
    if (primes[l]) cout << l << endl;
numt is the size of the array and is over 1000000. The console window takes 30 seconds to print out all the values, but a timer I put in my program says 37 ms. How do I wait for all the values to finish printing on the screen so I can include that in my time?
Try this:
#include <windows.h>
...
int main() {
//init code
double startTime = GetTickCount();
//your loop
double timeNeededinSec = (GetTickCount() - startTime) / 1000.0;
}
Just in defense of ctime, because it gives the same result as GetTickCount:
#include <ctime>
int main()
{
...
clock_t start = clock();
...
clock_t end = clock();
double timeNeededinSec = static_cast<double>(end - start) / CLOCKS_PER_SEC;
...
}
Update:
And one with time(), but in this case we can lose some precision (~1 sec) because the result is in whole seconds.
#include <ctime>
int main()
{
time_t start;
time_t end;
...
time(&start);
...
time(&end);
int timeNeededinSec = static_cast<int>(end-start);
}
Combining both of them in a simple example will show you the difference in the results. In my tests I saw a difference only in the digits after the decimal point.

Self numbers in C++

Hey, my friends and I are trying to beat each other's runtimes for generating "Self Numbers" between 1 and a million. I've written mine in C++ and I'm still trying to shave off precious time.
Here's what I have so far,
#include <iostream>
using namespace std;
bool v[1000064]; // slightly larger than 1000000: i + digit sum can exceed the range
int main(void) {
    long non_self = 0;
    for (long i = 1; i < 1000000; ++i) {
        if (!(v[i])) std::cout << i << '\n';
        non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 + (i/100000)%10;
        v[non_self] = 1;
    }
    std::cout << "1000000" << '\n';
    return 0;
}
The code works fine now, I just want to optimize it.
Any tips? Thanks.
I built an alternate C solution that doesn't require any modulo or division operations:
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
int v[1100000];
int j1, j2, j3, j4, j5, j6, s, n=0;
memset(v, 0, sizeof(v));
for (j6=0; j6<10; j6++) {
for (j5=0; j5<10; j5++) {
for (j4=0; j4<10; j4++) {
for (j3=0; j3<10; j3++) {
for (j2=0; j2<10; j2++) {
for (j1=0; j1<10; j1++) {
s = j6 + j5 + j4 + j3 + j2 + j1;
v[n + s] = 1;
n++;
}
}
}
}
}
}
for (n=1; n<=1000000; n++) {
if (!v[n]) printf("%6d\n", n);
}
}
It generates 97786 self numbers including 1 and 1000000.
With output, it takes
real 0m1.419s
user 0m0.060s
sys 0m0.152s
When I redirect output to /dev/null, it takes
real 0m0.030s
user 0m0.024s
sys 0m0.004s
on my 3 Ghz quad core rig.
For comparison, your version produces the same number of numbers, so I assume we're either both correct or equally wrong; but your version chews up
real 0m0.064s
user 0m0.060s
sys 0m0.000s
under the same conditions, or about 2x as much.
That, or the fact that you're using longs, which is unnecessary on my machine. Here, int goes up to 2 billion. Maybe you should check INT_MAX on yours?
Update
I had a hunch that it may be better to calculate the sum piecewise. Here's my new code:
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
char v[1100000];
int j1, j2, j3, j4, j5, j6, s, n=0;
int s1, s2, s3, s4, s5;
memset(v, 0, sizeof(v));
for (j6=0; j6<10; j6++) {
for (j5=0; j5<10; j5++) {
s5 = j6 + j5;
for (j4=0; j4<10; j4++) {
s4 = s5 + j4;
for (j3=0; j3<10; j3++) {
s3 = s4 + j3;
for (j2=0; j2<10; j2++) {
s2 = s3 + j2;
for (j1=0; j1<10; j1++) {
v[s2 + j1 + n++] = 1;
}
}
}
}
}
}
for (n=1; n<=1000000; n++) {
if (!v[n]) printf("%d\n", n);
}
}
...and what do you know, that brought down the time for the top loop from 12 ms to 4 ms. Or maybe 8, my clock seems to be getting a bit jittery way down there.
State of affairs, Summary
The actual finding of self numbers up to 1M is now taking roughly 4 ms, and I'm having trouble measuring any further improvements. On the other hand, as long as output is to the console, it will continue to take about 1.4 seconds, my best efforts to leverage buffering notwithstanding. The I/O time so drastically dwarfs computation time that any further optimization would be essentially futile. Thus, although inspired by further comments, I've decided to leave well enough alone.
All times cited are on my (pretty fast) machine and are for comparison purposes with each other only. Your mileage may vary.
Generate the numbers once, copy the output into your code as a gigantic string. Print the string.
Those mods (%) look expensive. If you are allowed to move to base 16 (or even base 2), then you can probably code this a lot faster. If you have to stay in decimal, try creating an array of digits for each place (units, tens, hundreds) and build some rollover code. That will make summing up the numbers far easier.
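To illustrate the digit-array-with-rollover idea (my own sketch, not the answerer's code): keep the decimal digits of i and their running sum, and update them incrementally instead of using % and /:
#include <cstdio>

int main()
{
    static char v[1000064];          // sieve, with headroom: i + digit sum can exceed 1000000
    int digit[7] = {0};              // digit[0] = units, digit[1] = tens, ...
    int sum = 0;                     // running sum of the digits of i

    for (int i = 1; i <= 1000000; i++)
    {
        // "rollover" code: increment the units digit, carrying 9 -> 0 as needed
        int p = 0;
        while (digit[p] == 9) { sum -= 9; digit[p++] = 0; }
        digit[p]++;
        sum++;

        if (!v[i]) printf("%d\n", i);
        v[i + sum] = 1;              // i + (digit sum of i) is never a self number
    }
    return 0;
}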
Alternatively, you could recognise the behaviour of the core self function (let's call it s):
s = n + f(b,n)
where f(b,n) is the sum of the digits of the number n in base b.
For base 10, it's clear that as the ones (least significant) digit moves through 0,1,2,...,9, n and f(b,n) proceed in lockstep as you move from n to n+1; it's only in the 10% of cases where a 9 rolls over to 0 that they don't, so:
f(b,n+1) = f(b,n) + 1 // 90% of the time
thus the core self function s advances as
n+1 + f(b,n+1) = n + 1 + f(b,n) + 1 = n + f(b,n) + 2
s(n+1) = s(n) + 2 // again, 90% of the time
In the remaining (and easily identifiable) 10% of the time, the 9 rolls back to zero and adds one to the next digit, which in the simplest case subtracts (9-1) from the running total, but might cascade up through a series of 9s, to subtract 99-1, 999-1 etc.
So the first optimisation can remove most of the work from 90% of your cycles!
if ((n % 10) != 0)
{
n + f(b,n) = n-1 + f(b,n-1) + 2;
}
or
if ((n % 10) != 0)
{
s = old_s + 2;
}
That should be enough to substantially increase your performance without really changing your algorithm.
If you want more, then work out a simple algorithm for the change between iterations for the remaining 10%.
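Concretely, here is a sketch of that 90%/10% split applied to the original sieve (my own illustration; the array is made a little larger than 1,000,000 because i plus its digit sum can exceed it):
#include <iostream>

int digit_sum(long n)                    // full recomputation, needed only ~10% of the time
{
    int s = 0;
    for (; n > 0; n /= 10) s += n % 10;
    return s;
}

int main()
{
    static bool v[1000064];              // sieve with headroom
    long s = 0;                          // s = i + digit_sum(i), maintained incrementally
    for (long i = 1; i < 1000000; ++i)
    {
        if (i % 10 != 0)
            s += 2;                      // 90% case: i and its digit sum each grew by 1
        else
            s = i + digit_sum(i);        // a 9 (or several) rolled over: recompute
        if (!v[i]) std::cout << i << '\n';
        v[s] = true;
    }
    std::cout << 1000000 << '\n';
    return 0;
}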
If you want your output to be fast, it may be worth investigating replacing iostream output with plain old printf() - depends on the rules for winning the competition whether this is important.
Multithread (use different arrays/ranges for every thread). Also, don't use more threads than your number of CPU cores =)
cout or printf within a loop will be slow. If you can remove any prints from a loop, you will see a significant performance increase.
Since the range is limited (1 to 1000000) the maximum sum of the digits does not exceed 9*6 = 54. This means that to implement the sieve a circular buffer of 54 elements should be perfectly sufficient (and the size of the sieve grows very slowly as the range increases).
You already have a sieve-based solution, but it is based on pre-building the full-length buffer (sieve of 1000000 elements), which is rather inelegant (if not completely unacceptable). The performance of your solution also suffers from non-locality of memory access.
For example, this is a possible very simple implementation
#define N 1000000U
void print_self_numbers(void)
{
#define NMARKS 64U /* make it 64 just in case (and to make division work faster :) */
unsigned char marks[NMARKS] = { 0 };
unsigned i, imark;
for (i = 1, imark = i; i <= N; ++i, imark = (imark + 1) % NMARKS)
{
unsigned digits, sum;
if (!marks[imark])
printf("%u ", i);
else
marks[imark] = 0;
sum = i;
for (digits = i; digits > 0; digits /= 10)
sum += digits % 10;
marks[sum % NMARKS] = 1;
}
}
(I'm not going for the best possible performance in terms of CPU clocks here, just illustrating the key idea with the circular buffer.)
Of course, the range can be easily turned into a parameter of the function, while the size of the circular buffer can be easily calculated at run-time from the range.
As for "optimizations"... There's no point in trying to optimize the code that contains I/O operations. You won't achieve anything by such optimizations. If you want to analyze the performance of the algorithm itself, you'll have to put the generated numbers into an output array and print them later.
For such a simple task, the best option would be to think of alternative algorithms to produce the same result. %10 is not usually considered a fast operation.
Why not use the recurrence relation given on the wikipedia page instead?
That should be blazingly fast.
EDIT: Ignore this... the recurrence relation generates some, but not all, of the self numbers.
In fact, only very few of them. That's not particularly clear from the Wikipedia page though :(
This may help speed up C++ iostreams output:
cin.tie(0);
ios::sync_with_stdio(false);
Put them in main before you start writing to cout.
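In context that looks something like this (a small sketch of mine; using '\n' instead of endl is an extra point, since endl forces a flush on every line):
#include <iostream>

int main()
{
    std::ios::sync_with_stdio(false);    // stop synchronizing iostreams with C stdio
    std::cin.tie(0);                     // don't flush cout before every cin read
    for (int i = 0; i < 1000000; ++i)
        std::cout << i << '\n';          // '\n' rather than endl avoids a flush per line
    return 0;
}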
I created a CUDA-based solution based on Carl Smotricz's second algorithm. The code to identify Self Numbers itself is extremely fast -- on my machine it executes in ~45 nanoseconds; this is about 150 x faster than Carl Smotricz's algorithm, which ran in 7 milliseconds on my machine.
There is a bottleneck, however, and that seems to be the PCIe interface. It took my code a whopping 43 milliseconds to move the computed data from the graphics card back to RAM. This might be optimizable, and I will look in to this.
Still, 45 nanoseconds is pretty darn fast. Scary fast, actually, and I added code to my program which runs Carl Smotricz's algorithm and compares the results for accuracy. The results are accurate. Here is the program output (compiled in VS2008 64-bit, Windows 7):
UPDATE
I recompiled this code in release mode with full optimization and using static runtime libraries, with significant results. The optimizer seems to have done very well with Carl's algorithm, reducing the runtime from 7 ms to 1 ms. The CUDA implementation sped up as well, from 35 us to 20 us. The memory copy from video card to RAM was unaffected.
Program Output:
Running on device: 'Quadro NVS 295'
Reference Implementation Ran In 15603 ticks (7 ms)
Kernel Executed in 40 ms -- Breakdown:
[kernel] : 35 us (0.09%)
[memcpy] : 40 ms (99.91%)
CUDA Implementation Ran In 111889 ticks (51 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The code is as follows:
file : main.h
#pragma once
#include <cstdlib>
#include <cstdio>     // sprintf
#include <cmath>      // floor, fmod
#include <string>     // std::string
#include <utility>    // std::pair, std::make_pair
#include <functional>
typedef std::pair<int*, size_t> sized_ptr;
static sized_ptr make_sized_ptr(int* ptr, size_t size)
{
return std::make_pair(ptr, size);
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMemory, unsigned const blocks, unsigned const threads);
inline std::string format_elapsed(double d)
{
char buf[256] = {0};
if( d < 0.00000001 )
{
// show in ps with 4 digits
sprintf(buf, "%0.4f ps", d * 1000000000000.0);
}
else if( d < 0.00001 )
{
// show in ns
sprintf(buf, "%0.0f ns", d * 1000000000.0);
}
else if( d < 0.001 )
{
// show in us
sprintf(buf, "%0.0f us", d * 1000000.0);
}
else if( d < 0.1 )
{
// show in ms
sprintf(buf, "%0.0f ms", d * 1000.0);
}
else if( d <= 60.0 )
{
// show in seconds
sprintf(buf, "%0.2f s", d);
}
else if( d < 3600.0 )
{
// show in min:sec
sprintf(buf, "%01.0f:%02.2f", floor(d/60.0), fmod(d,60.0));
}
// show in h:min:sec
else
sprintf(buf, "%01.0f:%02.0f:%02.2f", floor(d/3600.0), floor(fmod(d,3600.0)/60.0), fmod(d,60.0));
return buf;
}
inline std::string format_pct(double d)
{
char buf[256] = {0};
sprintf(buf, "%.2f", 100.0 * d);
return buf;
}
file: main.cpp
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include "C:\CUDA\include\cuda_runtime.h"
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
#include <cmath>
#include <map>
#include <algorithm>
#include <list>
#include "main.h"
int main()
{
unsigned numVals = 1000000;
int* gold = new int[numVals];
memset(gold, 0, sizeof(int)*numVals);
LARGE_INTEGER li = {0}, li2 = {0};
QueryPerformanceFrequency(&li);
__int64 freq = li.QuadPart;
// get cuda properties...
cudaDeviceProp cdp = {0};
cudaError_t err = cudaGetDeviceProperties(&cdp, 0);
cout << "Running on device: '" << cdp.name << "'" << endl;
// first run the reference implementation
QueryPerformanceCounter(&li);
for( int j6=0, n = 0; j6<10; j6++ )
{
for( int j5=0; j5<10; j5++ )
{
for( int j4=0; j4<10; j4++ )
{
for( int j3=0; j3<10; j3++ )
{
for( int j2=0; j2<10; j2++ )
{
for( int j1=0; j1<10; j1++ )
{
int s = j6 + j5 + j4 + j3 + j2 + j1;
gold[n + s] = 1;
n++;
}
}
}
}
}
}
QueryPerformanceCounter(&li2);
__int64 ticks = li2.QuadPart-li.QuadPart;
cout << "Reference Implementation Ran In " << ticks << " ticks" << " (" << format_elapsed((double)ticks/(double)freq) << ")" << endl;
// now run the cuda version...
unsigned threads = cdp.maxThreadsPerBlock;
unsigned blocks = numVals/threads;
if( numVals%threads ) ++blocks;
unsigned computeSlots = blocks * threads; // this may be != the number of vals since we want 32-thread warps
// allocate device memory for test
int* deviceTest = 0;
err = cudaMalloc(&deviceTest, sizeof(int)*computeSlots);
err = cudaMemset(deviceTest, 0, sizeof(int)*computeSlots);
int* hostTest = new int[numVals]; // the repository for the resulting data on the host
memset(hostTest, 0, sizeof(int)*numVals);
// run the CUDA code...
LARGE_INTEGER li3 = {0}, li4={0};
QueryPerformanceCounter(&li3);
ComputeSelfNumbers(make_sized_ptr(hostTest, numVals), make_sized_ptr(deviceTest, computeSlots), blocks, threads);
QueryPerformanceCounter(&li4);
__int64 ticksCuda = li4.QuadPart-li3.QuadPart;
cout << "CUDA Implementation Ran In " << ticksCuda << " ticks" << " (" << format_elapsed((double)ticksCuda/(double)freq) << ")" << endl;
cout << "Compute Slots: " << computeSlots << " (" << blocks << " blocks X " << threads << " threads)" << endl;
unsigned errorCount = 0;
for( size_t i = 0; i < numVals; ++i )
{
if( gold[i] != hostTest[i] )
{
++errorCount;
}
}
cout << "Number of Errors: " << errorCount << endl;
return 0;
}
file: self.cu
#pragma warning( disable : 4231)
#include <windows.h>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
#include "main.h"
__global__ void SelfNum(int * slots)
{
__shared__ int N;
N = (blockIdx.x * blockDim.x) + threadIdx.x;
const int numDigits = 10;
__shared__ int digits[numDigits];
for( int i = 0, temp = N; i < numDigits; ++i, temp /= 10 )
{
digits[numDigits-i-1] = temp - 10 * (temp/10) /*temp % 10*/;
}
__shared__ int s;
s = 0;
for( int i = 0; i < numDigits; ++i )
s += digits[i];
slots[N+s] = 1;
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMem, const unsigned blocks, const unsigned threads)
{
LARGE_INTEGER li = {0};
QueryPerformanceFrequency(&li);
double freq = (double)li.QuadPart;
LARGE_INTEGER liStart = {0};
QueryPerformanceCounter(&liStart);
// run the kernel
SelfNum<<<blocks, threads>>>(deviceMem.first);
LARGE_INTEGER liKernel = {0};
QueryPerformanceCounter(&liKernel);
cudaMemcpy(hostMem.first, deviceMem.first, hostMem.second*sizeof(int), cudaMemcpyDeviceToHost); // dont copy the overflow - just throw it away
LARGE_INTEGER liMemcpy = {0};
QueryPerformanceCounter(&liMemcpy);
// display performance stats
double e = double(liMemcpy.QuadPart - liStart.QuadPart)/freq,
eKernel = double(liKernel.QuadPart - liStart.QuadPart)/freq,
eMemcpy = double(liMemcpy.QuadPart - liKernel.QuadPart)/freq;
double pKernel = eKernel/e,
pMemcpy = eMemcpy/e;
cout << "Kernel Executed in " << format_elapsed(e) << " -- Breakdown: " << endl
<< " [kernel] : " << format_elapsed(eKernel) << " (" << format_pct(pKernel) << "%)" << endl
<< " [memcpy] : " << format_elapsed(eMemcpy) << " (" << format_pct(pMemcpy) << "%)" << endl;
}
UPDATE2:
I refactored my CUDA implementation to try to speed it up a bit. I did this by unrolling loops manually, fixing some questionable use of __shared__ memory which might have been an error, and getting rid of some redundancy.
The output of my new kernel is:
Reference Implementation Ran In 69610 ticks (5 ms)
Kernel Executed in 2 ms -- Breakdown:
[kernel] : 39 us (1.57%)
[memcpy] : 2 ms (98.43%)
CUDA Implementation Ran In 62970 ticks (4 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The only code I changed is the kernel itself, so that's all I will post here:
__global__ void SelfNum(int * slots)
{
int N = (blockIdx.x * blockDim.x) + threadIdx.x;
int s = 0;
int temp = N;
s += temp - 10 * (temp/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
slots[N+s] = 1;
}
I wonder if multi-threading would help. This algorithm looks like it would lend itself well to multi-threading. (Poor-man's test of this: Create two copies of the program and run them at the same time. If it runs in less than 200% of the time, multi-threading may help).
I was actually surprised that the code below was faster than any other posted here. I probably measured it wrong, but maybe it helps; or at least it's interesting.
#include <iostream>
#include <cstring>    // memset
#include <algorithm>  // std::for_each
#include <boost/progress.hpp>
class SelfCalc
{
private:
bool array[1000000];
int non_self;
public:
SelfCalc()
{
memset(&array, 0, sizeof(array));
}
void operator()(const int i)
{
if (!(array[i]))
std::cout << i << '\n';
non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
array[non_self] = true;
}
};
class IntIterator
{
private:
int value;
public:
IntIterator(const int _value):value(_value){}
int operator*(){ return value; }
bool operator!=(const IntIterator &v){ return value != v.value; }
int operator++(){ return ++value; }
};
int main()
{
boost::progress_timer t;
SelfCalc selfCalc;
IntIterator i(1), end(100000);
std::for_each(i, end, selfCalc);
std::cout << 100000 << std::endl;
return 0;
}
Fun problem. The problem as stated does not specify what base it must be in. I fiddled around with it some and wrote a base-2 version. It generates an extra few thousand entries because the termination point of 1,000,000 is not as natural with base-2. This pre-counts the number of bits in a byte for a table lookup. The generation of the result set (without the I/O) took 2.4 ms.
One interesting thing (assuming I wrote it correctly) is that the base-2 version has about 250,000 "self numbers" up to 1,000,000 while there are just under 100,000 base-10 self numbers in that range.
#include <windows.h>
#include <stdio.h>
#include <string.h>
void StartTimer( _int64 *pt1 )
{
QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}
double StopTimer( _int64 t1 )
{
_int64 t2, ldFreq;
QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&ldFreq );
return ((double)( t2 - t1 ) / (double)ldFreq) * 1000.0;
}
#define RANGE 1000000
char sn[0x100000 + 32];
int bitCount[256];
// precompute bitcounts for each byte
void PreCountBits()
{
int i;
// generate count of bits in each byte
memset( bitCount, 0, sizeof( bitCount ));
for ( i = 0; i < 256; i++ )
{
int tmp = i;
while ( tmp )
{
if ( tmp & 0x01 )
bitCount[i]++;
tmp >>= 1;
}
}
}
void GenBase2( )
{
int i;
int *b1, *b2, *b3;
int b1sum, b2sum, b3sum;
i = 0;
for ( b1 = bitCount; b1 < bitCount + 256; b1++ )
{
b1sum = *b1;
for ( b2 = bitCount; b2 < bitCount + 256; b2++ )
{
b2sum = b1sum + *b2;
for ( b3 = bitCount; b3 < bitCount + 256; b3++ )
{
sn[i++ + *b3 + b2sum] = 1;
}
}
// 1000000 does not provide a great termination number for base 2. So check
// here. Overshoots the target some but avoids repeated checks
if ( i > RANGE )
return;
}
}
int main( int argc, char* argv[] )
{
int i = 0;
__int64 t1;
memset( sn, 0, sizeof( sn ));
StartTimer( &t1 );
PreCountBits();
GenBase2();
printf( "Generation time = %.3f\n", StopTimer( t1 ));
#if 1
for ( i = 1; i <= RANGE; i++ )
if ( !sn[i] ) printf( "%d\n", i );
#endif
return 0;
}
Maybe try just computing the recurrence relation defined below?
http://en.wikipedia.org/wiki/Self_number