How to make comparable time measurements in CUDA and C++ code

I have a CUDA and a C++ implementation of the same algorithm. In CUDA I do the time measurement with events:
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0); // start time measurement
// some cuda stuff
cudaEventRecord(stop, 0); // stop time measurement
cudaEventSynchronize(stop); // sync results
cudaEventElapsedTime(&time, start, stop);
printf ("Elapsed time : %f ms\n", time);
In C++ I measure with gettimeofday:
struct timeval start, end;
long seconds, useconds;
float mseconds;
gettimeofday(&start, NULL);
// some work to do
gettimeofday(&end, NULL);
seconds = end.tv_sec - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
mseconds = (seconds * 1000 + useconds/1000.0) + 0.5;
printf ("Elapsed time : %f ms\n", mseconds);
Is this the correct way to get good, comparable results?
Thanks in advance!

Yes, this is a good way to get CPU-vs-GPU time comparisons.
There are multiple ways to get CPU timings, of course, ranging from high-resolution system timers to __rdtsc intrinsics. But for such a coarse comparison either should work just fine.
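For instance, a rough sketch of both (the workload below is just a placeholder loop, and __rdtsc needs <x86intrin.h> with GCC/Clang or <intrin.h> with MSVC):
#include <chrono>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc (use <intrin.h> with MSVC)

int main()
{
    auto t0 = std::chrono::steady_clock::now();   // high-resolution system timer
    unsigned long long c0 = __rdtsc();            // raw cycle counter

    volatile double x = 0.0;                      // placeholder CPU workload
    for (int i = 0; i < 10000000; ++i)
        x = x + i * 0.5;

    unsigned long long c1 = __rdtsc();
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("Elapsed time : %f ms (%llu TSC cycles)\n", ms, c1 - c0);
    // Converting TSC cycles to time requires the TSC frequency; for a coarse
    // CPU-vs-GPU comparison the chrono value is all you need.
    return 0;
}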
If you want to dig deeper into your GPU performance and look for potential areas of improvement, you may want to look at the command-line CUDA profiler nvprof, or at the Visual Profiler, which does the same thing but also has a GUI.

If you simply want to compare the whole execution time of your CUDA-related code, you can keep your C++ time measurement. Just make sure the device has finished every single task it was given before reading the elapsed time:
gettimeofday(&start, NULL);
// some work to do
cudaDeviceSynchronize();
gettimeofday(&end, NULL);
This is a simple way to compare how much time your tasks take on the device side with CUDA against the CPU-side version.
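Put together, a minimal sketch of that pattern could look like this (the CUDA work itself is left as a placeholder comment, so only the measurement scaffolding is shown):
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

int main()
{
    struct timeval start, end;
    gettimeofday(&start, NULL);

    // ... launch kernels / issue asynchronous copies here ...

    cudaDeviceSynchronize();   // block until the device has finished all queued work
    gettimeofday(&end, NULL);

    float mseconds = (end.tv_sec - start.tv_sec) * 1000.0f
                   + (end.tv_usec - start.tv_usec) / 1000.0f;
    printf("Elapsed time : %f ms\n", mseconds);
    return 0;
}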
As suggested by ApoorvaJ, if you need to dig deeper into CUDA performance to find where the device bottlenecks are, you can use the Visual Profiler. If you are using Visual Studio, check these steps I wrote for another SO user who wanted to inspect the PTX code. You just have to explore the other data the Visual Profiler can provide, and there is a lot!
Check the Profiler section of the official CUDA documentation from NVIDIA.

Related

Linux measuring time problem! std::chrono, QueryPerformanceCounter, clock_gettime

I use clock_gettime() in Linux and QueryPerformanceCounter() in Windows to measure time. When measuring time, I encountered an interesting case.
First, I calculate DeltaTime in an infinite while loop. This loop calls some update functions. To test the DeltaTime calculation, the program waits 40 milliseconds in an Update function, because the update functions are still empty.
Then, in the program compiled as Win64-Debug, I measure DeltaTime. It's approximately 0.040f, and it stays that way as long as the program is running (Win64-Release works like that too). It runs correctly.
But in the program compiled as Linux64-Debug or Linux64-Release, there is a problem.
When the program starts running, everything is normal: DeltaTime is approximately 0.040f. But after a while, DeltaTime is calculated as 0.12XXf or 0.132XX, and immediately afterwards it is back to 0.040f. And so on.
I thought I was using QueryPerformanceCounter correctly and clock_gettime() incorrectly. Then I decided to try the standard library's std::chrono::high_resolution_clock, but it's the same. No change.
#define MICROSECONDS (1000*1000)

auto prev_time = std::chrono::high_resolution_clock::now();
decltype(prev_time) current_time;
while (1)
{
    current_time = std::chrono::high_resolution_clock::now();
    int64_t deltaTime = std::chrono::duration_cast<std::chrono::microseconds>(current_time - prev_time).count();
    printf("DeltaTime: %f\n", deltaTime / (float)MICROSECONDS);
    NetworkManager::instance().Update();
    prev_time = current_time;
}

void NetworkManager::Update()
{
    auto start = std::chrono::high_resolution_clock::now();
    decltype(start) end;
    while (1)
    {
        end = std::chrono::high_resolution_clock::now();
        int64_t y = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        if (y / (float)MICROSECONDS >= 0.040f)
            break;
    }
    return;
}
[Screenshots of the normal and the problematic DeltaTime output omitted.]
Possible causes:
Your clock_gettime is not using the vDSO and is making a system call instead. This will be visible if you run the program under strace, and it can be configured on modern kernel versions.
Your thread gets preempted (taken off the CPU by the scheduler). To run a clean experiment, run your app with real-time priority and pinned to a specific CPU core (see the sketch below).
Also, I would disable CPU frequency scaling while experimenting.
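A minimal Linux sketch of the second point; the core number and the priority are arbitrary choices, and SCHED_FIFO usually requires root or CAP_SYS_NICE:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main()
{
    // Pin the process to CPU core 2 (arbitrary choice).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    // Request real-time FIFO scheduling.
    struct sched_param sp;
    sp.sched_priority = 80;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    // ... run the timing loop from the question here ...
    return 0;
}
The same can also be done from the shell with taskset and chrt.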

execution time in multithreading environment

I am trying to measure a multi-threaded program's execution time. I've used this piece of code in the main program for calculating time:
clock_t startTime = clock();
//do stuff
clock_t stopTime = clock();
float secsElapsed = (float)(stopTime - startTime)/CLOCKS_PER_SEC;
Now the problem I have is this: when I run my program with, for example, 4 threads (each thread running on one core), the measured execution time is 21.39, but the system monitor shows the program running for only about 5.3 seconds.
It seems that the measured execution time is multiplied by the number of threads.
What is the problem?
It is because you are measuring CPU time, which is the accumulated time spent by the CPUs executing your code, and not wall time, which is the real-world time elapsed between your startTime and stopTime.
Indeed, clock():
Returns the processor time consumed by the program.
If you do the maths: 5.3 * 4 = 21.2, which is roughly what you obtain, meaning that you have good multithreaded code with a speedup close to 4.
So to measure the wall time, you should rather use std::chrono::high_resolution_clock, for instance, and you should get back about 5.3 seconds. You can also use the classic gettimeofday().
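A small sketch of the difference; the thread count and the busy work are only for illustration:
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>
#include <vector>

int main()
{
    clock_t cpuStart = clock();                                  // CPU time
    auto wallStart = std::chrono::high_resolution_clock::now();  // wall time

    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            volatile double x = 0.0;
            for (long i = 0; i < 100000000L; ++i) x = x + i * 0.5;
        });
    for (auto &w : workers) w.join();

    clock_t cpuStop = clock();
    auto wallStop = std::chrono::high_resolution_clock::now();

    double cpuSecs  = double(cpuStop - cpuStart) / CLOCKS_PER_SEC;
    double wallSecs = std::chrono::duration<double>(wallStop - wallStart).count();
    // On platforms where clock() reports CPU time, cpuSecs is roughly 4x wallSecs.
    printf("CPU time: %.2f s, wall time: %.2f s\n", cpuSecs, wallSecs);
    return 0;
}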
If you use OpenMP for multithreading, you also have omp_get_wtime():
double startTime = omp_get_wtime();
// do stuff
double stopTime = omp_get_wtime();
double secsElapsed = stopTime - startTime; // that's all !

Getting milliseconds accuracy current time in Qt

Qt documentation about QTime::currentTime() says:
Note that the accuracy depends on the accuracy of the underlying
operating system; not all systems provide 1-millisecond accuracy.
But is there any way to get this time with millisecond accuracy on Windows 7?
You can use the QDateTime class and convert the current time to a string with the appropriate format:
QDateTime::currentDateTime().toString("yyyy/MM/dd hh:mm:ss,zzz")
where 'z' corresponds to millisecond accuracy.
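If you only need a numeric timestamp rather than a formatted string, QDateTime also provides currentMSecsSinceEpoch() (Qt 4.7 and later), for example:
#include <QDateTime>
#include <QDebug>

int main()
{
    qint64 before = QDateTime::currentMSecsSinceEpoch();
    // ... do something ...
    qint64 after = QDateTime::currentMSecsSinceEpoch();
    qDebug() << "Elapsed:" << (after - before) << "ms";
    return 0;
}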
You can use the functionality provided by the time.h header file in C/C++:
#include <time.h>

clock_t start, end;
double cpu_time_used;

int main()
{
    start = clock();
    /* Do the work. */
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
}
Timer resolution may vary between platforms, and readings may not be accurate. If you need high-resolution, accurate timestamps on Windows 7, it provides the QPC API:
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
GetSystemTimePreciseAsFileTime is claimed to provide system time with <1 us resolution.
But that is only about getting an accurate timestamp. If you need to actually do something with 1 ms latency (e.g. handle an event), you need an RTOS, not a desktop clunker.
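For reference, a minimal QPC sketch (error handling omitted):
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // ticks per second

    QueryPerformanceCounter(&start);
    // ... do the work to be timed ...
    QueryPerformanceCounter(&stop);

    double ms = (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    printf("Elapsed time : %f ms\n", ms);
    return 0;
}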
One common way is to scale up whatever you are doing and do it 10-100 times in a row; that way you can get a more accurate time reading for whatever you are doing, by dividing the result by 10-100.
But getting millisecond-precise readings of your time is pretty much useless, because you never have 100% of the CPU time. Your readings will have a variance much greater than 1 millisecond whenever the OS gives another process computing time in the middle of your actions.
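In code, the scaling idea is simply this (N and the work in the loop are placeholders):
#include <chrono>
#include <cstdio>

int main()
{
    const int N = 100;   // repeat the operation N times

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
    {
        // ... the operation you want to time ...
    }
    auto stop = std::chrono::steady_clock::now();

    double totalMs = std::chrono::duration<double, std::milli>(stop - start).count();
    printf("Average per iteration: %f ms\n", totalMs / N);
    return 0;
}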

Measuring execution time of OpenCL kernels

I have the following loop that measures the time of my kernels:
double elapsed = 0;
cl_ulong time_start, time_end;
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
{
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &event); checkErr(err, "Kernel run");
    err = clWaitForEvents(1, &event); checkErr(err, "Kernel run wait for event");
    err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL); checkErr(err, "Kernel run get time start");
    err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL); checkErr(err, "Kernel run get time end");
    elapsed += (time_end - time_start);
}
Then I divide elapsed by NUMBER_OF_ITERATIONS to get the final estimate. However, I am afraid the execution time of individual kernels is too small and hence can introduce uncertainty into my measurement. How can I measure the time taken by all NUMBER_OF_ITERATIONS kernels combined?
Can you also suggest a profiling tool that could help with this, since I do not need to access this data programmatically? I use NVIDIA's OpenCL.
You need to follow these steps to measure the execution time of an OpenCL kernel:
Create a queue; profiling needs to be enabled when the queue is created:
cl_command_queue command_queue;
command_queue = clCreateCommandQueue(context, devices[deviceUsed], CL_QUEUE_PROFILING_ENABLE, &err);
Attach an event when launching the kernel:
cl_event event;
err = clEnqueueNDRangeKernel(queue, kernel, workdim, NULL, workgroupsize, NULL, 0, NULL, &event);
Wait for the kernel to finish
clWaitForEvents(1, &event);
Wait for all enqueued tasks to finish
clFinish(queue);
Get the profiling data and calculate the kernel execution time (the OpenCL API returns times in nanoseconds):
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = time_end-time_start;
printf("OpenCL Execution time is: %0.3f milliseconds \n", nanoSeconds / 1000000.0);
The profiling function returns nanoseconds and is very accurate (~50 ns); however, the kernel has different execution times from run to run, depending on minor issues you can't control.
This reduces your problem to deciding what exactly you want to measure:
Measuring the kernel execution time: your approach is correct, and the accuracy of the measured average execution time will increase as you increase N. This accounts only for the execution time; no overhead is taken into consideration.
Measuring the kernel execution time + overhead: you should use events as well, but measure from CL_PROFILING_COMMAND_SUBMIT to account for the extra launch overhead (see the sketch after this list).
Measuring the real host-side execution time: you should use events as well, but measure from the first event's start to the last event's end. Using a CPU timing measurement is another possibility. If you want to measure this, then you should remove the clWaitForEvents() call from the loop, to allow maximum throughput to the OpenCL system (and as little overhead as possible).
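For the second case, the only change to the loop in the question is which profiling parameter you query; a sketch (error handling omitted):
cl_ulong time_submit, time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT, sizeof(time_submit), &time_submit, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double ms = (time_end - time_submit) / 1000000.0;   // submit-to-end, so launch overhead is included
printf("Kernel time incl. overhead: %0.3f ms\n", ms);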
Answering the tools question: I recommend the NVIDIA Visual Profiler. But since it is no longer available for OpenCL, you should use the Visual Studio add-on or an old version (CUDA 3.0) of the profiler.
The time that is measured is returned in nanoseconds, but you're right: the resolution of the timer is lower. However, I wonder what the actual execution time of your kernel is when you say that the time is too short to measure accurately (my gut feeling is that the resolution should be in the microsecond range).
The most appropriate way of measuring the total time of multiple iterations depends on what "multiple" means here. Is NUMBER_OF_ITERATIONS=5 or NUMBER_OF_ITERATIONS=500000? If the number of iterations is "large", you may just use the system clock, possibly with OS-specific functions like QueryPerformanceCounter on Windows (also see, for example, "Is there a way to measure time up to micro seconds using C standard library?"), but of course the precision of the system clock might be lower than that of the OpenCL device, so whether this makes sense really depends on the number of iterations.
It's a pity that NVIDIA removed OpenCL support from their Visual Profiler, though...
On Intel's OpenCL GPU implementation I've had success with your approach (timing per kernel) and prefer it to batching a stream of NDRanges.
An alternate approach is to run it N times and measure time with marker events, as in the approach proposed in this question (the question, not the answer).
Times for short kernels are generally at least in the microsecond realm, in my experience.
You can check the timer resolution using clGetDeviceInfo with CL_DEVICE_PROFILING_TIMER_RESOLUTION (e.g. 80 ns on my setup).
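For example (device is whatever cl_device_id you are already using):
size_t timer_res = 0;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION, sizeof(timer_res), &timer_res, NULL);
printf("Profiling timer resolution: %zu ns\n", timer_res);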

Best option to profile CPU use in my program?

I am profiling CPU usage on a simple program I am writing. I have different algorithms I want to try, and I also want to know what's the impact on the total system performance.
Currently, I am using ualarm() to execute some instructions at 30 Hz; every 15 of those interrupts (every 0.5 s) I record the CPU time with getrusage() (in microseconds), so I have an estimate of the total CPU consumption up to that point in time. But to put that in context, I also need to know the total time elapsed in the system over that period, so I can compute the percentage used by my program.
/* Main Loop */
while (1)
{
    alarm = 0;
    /* Waiting Loop: */
    for (i = 0; !alarm; i++) {
    }
    count++;

    /* Do my things */

    /* Check if it's time to store cpu log: */
    if ((count % count_max) == 0)
    {
        getrusage(RUSAGE_SELF, &ru);
        store_cpulog(f,
            (int64_t) ru.ru_utime.tv_sec,
            (int64_t) ru.ru_utime.tv_usec,
            (int64_t) ru.ru_stime.tv_sec,
            (int64_t) ru.ru_stime.tv_usec);
    }
}
I have different options, but I don't know which one will give the most accurate result:
Use ualarm for the timing. Currently it's programmed to signal every 0.5 seconds, so I can take those 0.5 seconds as the elapsed time. It seems the obvious thing to use, but is it the best option?
Use clock_gettime(CLOCK_MONOTONIC): it provides readings with nanosecond resolution (see the sketch after this list).
Use gettimeofday(): it provides readings with microsecond resolution. I've found opinions against using it.
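For example, with the clock_gettime option I imagine something like this (just an untested sketch):
#include <time.h>
#include <stdio.h>
#include <sys/resource.h>

/* Untested sketch: CPU% over an interval = CPU time used / wall time elapsed. */
int main()
{
    struct timespec wall_start, wall_end;
    struct rusage ru_start, ru_end;

    clock_gettime(CLOCK_MONOTONIC, &wall_start);
    getrusage(RUSAGE_SELF, &ru_start);

    /* ... the 0.5 s measurement interval ... */

    getrusage(RUSAGE_SELF, &ru_end);
    clock_gettime(CLOCK_MONOTONIC, &wall_end);

    double wall = (wall_end.tv_sec - wall_start.tv_sec)
                + (wall_end.tv_nsec - wall_start.tv_nsec) / 1e9;
    double cpu  = (ru_end.ru_utime.tv_sec - ru_start.ru_utime.tv_sec)
                + (ru_end.ru_utime.tv_usec - ru_start.ru_utime.tv_usec) / 1e6
                + (ru_end.ru_stime.tv_sec - ru_start.ru_stime.tv_sec)
                + (ru_end.ru_stime.tv_usec - ru_start.ru_stime.tv_usec) / 1e6;
    printf("CPU usage: %.1f %%\n", 100.0 * cpu / wall);
    return 0;
}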
Any recommendation? Thanks.
A possible solution is to use the time command and to avoid the busy loop (as #Hasturkun says) in your program. Run in a console:
time /path/to/my/program
and after it finishes you get something like:
real 0m1.465s
user 0m0.000s
sys 0m1.210s
I'm not sure whether the precision is enough for you.
Callgrind is possibly the best application for profiling C/C++ code under Linux. Use it with pride :)