Why does my "Hello world" program take almost 10s? - c++

I have installed the CUDA runtime and drivers, version 7.0, on my workstation (Ubuntu 14.04, 2x Intel XEON E5 + 4x Tesla K20m). I've used the following program to check whether my installation works:
#include <stdio.h>

__global__ void helloFromGPU()
{
    printf("Hello World from GPU!\n");
}

int main(int argc, char **argv)
{
    printf("Hello World from CPU!\n");
    helloFromGPU<<<1, 1>>>();
    printf("Hello World from CPU! Again!\n");
    cudaDeviceSynchronize();
    printf("Hello World from CPU! Yet again!\n");
    return 0;
}
I get the correct output, but it takes an enormous amount of time:
$ nvcc hello.cu -O2
$ time ./hello > /dev/null
real 0m8.897s
user 0m0.004s
sys 0m1.017s
If I remove all device code, the overall execution takes 0.001s. So why does my simple program take almost 10 seconds?

The apparent slow runtime of your example is due to the underlying fixed cost of setting up the GPU context.
Because you are running on a platform that supports unified addressing, the CUDA runtime has to map 64GB of host RAM and 4 x 5120MB from your GPUs into a single virtual address space and register that with the Linux kernel.
There are a lot of kernel API calls required to do that, and it isn't fast. I would guess that is the main source of the slow performance you are observing. You should view this as a fixed start-up cost which must be amortised over the life of your application. In a real-world application, a 10-second start-up is trivial and of no real importance; in a hello-world example, it isn't.
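If you want to see this for yourself, one way (a minimal sketch; the file name and the use of std::chrono are my own additions and need nvcc's -std=c++11 flag) is to force the context creation explicitly with a call such as cudaFree(0) and time it separately from the kernel launch:

// startup_cost.cu -- illustrative sketch only: separate the CUDA context
// initialisation time from the time taken by the actual kernel.
#include <stdio.h>
#include <chrono>
#include <cuda_runtime.h>

__global__ void helloFromGPU()
{
    printf("Hello World from GPU!\n");
}

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                      // forces lazy context creation/initialisation
    auto t1 = std::chrono::steady_clock::now();

    helloFromGPU<<<1, 1>>>();
    cudaDeviceSynchronize();          // wait for the kernel (and its printf) to finish
    auto t2 = std::chrono::steady_clock::now();

    printf("context init: %.1f ms, kernel + sync: %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}

Compiled with something like $ nvcc -std=c++11 startup_cost.cu, the first number should dominate on a machine like yours.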

Related

CUDA GPU __global__ function does not complete

#include <stdio.h>

__global__ void functionA()
{
    printf("functionA");
}

int main()
{
    printf("main1");
    functionA<<<1,1>>>();
    printf("main2");
}
I'm trying to run a simple test with the above, but the program only outputs "main1". It should output "functionA" and "main2" too.
This can have a couple of causes:
First of all you need to add
cudaDeviceSynchronize();
after the kernel launch, in order to block the host until the device has completed all outstanding work.
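A minimal corrected version could look like the sketch below (the CHECK macro, the added newlines and the error checks are illustrative additions, not part of the original code):

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative helper: print the CUDA error and bail out of main.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            printf("CUDA error %d: %s\n", err, cudaGetErrorString(err));   \
            return 1;                                                      \
        }                                                                  \
    } while (0)

__global__ void functionA()
{
    printf("functionA\n");
}

int main()
{
    printf("main1\n");
    functionA<<<1, 1>>>();
    CHECK(cudaGetLastError());       // reports launch failures, e.g. a wrong -arch
    CHECK(cudaDeviceSynchronize());  // blocks until the kernel (and its printf) has finished
    printf("main2\n");
    return 0;
}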
Furthermore, this can happen if you specify the wrong GPU architecture/compute capability XX when compiling the code:
$ nvcc -gencode=arch=compute_XX,code=sm_XX -o my_app my_app.cu
In this case only the host code runs, while the parts on the accelerator appear to be silently skipped. You can find an overview of the corresponding number XX for the different hardware generations over here. The K20m you are running has compute capability 3.5, so it should be
$ nvcc -gencode=arch=compute_35,code=sm_35 -o my_app my_app.cu
in your case.
This might also occur if you have multiple accelerators in your system and the code is executed on the wrong one. Each graphics card/accelerator is assigned a particular device id; device 0 is supposed to be assigned automatically to the most powerful device and is used by default. On my system, which contains a powerful Tesla K80 (architecture 37) and a low-power Quadro P620 (architecture 60), I first compiled for 37 and got the same error you are seeing, while compiling for 60 made the code run. I then used the Querying Device Properties example to list the CUDA-capable devices and their corresponding device ids, only to find out that on my system the Tesla K80 is set as devices 1 and 2 while the simple Quadro P620 graphics card is set as device 0. I assume this is because the K80 is deprecated in CUDA 11.
You can select the device inside your code with cudaSetDevice or change it when launching the program with
$ CUDA_VISIBLE_DEVICES="1" ./my_app
where 1 has to be replaced by the device id you wish to use. Doing so should make your code run without any problems.
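If you are not sure which id belongs to which card, a small query program along the following lines (a sketch using the standard runtime calls) prints the mapping and lets you pick a device explicitly:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int id = 0; id < count; ++id) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, id);
        // prop.major/prop.minor form the compute capability, e.g. 3.5 for a K20m.
        printf("device %d: %s (compute capability %d.%d)\n",
               id, prop.name, prop.major, prop.minor);
    }
    // Alternative to CUDA_VISIBLE_DEVICES: select a device from inside the code.
    // cudaSetDevice(1);
    return 0;
}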
You can also test whether this really is the issue by cloning the GitHub repository of "Learn CUDA Programming", browsing to Chapter01/01_cuda_introduction/01_hello_world/, building with $ make and finally running $ ./hello_world. That Makefile compiles for multiple architectures/compute capabilities and should therefore run without any issue.

OpenCV based programs optimization on minimal linux embedded systems

I'm building my own Embedded Linux OS for Raspberry PI3 using Buildroot. This OS will be used to handle several applications, one of them performs objects detection based on OpenCV (v3.3.0).
I started with Raspbian Jessie + Python, but it turned out that it takes a lot of time to execute a simple example, so I decided to design my own RTOS with optimized features + C++ development instead of Python.
I thought that with these optimizations the 4 cores of the RPI + the 1GB of RAM would handle such applications. The problem is that even with these things, the simplest computer-vision programs take a lot of time.
PC vs. Raspberry PI3 Comparison
This is a simple program I wrote to have an idea of the order of magnitude of execution time of each part of the program.
#include <stdio.h>
#include <time.h> /* clock_t, clock, CLOCKS_PER_SEC */
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"

using namespace cv;
using namespace std;

int main()
{
    setUseOptimized(true);
    clock_t t_access, t_proc, t_save, t_total;

    // Access time.
    t_access = clock();
    Mat img0 = imread("img0.jpg", IMREAD_COLOR); // takes ~90ms
    t_access = clock() - t_access;

    // Processing time
    t_proc = clock();
    cvtColor(img0, img0, COLOR_BGR2GRAY);
    blur(img0, img0, Size(9,9)); // takes ~18ms
    t_proc = clock() - t_proc;

    // Saving time
    t_save = clock();
    imwrite("img1.jpg", img0);
    t_save = clock() - t_save;

    t_total = t_access + t_proc + t_save;

    //printf("CLOCKS_PER_SEC = %ld\n\n", (long)CLOCKS_PER_SEC);
    printf("(TEST 0) Total execution time\t %ld cycles \t= %f ms!\n", (long)t_total, ((float)t_total)*1000./CLOCKS_PER_SEC);
    printf("---->> Accessing in\t %ld cycles \t= %f ms.\n", (long)t_access, ((float)t_access)*1000./CLOCKS_PER_SEC);
    printf("---->> Processing in\t %ld cycles \t= %f ms.\n", (long)t_proc, ((float)t_proc)*1000./CLOCKS_PER_SEC);
    printf("---->> Saving in\t %ld cycles \t= %f ms.\n", (long)t_save, ((float)t_save)*1000./CLOCKS_PER_SEC);
    return 0;
}
Results of Execution on an i7 PC
Results of Execution on Raspberry PI (Generated OS from Buildroot)
As you can see, there is a huge difference. What I need is to optimize every single detail so that this example's processing step runs in "near" real time, within a maximum of 15 ms instead of the 44 ms. So these are my questions:
How can I optimize my OS so that it can handle computationally intensive applications, and how can I control the priorities of each part?
How can I fully use the 4 Cores of RPI3 to fulfill the requirements?
Are there any other possibilities instead of OpenCV?
Should I use C instead of C++?
Any hardware improvements you recommend?
Well, as I understand it, you want to get about 30-40 fps. In the case of your i7: it is fast and has a ton of acceleration techniques enabled by default by Intel. In the case of the Raspberry Pi: well, we love it, but it is slow, especially for image-processing programs.
How can I optimize my OS so that it can handle intensive calculations applications and how can control the priorities of each part?
You should include some acceleration libraries for ARM and recompile OpenCV with those features enabled.
How can I fully use the 4 Cores of RPI3 to fulfill the requirements?
Parallelise your code so it can run on all 4 cores (see the sketch at the end of this answer).
Is there any other possibilities instead of OpenCV?
Ask yourself first what features you actually need from OpenCV.
Should I use C instead of C++?
Changing the language will not help you at all; stick with C++. It is a beautiful language and very fast.
Any hardware improvements you recommend?
How about another board with a supported Mali GPU? Then you could run OpenCV code directly on the GPU, which would boost your speed a lot.
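Regarding the four cores, a concrete first check (a sketch, not tied to the original build) is whether your OpenCV build is allowed to use them at all, and to time with wall-clock time rather than clock(), since clock() adds up CPU time across all threads:

#include <cstdio>
#include <chrono>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    // How many worker threads OpenCV's own parallelised kernels may use.
    std::printf("OpenCV threads: %d\n", cv::getNumThreads());
    cv::setNumThreads(4);  // ask for all four RPi3 cores

    cv::Mat img = cv::imread("img0.jpg", cv::IMREAD_COLOR);
    if (img.empty()) return 1;

    // Wall-clock timing: with clock(), a 4-thread run can look slower
    // even though it actually finishes earlier.
    auto t0 = std::chrono::steady_clock::now();
    cv::cvtColor(img, img, cv::COLOR_BGR2GRAY);
    cv::blur(img, img, cv::Size(9, 9));
    auto t1 = std::chrono::steady_clock::now();

    std::printf("processing: %.2f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}

Whether cvtColor and blur actually run multithreaded then depends on the OpenCV build having a parallel backend (e.g. pthreads, TBB or OpenMP) enabled.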

pthread multithreading in Mac OS vs Windows multithreading

I've developed a multi-platform program (using the FLTK toolkit) and implemented multithreading to do intensive background tasks.
I have followed the FLTK tutorials/examples on multithreading, which involve using pthreads on Mac (i.e. the function pthread_create) and Windows threading on Windows (i.e. _beginthread).
What I have noticed is that the performance is much higher on Windows, i.e. these background threads execute 3 to 4 times faster.
Why might this be? Is it the threading libraries I'm using? Surely there shouldn't be such a difference? Or could it be the runtime libraries underneath it all?
Here are my machine details
Mac:
Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
16 GB DDR3 1600 MHz
Model MacBookPro9,1
OS: Mac OSX 10.8.5
Windows:
Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
16 GB DDR3 1600 MHz
Model: Dell Latitude E5530
OS: Windows 7 Service Pack 1
EDIT
To do a basic speed comparison I compiled the following on both machines and ran it from the command line:
#include <iostream>
#include <sstream>
#include <iomanip>
#include <ctime>
#include <cmath>

int main(int argc, char **argv)
{
    time_t t = time(NULL);
    tm* tt = localtime(&t);
    std::stringstream s;
    s << std::setfill('0') << std::setw(2) << tt->tm_mday << "/" << std::setw(2) << tt->tm_mon+1 << "/" << std::setw(4) << tt->tm_year+1900 << " " << std::setw(2) << tt->tm_hour << ":" << std::setw(2) << tt->tm_min << ":" << std::setw(2) << tt->tm_sec;
    std::cout << "1: " << s.str() << std::endl;

    double sum = 0;
    for (int i = 0; i < 100000000; i++) {
        double ii = i*0.123456789;
        sum = sum + sin(ii)*cos(ii);
    }

    t = time(NULL);
    tt = localtime(&t);
    s.str("");
    s << std::setfill('0') << std::setw(2) << tt->tm_mday << "/" << std::setw(2) << tt->tm_mon+1 << "/" << std::setw(4) << tt->tm_year+1900 << " " << std::setw(2) << tt->tm_hour << ":" << std::setw(2) << tt->tm_min << ":" << std::setw(2) << tt->tm_sec;
    std::cout << "2: " << s.str() << std::endl;
}
Windows takes less than a second; the Mac takes 4-5 seconds. Any ideas?
On the Mac I'm compiling with g++, and on Windows with Visual Studio 2013.
SECOND EDIT
if I change the line
std::cout<<"2: "<<s.str()<<std::endl;
to
std::cout<<"2: "<<s.str()<<" "<<sum<<std::endl;
Then all of a sudden Windows takes a little bit longer...
This makes me think that the whole thing might be down to compiler optimisation. So the question would be: is g++ (4.2 is the version I have) worse at optimisation, or do I need to provide additional flags?
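For a like-for-like comparison it is worth making sure optimisation is enabled on both sides; something along these lines (bench.cpp is an assumed file name):

On the Mac (g++):
$ g++ -O2 -o bench bench.cpp
On Windows (from a Visual Studio command prompt):
> cl /O2 /EHsc bench.cpp

If the Visual Studio build was a Release build it was already optimised, while an un-flagged g++ build is not; that alone could explain the gap, since with the result unused the optimiser is free to strip most of the loop.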
THIRD(!) AND FINAL EDIT
I can report that I achieve comparable performance by ensuring the g++ optimisation flag -O was provided at compile time. One of those annoying things that happens so often:
A: I'm tearing my hair out over problem x
B: Are you sure you're not doing y?
A: That works, why is this information not plastered all over the place and in every tutorial on problem x on the web?
B: Did you read the manual?
A: No, if I completely read the manual for every single bit of code/program I used I would never actually get round to doing anything...
Meh.

CUDA memory error

I run high-performance calculations on multiple GPUs (two GPUs per machine); currently I am testing my code on a GeForce GTX TITAN. Recently I noticed that random memory errors occur, so I can't rely on the outcome anymore. I tried to debug and ran into things I don't understand. I'd appreciate it if someone could help me understand why the following is happening.
So, here's my GPU:
$ nvidia-smi -a
Driver Version : 331.67
GPU 0000:03:00.0
Product Name : GeForce GTX TITAN
...
VBIOS Version : 80.10.2C.00.02
FB Memory Usage
Total : 6143 MiB
Used : 14 MiB
Free : 6129 MiB
Ecc Mode
Current : N/A
Pending : N/A
My Linux machine (Ubuntu 12.04 64-bit):
$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Here's my code (basically: allocate 4 GB of memory, fill it with zeros, copy it back to the host and check whether all values are zero; spoiler: they're not):
#include <cstdio>

#define check(e) {if (e != cudaSuccess) { \
    printf("%d: %s\n", e, cudaGetErrorString(e)); \
    return 1; }}

int main() {
    size_t num = 1024*1024*1024;       // 1 billion elements
    size_t size = num * sizeof(float); // 4 GB of memory
    float *dp;
    float *p = new float[num];
    cudaError_t e;
    e = cudaMalloc((void**)&dp, size);                    // allocate
    check(e);
    e = cudaMemset(dp, 0, size);                          // set to zero
    check(e);
    e = cudaMemcpy(p, dp, size, cudaMemcpyDeviceToHost);  // copy back
    check(e);
    for (size_t i = 0; i < num; i++) {
        if (p[i] != 0) // this should never happen, amiright?
            printf("%lu %f\n", i, p[i]);
    }
    return 0;
}
I run it like this
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$ nvcc test.cu
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
$ ./a.out | head
516836128 -0.000214
516836164 -0.841684
516836328 -3272.289062
516836428 -644673853950867887966360388719607808.000000
516836692 0.000005
516850472 232680927002624.000000
516850508 909806289566040064.000000
...
$ echo $?
0
This is not what I expected: many elements are non-zero. Here are a couple of observations:
I checked with cuda-memcheck - no errors. Checked with valgrind's memcheck - no errors.
the memory allocation works as expected, nvidia-smi reports 4179MiB / 6143MiB
the same happens if I
allocate less memory (e.g. 2 GB)
compile with -arch sm_30 or -arch compute_30 (see capabilities)
go from SDK version 6.0 back to 5.5
go from GTX Titan to Tesla K20c (here the ECC checking is enabled and all counters are zero); behavior is the same, I was able to test it on five different GPU cards.
allocate multiple smaller arrays on the device
the errors disappear if I test on a GTX 680
Again, the question is: why do I see those memory errors and how can I ensure that this never happens?
I also perform calculations on GPUs and have found the same issue; we are using a GeForce GTX 660 Ti.
I have checked that the number of errors increases with the length of time the GPU has been working.
The problem can be worked around by shutting the computer down completely (a plain reboot doesn't help), but after the GPU has been working for a while the problem starts again.
I have no idea why that happens. I have tried several codes to check the memory and all of them give the same result.
As far as I have checked, this problem cannot be avoided, and the only way to be sure that your results are OK is to check the memory after the calculations and to shut the machine down every so often. I know this is not a good solution, but it is the only one I have found.
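If you do end up relying on such checks, a minimal sketch of one (the buffer size and byte pattern are arbitrary choices of mine) is to fill a device buffer with a known pattern, read it back and count mismatches, repeating this between real computations:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 512UL * 1024 * 1024;  // 512 MB per pass, arbitrary
    const int pattern = 0xA5;

    unsigned char *d = 0;
    unsigned char *h = new unsigned char[bytes];

    if (cudaMalloc((void**)&d, bytes) != cudaSuccess) return 1;
    if (cudaMemset(d, pattern, bytes) != cudaSuccess) return 1;
    if (cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    size_t bad = 0;
    for (size_t i = 0; i < bytes; ++i)
        if (h[i] != pattern) ++bad;   // any mismatch points to flaky memory

    printf("%zu mismatching bytes\n", bad);

    cudaFree(d);
    delete[] h;
    return bad ? 1 : 0;
}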

file io with /dev/random takes too long

I want to write a program that generates really random numbers using /dev/random on Linux, but I find that its running time is occasionally quite unacceptable. A C version of it runs fast consistently.
#include <iostream>
#include <fstream>

using namespace std;

int main(int argc, char *argv[])
{
    ifstream random("/dev/random", ios_base::in);
    int t;
    random.read(reinterpret_cast<char*>(&t), sizeof(t));
    cout << t << endl;
    random.close();
    return 0;
}
Timing statistics for two runs:
$: time ./random
-1040810404
real 0m0.004s
user 0m0.000s
sys 0m0.000s
$: time ./random
-1298913761
real 0m4.119s
user 0m0.000s
sys 0m0.000s
You have likely drained the entropy pool. Creating (OK, harvesting) entropy relies on device drivers that sample mostly unpredictable qualities of the physical world. But if those devices are not very active, or if the entropy-producing algorithm stalls, your reads from /dev/random will stall too.
Can you use /dev/urandom? If not, you should look into ways that you can produce more entropy in a more deterministic fashion.
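If pseudo-random output is acceptable, the same program with /dev/urandom substituted (a sketch of that variant) will not block when the entropy estimate is low:

#include <iostream>
#include <fstream>

int main()
{
    // /dev/urandom keeps producing output from the kernel's CSPRNG and
    // never blocks, unlike /dev/random.
    std::ifstream random("/dev/urandom", std::ios_base::in | std::ios_base::binary);
    int t = 0;
    random.read(reinterpret_cast<char*>(&t), sizeof(t));
    std::cout << t << std::endl;
    return 0;
}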
Here are some suggestions from an article regarding a similar problem:
Involve an audio entropy daemon like AED to gather noise from your datacenter with an open microphone, maybe combine it with a webcam noise collector like VED. Other sources are talking about “Cryptographic Randomness from Air Turbulence in Disk devices“. :)
Use the Entropy Gathering Daemon to collect weaker entropy from randomness of userspace programs.