Understanding the usage of OpenCL in OpenCV (Mat/UMat objects) - C++

I ran the code below to check for the performance difference between GPU and CPU usage. I am calculating the average time for the cv::cvtColor() function. I make four function calls:
just_mat() (without using OpenCL, for a Mat object)
just_umat() (without using OpenCL, for a UMat object)
opencl_mat() (using OpenCL, for a Mat object)
opencl_umat() (using OpenCL, for a UMat object)
for both CPU and GPU.
I did not find a huge performance difference between GPU and CPU usage.
#include <iostream>
#include <vector>
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
using namespace std;

// just_mat(), just_umat(), opencl_mat() and opencl_umat() are the timing
// functions described below.
int main(int argc, char* argv[])
{
    string loc = argv[1];
    just_mat(loc);   // Calling function without OpenCL
    just_umat(loc);  // Calling function without OpenCL
    cv::ocl::Context context;
    std::vector<cv::ocl::PlatformInfo> platforms;
    cv::ocl::getPlatfomsInfo(platforms);
    for (size_t i = 0; i < platforms.size(); i++)
    {
        // Access the platform
        const cv::ocl::PlatformInfo* platform = &platforms[i];
        // Platform name
        std::cout << "Platform Name: " << platform->name().c_str() << "\n" << endl;
        // Access the devices within the platform
        cv::ocl::Device current_device;
        for (int j = 0; j < platform->deviceNumber(); j++)
        {
            // Access the device
            platform->getDevice(current_device, j);
            int deviceType = current_device.type();
            cout << "Device name: " << current_device.name() << endl;
            if (deviceType == 2)
                cout << context.ndevices() << " CPU devices are detected." << std::endl;
            if (deviceType == 4)
                cout << context.ndevices() << " GPU devices are detected." << std::endl;
            cout << "===============================================" << endl << endl;
            switch (deviceType)
            {
            case (1 << 1): // CL_DEVICE_TYPE_CPU
                cout << "CPU device\n";
                if (context.create(deviceType))
                    opencl_mat(loc);  // With OpenCL, Mat
                break;
            case (1 << 2): // CL_DEVICE_TYPE_GPU
                cout << "GPU device\n";
                if (context.create(deviceType))
                    opencl_umat(loc); // With OpenCL, UMat
                break;
            }
            cin.ignore(1);
        }
    }
    return 0;
}
int just_mat(string loc);    // I check the average time taken by cvtColor() without using OpenCL
int just_umat(string loc);   // I check the average time taken by cvtColor() without using OpenCL
int opencl_mat(string loc);  // ocl::setUseOpenCL(true); then check the time taken by cvtColor()
int opencl_umat(string loc); // ocl::setUseOpenCL(true); then check the time taken by cvtColor()
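The bodies of these four functions are not shown; a minimal sketch of what the UMat timing routine might look like (hedged: the iteration count, the upload step, and the printing below are illustrative, not the asker's actual code):

#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

// Hypothetical sketch of opencl_umat(): average cv::cvtColor() time over N runs on a UMat.
int opencl_umat(std::string loc)
{
    cv::ocl::setUseOpenCL(true);                    // request the OpenCL path
    cv::UMat src, gray;
    cv::imread(loc, cv::IMREAD_COLOR).copyTo(src);  // upload the image into a UMat
    const int N = 100;                              // illustrative iteration count
    double t0 = (double)cv::getTickCount();
    for (int i = 0; i < N; i++)
        cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);
    double ms = ((double)cv::getTickCount() - t0) / cv::getTickFrequency() * 1000.0 / N;
    std::cout << "average cvtColor time: " << ms << " ms" << std::endl;
    return 0;
}

The Mat variant would presumably be identical except that src and gray are cv::Mat, and the non-OpenCL variants would call cv::ocl::setUseOpenCL(false) first.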
The output (in milliseconds) for the above code is:

GPU name | With OpenCL Mat | With OpenCL UMat
---------|-----------------|------------------
Carrizo  | 7.69052         | 0.247069
Island   | 7.12455         | 0.233345

CPU      | With OpenCL Mat | With OpenCL UMat
---------|-----------------|------------------
AMD      | 6.76169         | 0.231103

CPU      | Without OpenCL Mat | Without OpenCL UMat
---------|--------------------|---------------------
AMD      | 7.15959            | 0.246138
In the code, using a Mat object always runs on the CPU and using a UMat object always runs on the GPU, irrespective of ocl::setUseOpenCL(true/false).
Can anybody explain the reason for the variation in all of these output times?
One more question: I didn't ship any OpenCL-specific .dll alongside the .exe, and yet the GPU was used without any error. While building OpenCV with CMake I checked WITH_OPENCL; did this build all the required OpenCL functionality into opencv_world310.dll?

In the code, using a Mat object always runs on the CPU and using a UMat object always runs on the GPU, irrespective of ocl::setUseOpenCL(true/false).
I'm sorry, because I'm not sure if this is a question or a statement... in either case it's partially true. In 3.0, for the UMat, if you don't have a dedicated GPU then OpenCV just runs everything on the CPU. If you specifically ask for a Mat, you get it on the CPU. And in your case you have directed both to run on each of your GPUs/CPU by selecting each specifically (more on "choosing a CPU" below)... read this:
A few design choices support the new architecture:

A unified abstraction, cv::UMat, that enables the same APIs to be implemented using CPU or OpenCL code, without a requirement to call the OpenCL-accelerated version explicitly. These functions use an OpenCL-enabled GPU if one exists in the system, and automatically switch to CPU operation otherwise.

The UMat abstraction enables functions to be called asynchronously. Unlike the cv::Mat of OpenCV version 2.x, access to the underlying data for the cv::UMat is performed through a class method, and not through its data member. Such an approach enables the implementation to explicitly wait for GPU completion only when CPU code absolutely needs the result.

The UMat implementation makes use of CPU-GPU shared physical memory available on Intel SoCs, including allocations that come from pointers passed into OpenCV.
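For illustration, the accessor-based access the quote refers to looks roughly like this (a sketch; the file name is a placeholder):

// Sketch: UMat data is reached through an accessor method rather than a raw
// data pointer, which lets OpenCV defer synchronization with the device.
cv::UMat u, blurred;
cv::imread("image.png", cv::IMREAD_COLOR).copyTo(u);  // placeholder path; uploads into a UMat
cv::GaussianBlur(u, blurred, cv::Size(5, 5), 1.5);    // may be dispatched through OpenCL
cv::Mat result = blurred.getMat(cv::ACCESS_READ);     // blocks only if the device result is still pending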
I think there also might be a misunderstanding about "using OpenCL". When you use a UMat, you are specifically trying to use the GPU. And, I'll plead some ignorance here: as a result I believe that CV is using some of the CL library to make that happen automatically... as an aside, in 2.x we had cv::ocl to do this specifically/manually, so be careful if you are using that 2.x legacy code in 3.x. There are reasons to do it, but they are not always straightforward. But, back on topic, when you say,
with OpenCL UMat
you are potentially being redundant. The CL code you have in your snippet is basically finding out what equipment is installed, how many devices there are, what their names are, and choosing which one to use... I'd have to dig through the way it is instantiated, but perhaps when you make it a UMat it automatically sets OpenCL to true? (link) That would definitely support the data you presented. You could probably test that idea by checking the state of ocl::useOpenCL() after you set it to false and then use a UMat.
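That check could look roughly like this (a sketch, assuming OpenCV 3.x and an already loaded image):

#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

// Sketch: does calling an OpenCV function on a UMat flip the OpenCL switch back on?
void check_opencl_state(const cv::UMat& src)
{
    cv::ocl::setUseOpenCL(false);
    std::cout << "before: " << std::boolalpha << cv::ocl::useOpenCL() << std::endl;
    cv::UMat gray;
    cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);   // operate on a UMat anyway
    std::cout << "after:  " << std::boolalpha << cv::ocl::useOpenCL() << std::endl;
}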
Finally, I'm guessing your CPU has a built-in GPU. So it is running parallel processing with OpenCL and not paying a time penalty to travel to the separate/dedicated GPU and back, hence your perceived performance increase over the GPUs (since it is not technically the CPU running it)... only when you are specifically using the Mat is the CPU alone being used.
As for your last question, I'm not sure... this is my speculation: the OpenCL architecture exists on the GPU, and when you install CV with CL you are installing the link between the two libraries and the associated header files. I'm not sure which .dll files you need to make that magic happen.

Related

Why is my thread execution jumping between CPU cores?

I recently started experimenting with std::thread and I tried running a small program that displays the webcam feed in a separate thread, using OpenCV. I am just doing this for "educational" purposes. What I noticed was that the thread seemed to keep jumping between cores, which struck me as odd since I thought the overhead of such a change would not be worth it from an efficiency/performance point of view. Does anybody know the root cause/reason for such behavior?
Short disclaimer: I am new to Stack Overflow, so if I missed something, please let me know.
A snapshot of my system monitor - Ubuntu
#include <stdio.h>
#include <opencv2/opencv.hpp> // OpenCV functionality
#include <time.h>             // timing functionality
#include <thread>

using namespace cv;
using namespace std;

void webcam_func()
{
    Mat image;
    namedWindow("Display window");
    VideoCapture cap(0);
    if (!cap.set(CAP_PROP_AUTO_EXPOSURE, 10)) {
        std::cout << "Exposure could not be set!" << std::endl;
        //return -1;
    }
    if (!cap.isOpened()) {
        cout << "cannot open camera";
    }
    int i = 0;
    while (i < 1000000) {
        cap >> image;
        Size s = image.size();
        int rows = s.height;
        int cols = s.width;
        imshow("Display window", image);
        double fps = cap.get(CAP_PROP_FPS);
        //cout << "Frames per second using video.get(CAP_PROP_FPS) : " << fps << endl;
        //cout << "The height of the video is " << rows << endl;
        //cout << "The width of the video is " << cols << endl;
        std::thread::id this_id = std::this_thread::get_id();
        std::cout << "thread id --> " << this_id << std::endl;
        waitKey(25);
        i++;
        std::cout << "Counter value " << i << std::endl;
    }
}

int main()
{
    std::thread t1(webcam_func);
    while (true) {
    }
    return 0;
}
The default Linux scheduler schedules tasks (e.g. threads) for a given quantum (time slice) on the available processing units (e.g. cores or hardware threads). This quantum can be interrupted if a task enters sleeping mode or waits for something (inputs, locks, etc.). waitKey(25) does exactly that: it causes your thread to wait for a short period of time. The thread's execution is interrupted and a context switch is done. The OS can execute other tasks during this time. When the computing thread is ready again (because more than 25 ms has elapsed), the scheduler can schedule it again. It tries to execute the task on the same processing unit so as to reduce overheads (e.g. cache misses), but the previous processing unit can still be in use by another thread when the computing task is scheduled back. This is unlikely to be the case when there are not many ready tasks, or only greedy ones, though.

Additionally, some processors support SMT (aka hyper-threading). For example, many x86-64 Intel processors support 2 hardware threads per core sharing the same caches. Context switches between 2 hardware threads lying on the same core are significantly cheaper (e.g. far fewer cache misses).

Also note that the Linux scheduler is not perfect, like most other schedulers. In fact, it was bogus a few years ago and not even able to fill all available cores when it was possible (see: The Linux Scheduler: a Decade of Wasted Cores).

Finally, note that the (direct) overhead of a context switch is no more than a few dozen microseconds on a mainstream Linux PC, so having one every few dozen milliseconds is fine (<1% overhead).
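If you really do want the thread to stay on one core, you can pin it explicitly. A Linux-only sketch using glibc's non-portable pthread_setaffinity_np (the core index and the empty lambda are placeholders, not part of the original program):

#include <pthread.h>   // pthread_setaffinity_np (glibc extension; g++ on Linux enables it by default)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <iostream>

int main()
{
    std::thread t([]{ /* webcam loop would go here */ });

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);   // restrict the thread to core 0 only
    int rc = pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0)
        std::cerr << "pthread_setaffinity_np failed: " << rc << std::endl;

    t.join();
    return 0;
}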

memory error of cv::split in KCF tracker

I am using the KCF tracker, which works well for some time but breaks when I re-create a new project doing something else. For example, I use the tracker to track an object while at the same time extracting some feature from the object, but that really does not affect the overall processing of the tracker, and the error occurs even when the feature-extraction code is commented out. The error occurs here:
cv::Mat real(cv::Mat img)
{
    std::vector<cv::Mat> planes(3);
    //std::cout << img.size() << std::endl;
    //std::cout << img.channels() << std::endl;
    //planes.resize(img.channels() + 1);
    cv::split(img, planes); // memory writing problem? Why?
    return planes[0];
}
KCFD.exe!FFTTools::real(cv::Mat img={...}) Line 135 C++
BTW, the program is built in debug mode of VS2013. OpenCV is 2.4.13.
I have also modified it to give a 3- or 2-element vector of planes in advance, but that does not change anything.

Why does tanh have different results in OpenCL and the C++ function?

Here is my OpenCL code:
#include <iostream>
#include <cmath>
#include <cstdio>
#include <vector>
#include <CL/cl.hpp>

int main()
{
    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);
    cl::Platform default_platform = all_platforms[0];

    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
    cl::Device default_device = all_devices[0];
    std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";

    cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(default_platform)(), 0 };
    cl::Context context = cl::Context(CL_DEVICE_TYPE_ALL, properties);

    cl::Program::Sources sources;
    std::string kernel_code =
        " void __kernel simple_tanh(__global const float *A, __global float *B){ "
        "     B[get_global_id(0)] = tanh(A[get_global_id(0)]);                   "
        " } ";
    sources.push_back({kernel_code.c_str(), kernel_code.length()});

    cl::Program program(context, sources);
    if (program.build({default_device}) != CL_SUCCESS) {
        std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
        exit(1);
    }

    cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(float));
    cl::Buffer buffer_B(context, CL_MEM_READ_WRITE, sizeof(float));

    float A[1];
    A[0] = 0.0595172755420207977294921875000000000000f;

    cl::CommandQueue queue(context, default_device);
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(float), A);
    queue.finish();

    cl::Kernel kernel = cl::Kernel(program, "simple_tanh");
    kernel.setArg(0, buffer_A);
    kernel.setArg(1, buffer_B);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1), cl::NullRange);
    queue.finish();

    float B[1];
    queue.enqueueReadBuffer(buffer_B, CL_TRUE, 0, sizeof(float), B);
    printf("result: %.40f %.40f\n", tanh(A[0]), B[0]);
    return 0;
}
After I compile with g++ -std=c++0x hello.cc -lOpenCL -o hello and run it, I get different results from the tanh function.
Using device: Tahiti
result: 0.0594470988394579374913817559900053311139 0.0594470985233783721923828125000000000000
The first is the CPU result, and the second comes from the OpenCL function. Which one should I trust?
When a kernel cannot be vectorized by the OpenCL compiler, the generated instructions may be of scalar type; the x87 FPU then computes at 80-bit precision. SSE has precision more comparable to the GPU; you need float4 or float8 in your kernel so that the compiler can produce SSE/AVX instructions, whose precision is closer to the GPU's.
Generally Intel's OpenCL compiler vectorizes better (for some old CPUs at least). Which implementation are you using? There can be differences even between GPUs, but they all obey the rule of not crossing the ULP limit. If you need more precision on the GPU (and with SSE/AVX), why not write your own series-expansion function? It would make learning very slow, but still faster than a single FPU at least.
What is your CPU? Which OpenCL platform are you using? Did you check the generated kernel code with some profiler software or a kernel analyzer?
Above all, you shouldn't do this:
cl::NDRange(1)
unless it's for learning purposes. This will be 99% kernel-launch overhead, 1% data-copy overhead and close to zero compute latency. Maybe that's why it's using the 80-bit FPU instead of SSE (on the CPU). Try computing with multiple-of-8 NDRange values, or use float8 types in the kernel to let the compiler use vectorized instructions, as in the sketch below.
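For example, a float8 variant of that kernel string could look like this (a sketch, not the asker's code; the kernel name is illustrative):

// Sketch: each work-item processes 8 values, so a CPU OpenCL compiler can map
// the vector type onto SSE/AVX registers. Host-side buffer sizes and the
// NDRange (N/8 work-items for N floats) must be adjusted to match.
std::string kernel_code =
    " void __kernel simple_tanh8(__global const float8 *A, __global float8 *B){ "
    "     B[get_global_id(0)] = tanh(A[get_global_id(0)]);                      "
    " } ";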
When the global NDRange value is in the millions, it will have a significant effect on learning time, not on the number of learning iterations needed. If the CPU can finish learning in 1 day with 1M iterations, maybe the GPU can finish it in 1 hour even if it needs 10M iterations. Transcendental functions have a high compute-to-data ratio, so the speed-up versus the CPU is higher the more of them you use.
If you derive your own series-expansion function to achieve more precision, it would still be much faster than a single CPU core for this embarrassingly parallel kernel code.
If the neural network has only a few neurons, then maybe you can train N networks at the same time and pick the best learner (if learning has any randomization)? That way it may even pick better results than the CPU.

OpenCL struct values correct on CPU but not on GPU

I have a struct in a file which is included by both the host code and the kernel:
typedef struct {
    float x, y, z,
          dir_x, dir_y, dir_z;
    int radius;
} WorklistStruct;
I'm building this struct in my C++ host code and passing it via a buffer to the OpenCL kernel.
If I choose a CPU device for computation, I get the following result:
printf ( "item:[%f,%f,%f][%f,%f,%f]%d,%d\n", item.x, item.y, item.z, item.dir_x, item.dir_y,
item.dir_z , item.radius ,sizeof(float));
Host:
item:[20.169043,7.000000,34.933712][0.000000,-3.000000,0.000000]1,4
Device (CPU):
item:[20.169043,7.000000,34.933712][0.000000,-3.000000,0.000000]1,4
And if I choose a GPU device (AMD) for computation weird things are happening:
Host:
item:[58.406261,57.786015,58.137501][2.000000,2.000000,2.000000]2,4
Device (GPU):
item:[58.406261,2.000000,0.000000][0.000000,0.000000,0.000000]0,0
Notably, the sizeof(float) is garbage on the GPU.
I assume there is a problem with the layouts of floats on different devices.
Note: the struct is contained in an array of structs of this type, and every struct in this array is garbage on the GPU.
Does anyone have an idea why this is the case and how I can prevent it?
EDIT: I added a %d at the end and passed a literal 1 for it; the result is 1065353216 (which happens to be the bit pattern of the float 1.0).
EDIT: here are the two structs which I'm using:
typedef struct {
    float x, y, z,                 // base coordinates
          dir_x, dir_y, dir_z;     // direction
    int radius;                    // radius
} WorklistStruct;

typedef struct {
    float base_x, base_y, base_z;  // base point
    float radius;                  // radius
    float dir_x, dir_y, dir_z;     // initial direction
} ReturnStruct;
I tested some other things; it looks like a problem with printf. The values seem to be right: I wrote them into the return struct, read them back, and those values were correct.
I don't want to post all of the related code; that would be a few hundred lines.
If no one has an idea, I will condense it a bit.
Ah, and for printing I'm using #pragma OPENCL EXTENSION cl_amd_printf : enable.
Edit: it really does look like a problem with printf. I simply don't use it anymore.
There is a simple method to check what happens:
1 - Create host-side data & initialize it:
int num_points = 128;
std::vector<WorklistStruct> works(num_points);
std::vector<ReturnStruct> returns(num_points);
for (WorklistStruct &work : works) {
    work = InitializeItSomehow();
    std::cout << work.x << " " << work.y << " " << work.z << std::endl;
    std::cout << work.radius << std::endl;
}
// Same stuff with returns
...
2 - Create device-side buffers using the COPY_HOST_PTR flag, map them & check data consistency:
cl::Buffer dev_works(..., COPY_HOST_PTR, (void*)&works[0]);
cl::Buffer dev_rets(..., COPY_HOST_PTR, (void*)&returns[0]);
// Then map it to check data
WorklistStruct *mapped_works = dev_works.Map(...);
ReturnStruct *mapped_rets = dev_rets.Map(...);
// Output values & unmap buffers
...
3 - Check data consistency on Device side as you did previously.
Also, make sure that the code (presumably a header) which is included by both the kernel and the host-side code is pure OpenCL C (the AMD compiler can sometimes "swallow" some errors), and that you've added the directory to the include search path when building the OpenCL kernel (the "-I" flag at the clBuildProgram stage).
Edited:
At every step, please collect return codes (or catch exceptions). Besides that, the "-Werror" flag at the clBuildProgram stage can also be helpful.
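For reference, those options could be passed like this with the C++ wrapper (a sketch; in the plain C API the same string is the options argument of clBuildProgram, and "kernels/" is an illustrative path):

// Sketch: add an include search path and treat warnings as errors at build time.
cl::Program program(context, sources);
cl_int err = program.build(devices, "-I kernels/ -Werror");
if (err != CL_SUCCESS) {
    std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl;
}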
It looks like I used the wrong OpenCL headers for compiling. If I try the code on the Intel platform (OpenCL 1.2) everything is fine, but on my AMD platform (OpenCL 1.1) I get weird values.
I will try other headers.

Boost.Compute slower than plain CPU?

I just started to play with Boost.Compute. To see how much speed it can bring us, I wrote a simple program:
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
#include <cmath>
#include <cstdlib>
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>

namespace compute = boost::compute;

int main()
{
    // generate random data on the host
    std::vector<float> host_vector(16000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    BOOST_FOREACH (auto const& platform, compute::system::platforms())
    {
        std::cout << "====================" << platform.name() << "====================\n";
        BOOST_FOREACH (auto const& device, platform.devices())
        {
            std::cout << "device: " << device.name() << std::endl;
            compute::context context(device);
            compute::command_queue queue(context, device);
            compute::vector<float> device_vector(host_vector.size(), context);

            // copy data from the host to the device
            compute::copy(
                host_vector.begin(), host_vector.end(), device_vector.begin(), queue
            );

            auto start = boost::chrono::high_resolution_clock::now();
            compute::transform(device_vector.begin(),
                               device_vector.end(),
                               device_vector.begin(),
                               compute::sqrt<float>(), queue);
            auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
            auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
            std::cout << "ans: " << ans << std::endl;
            std::cout << "time: " << duration.count() << " ms" << std::endl;
            std::cout << "-------------------\n";
        }
    }

    std::cout << "====================plain====================\n";
    auto start = boost::chrono::high_resolution_clock::now();
    std::transform(host_vector.begin(),
                   host_vector.end(),
                   host_vector.begin(),
                   [](float v){ return std::sqrt(v); });
    auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
    auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
    std::cout << "ans: " << ans << std::endl;
    std::cout << "time: " << duration.count() << " ms" << std::endl;
    return 0;
}
And here's the sample output on my machine (win7 64-bit):
====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms
My question is: why is the plain (non-opencl) version faster?
As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you're being limited by kernel compilation time and transfer time to the GPU).
To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first one as that will be far greater because it includes the time to compile and store the kernels).
Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm which performs both the transform and reduction with a single kernel. The code would look like this:
float ans = 0;
compute::transform_reduce(
    device_vector.begin(),
    device_vector.end(),
    &ans,
    compute::sqrt<float>(),
    compute::plus<float>(),
    queue
);
std::cout << "ans: " << ans << std::endl;
You can also compile code using Boost.Compute with -DBOOST_COMPUTE_USE_OFFLINE_CACHE, which will enable the offline kernel cache (this requires linking with boost_filesystem). Then the kernels you use will be stored on your file system and only be compiled the very first time you run your application (NVIDIA on Linux already does this by default).
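A build command for that might look like the following (hedged: "main.cpp" is a placeholder, and the exact Boost library names vary by system; boost_filesystem usually also needs boost_system, and this program additionally uses boost_chrono):

g++ -std=c++11 -DBOOST_COMPUTE_USE_OFFLINE_CACHE main.cpp -lOpenCL -lboost_chrono -lboost_system -lboost_filesystem -o main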
I can see one possible reason for the big difference. Compare the CPU and the GPU data flow:

CPU                 GPU
                    copy data to GPU
                    set up compute code
calculate sqrt      calculate sqrt
sum                 sum
                    copy data from GPU
Given this, it appears that the Intel chip is just a bit rubbish at general compute, while the NVidia one is probably suffering from the extra data copying and from setting up the GPU to do the calculation.
You should try the same program but with a much more complex operation - sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.
In your example, moving the lambda into the accumulate would be faster (one pass over memory vs. two passes).
You're getting bad results because you're measuring time incorrectly.
An OpenCL device has its own time counters, which aren't related to the host counters. Every OpenCL task has 4 states for which timers can be queried (from the Khronos web site):
CL_PROFILING_COMMAND_QUEUED, when the command identified by event is enqueued in a command-queue by the host
CL_PROFILING_COMMAND_SUBMIT, when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
CL_PROFILING_COMMAND_START, when the command identified by event starts execution on the device.
CL_PROFILING_COMMAND_END, when the command identified by event has finished execution on the device.
Take into account that these timers are device-side. So, to measure kernel and command-queue performance, you can query these timers. In your case, the last 2 timers are what you need.
In your sample code, you're measuring host time, which includes the data transfer time (as Skizz said) plus all the time spent on command-queue maintenance.
So, to learn the actual kernel performance, you need either to pass a cl_event to your kernel launch (no idea how to do that in boost::compute) and query that event for the performance counters, or make your kernel really huge and complicated to hide all the overheads.
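For reference, with the plain cl.hpp wrapper (outside Boost.Compute) the event-based, device-side timing looks roughly like this (a sketch; the kernel and sizes are illustrative, and the queue must be created with CL_QUEUE_PROFILING_ENABLE):

// Sketch: measuring device-side kernel time with OpenCL event profiling.
#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);

    cl::Context context(devices);
    // Profiling must be enabled on the queue for the timers to be valid.
    cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);

    std::string src =
        "__kernel void sq(__global float *a){ int i = get_global_id(0); a[i] = sqrt(a[i]); }";
    cl::Program::Sources sources;
    sources.push_back({src.c_str(), src.length()});
    cl::Program program(context, sources);
    program.build(devices);
    cl::Kernel kernel(program, "sq");

    const size_t n = 16000;                       // illustrative problem size
    std::vector<float> data(n, 2.0f);
    cl::Buffer buf(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                   n * sizeof(float), data.data());
    kernel.setArg(0, buf);

    cl::Event evt;
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange,
                               NULL, &evt);
    evt.wait();

    // Device-side timestamps, in nanoseconds.
    cl_ulong t_start = evt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong t_end   = evt.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    std::cout << "kernel time: " << (t_end - t_start) * 1e-6 << " ms\n";
    return 0;
}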