Time required for OpenCL kernel deletion - C++

I'm encountering unexpected performance behaviour with my OpenCL code (more precisely, I use boost::compute 1.67.0). For now, I just want to add the elements of two buffers: c[i] = a[i] + b[i].
I noticed a slowdown compared to an existing SIMD implementation, so I isolated each step to highlight which one is time-consuming. Here is my code sample:
Chrono chrono2;
chrono2.start();

Chrono chrono;
ipReal64 elapsed;

// creating the OpenCL context and other stuff
// ...

std::string kernel_src = BOOST_COMPUTE_STRINGIZE_SOURCE(
    __kernel void add_knl(__global const uchar* in1, __global const uchar* in2, __global uchar* out)
    {
        size_t idx = get_global_id(0);
        out[idx] = in1[idx] + in2[idx];
    }
);

boost::compute::program* program = new boost::compute::program;
try {
    chrono.start();
    *program = boost::compute::program::create_with_source(kernel_src, context);
    elapsed = chrono.elapsed();
    std::cout << "Create program : " << elapsed << "s" << std::endl;

    chrono.start();
    program->build();
    elapsed = chrono.elapsed();
    std::cout << "Build program : " << elapsed << "s" << std::endl;
}
catch (boost::compute::opencl_error& e) {
    std::cout << "Error building program : " << std::endl << program->build_log() << std::endl << e.what() << std::endl;
    return;
}

boost::compute::kernel* kernel = new boost::compute::kernel;
try {
    chrono.start();
    *kernel = program->create_kernel("add_knl");
    elapsed = chrono.elapsed();
    std::cout << "Create kernel : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
    std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
    return;
}

try {
    chrono.start();
    // Pass the argument to the kernel
    kernel->set_arg(0, bufIn1);
    kernel->set_arg(1, bufIn2);
    kernel->set_arg(2, bufOut);
    elapsed = chrono.elapsed();
    std::cout << "Set args : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
    std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
    return;
}

try {
    chrono.start();
    queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
    elapsed = chrono.elapsed();
    std::cout << "Kernel calculation : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
    std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
    return;
}

std::cout << "[Function] Full duration " << chrono2.elapsed() << std::endl;

chrono.start();
delete program;
elapsed = chrono.elapsed();
std::cout << "Delete program : " << elapsed << "s" << std::endl;

delete kernel;
elapsed = chrono.elapsed();
std::cout << "Delete kernel : " << elapsed << "s" << std::endl;
And here is a sample of the results (I run my program on an NVIDIA GeForce GT 630, with the NVIDIA SDK Toolkit):
Create program : 0.0013123s
Build program : 0.0015421s
Create kernel : 6.6e-06s
Set args : 1.7e-06s
Kernel calculation : 0.0001639s
[Function] Full duration : 0.0077794
Delete program : 4.1e-06s
Delete kernel : 0.0879901s
I know my program is simple and I don't expect the kernel execution to be the most time-consuming step. However, I thought the kernel deletion would take only a few ms, like creating or building the program.
Is this normal behaviour?
Thanks

I'll point out that I've never used boost::compute, but it looks like it's a fairly thin wrapper over OpenCL, so the following should be correct:
Enqueueing the kernel does not wait for it to complete. The enqueue function returns an event, which you can then wait for, or you can wait for all tasks enqueued onto the queue to complete. You are timing neither of those things. What is likely happening is that when you destroy your kernel, it waits for all queued instances which are still pending to complete before returning from the destructor.
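If that is the case, a small adjustment to the timing code from the question should make it visible. This is only a sketch reusing the question's Chrono, queue and kernel variables (I haven't run it against boost::compute myself):

chrono.start();
// enqueue_1d_range_kernel returns an event; waiting on it (or calling
// queue.finish()) includes the kernel's actual execution in the measurement
boost::compute::event evt = queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
evt.wait();   // or: queue.finish();
elapsed = chrono.elapsed();
std::cout << "Kernel calculation (including completion) : " << elapsed << "s" << std::endl;

With the wait included, the "Delete kernel" step should drop back to microseconds, since the destructor no longer has pending work to wait for.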

Related

Is std::clock() broken on MSYS2's g++ compiler?

I'm trying to write a simple single-header benchmarker, and I understand that std::clock will give me the time that a process (thread) is actually in use.
So, given the following simplified program:
#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

int main() {
    using namespace std::literals::chrono_literals;
    auto start_cpu = std::clock();
    auto start_wall = std::chrono::high_resolution_clock::now();
    // clobber();
    std::this_thread::sleep_for(1s);
    // clobber();
    auto finish_cpu = std::clock();
    auto finish_wall = std::chrono::high_resolution_clock::now();
    std::cerr << "cpu: "
              << start_cpu << " " << finish_cpu << " "
              << (finish_cpu - start_cpu) / (double)CLOCKS_PER_SEC << " s" << std::endl;
    std::cerr << "wall: "
              // << FormatTime(start_wall) << " " << FormatTime(finish_wall) << " "
              << (finish_wall - start_wall) / 1.0s << " s" << std::endl;
    return 0;
}
Demo
We get the following output:
cpu: 4820 4839 1.9e-05 s
wall: 1.00007 s
I just want to clarify that the CPU time is the time spent executing my code, not the sleep_for call, since the sleeping is done by the kernel, which std::clock doesn't track. So to confirm, I changed what I was timing:
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main() {
    using namespace std::literals::chrono_literals;
    int value = 0;
    auto start_cpu = std::clock();
    auto start_wall = std::chrono::high_resolution_clock::now();
    // clobber();
    for (int i = 0; i < 1000000; ++i) {
        srand(value);
        value = rand();
    }
    // clobber();
    std::cout << "value = " << value << std::endl;
    auto finish_cpu = std::clock();
    auto finish_wall = std::chrono::high_resolution_clock::now();
    std::cerr << "cpu: "
              << start_cpu << " " << finish_cpu << " "
              << (finish_cpu - start_cpu) / (double)CLOCKS_PER_SEC << " s" << std::endl;
    std::cerr << "wall: "
              // << FormatTime(start_wall) << " " << FormatTime(finish_wall) << " "
              << (finish_wall - start_wall) / 1.0s << " s" << std::endl;
    return 0;
}
Demo
This gave me an output of:
cpu: 4949 1398224 1.39328 s
wall: 2.39141 s
value = 354531795
So far, so good. I then tried this on my Windows box running MSYS2's g++ compiler. The output for the last program gave me:
value = 0
cpu: 15 15 0 s
wall: 0.0080039 s
std::clock() is always outputting 15? Is the compiler implementation of std::clock() broken?
It seems that I assumed CLOCKS_PER_SEC would be the same everywhere. However, on the MSYS2 compiler it is 1000x less than on godbolt.org.
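A quick way to see this is to print the constant itself. This is just a sketch; the 1000000 vs 1000 values below are what glibc documents and what the measurements above suggest for MSYS2, not something verified on every runtime:

#include <ctime>
#include <iostream>

int main() {
    // glibc (the godbolt demo) uses 1000000, while the MSYS2/Windows CRT
    // reportedly uses 1000, so std::clock() there only advances in ~1 ms steps.
    std::cout << "CLOCKS_PER_SEC = " << CLOCKS_PER_SEC << '\n';
}

Dividing by CLOCKS_PER_SEC (as the code above already does) is correct; the surprise is only in the much coarser resolution.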

C++ - Closure template class behaves strangely when multithreading depending on closure types

I am trying to write my own C++ wrapper class for Linux using pthreads. The class 'Thread' is supposed to get a generic lambda to run in a different thread and abstract away the required pthread calls.
This works fine if the lambdas don't capture anything, however as soon as they capture some shared variables the behaviour seems to become undefined, depending on whether or not the template types of the two threads are the same. If both Thread objects capture the same types (int and int*) it seems to (probably accidentally) work correctly, however as soon as I pass (e.g. as in my example) an integer A, 'aka. stackInt', and an int pointer B, 'aka. heapInt', to Thread 1, and only the int pointer B to Thread 2, I get a segfault in Thread 2 while accessing int pointer B.
I know that it must have something to do with the fact that each thread gets its own copy of the stack segment, but I can't wrap my head around how that interferes with closures capturing variables by reference and calling them. Shouldn't the int pointer B's value point to the same address in each copy of it on the stack? How does the address get messed up? I really can't figure out what the exact issue is here.
Can anyone help me out here? Thank you in advance.
Here is the full example code:
class 'Thread'
// thread.h
#pragma once

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

// ******************************* //
//          THREAD CLASS           //
// ******************************* //
template <typename C>
class Thread
{
private:
    C &m_closure;
    pthread_t m_thread;

public:
    Thread<C>(C &&closure)
        : m_closure(closure),
          m_thread()
    {}

    void start()
    {
        pthread_create(&m_thread, NULL, &Thread::threadFunction, (void *)this);
    }

    void join()
    {
        pthread_join(m_thread, NULL);
    }

private:
    void callbackOnInstance()
    {
        m_closure();
    }

    static void * threadFunction(void *);
};

template <typename C>
void * Thread<C>::threadFunction(void *caller)
{
    Thread<C> *callerObject = (Thread<C> *)caller;
    callerObject->callbackOnInstance();
    return nullptr;
}
main() / testing
// main.cpp
// ******************************* //
//             TESTING             //
// ******************************* //
#define SLEEP_SEC(_sec) usleep((long)(1000 * 1000 * (_sec)))

#include "thread.h"
#include <iostream>
#include <string>

int main(int argc, char **argv)
{
    int stackInt = 0;
    int *heapInt = new int(0);

    // every second each thread increments them, 0.5 s apart from each other
    Thread thread1([&]()
    {
        while(true)
        {
            SLEEP_SEC(1);
            std::cout << "thread #1:" << std::endl;
            stackInt += 1;
            std::cout << "stack int: " << stackInt << " [" << &stackInt << "]" << std::endl;
            *heapInt += 1;
            std::cout << "heap int: " << *heapInt << " [" << heapInt << "]" << std::endl;
        }
    });
    thread1.start();

    Thread thread2([&]()
    {
        SLEEP_SEC(0.5);
        while(true)
        {
            SLEEP_SEC(1);
            std::cout << "thread #2:" << std::endl;
            // if stackInt doesn't get referenced ...
            //stackInt += 1;
            //std::cout << "stack int: " << stackInt << " [" << &stackInt << "]" << std::endl;
            // ... i get a segfault here
            *heapInt += 1;
            std::cout << "heap int: " << *heapInt << " [" << heapInt << "]" << std::endl;
        }
    });
    thread2.start();

    thread1.join();
    thread2.join();
}
You've got undefined behaviour, since the lambda objects are actually destroyed immediately after the constructor of Thread completes. To see this, instead of a lambda you could pass an object that prints a message in the destructor:
struct ThreadFunctor
{
    int& stackInt;
    int* heapInt;

    ThreadFunctor(int& si, int* hi)
        : stackInt(si),
          heapInt(hi)
    {
        std::cout << "ThreadFunctor created: " << this << '\n';
    }

    ~ThreadFunctor()
    {
        std::cout << "ThreadFunctor destroyed: " << this << '\n';
    }

    void operator()() const
    {
        using namespace std::chrono_literals;
        while (true)
        {
            std::this_thread::sleep_for(1s);
            std::cout << "thread #1:" << std::endl;
            stackInt += 1;
            std::cout << "stack int: " << stackInt << " [" << &stackInt << "]" << std::endl;
            *heapInt += 1;
            std::cout << "heap int: " << *heapInt << " [" << heapInt << "]" << std::endl;
        }
    }
};

Thread thread1(ThreadFunctor{stackInt, heapInt});
std::cout << "before start\n";
thread1.start();
Thread thread1(ThreadFunctor{stackInt, heapInt});
std::cout << "before start\n";
thread1.start();
The following output is guaranteed for every standard compliant C++ compiler (modulo addresses):
ThreadFunctor destroyed: 0000006E3ED6F6F0
before start
...
Furthermore, join only completes after the work on the background thread has finished, so because of the infinite loops your program won't terminate. You need some way of notifying the background threads to actually return instead of continuing forever.
Note that the standard library already contains the exact logic you're trying to implement here: std::thread, or std::jthread for an implementation with a built-in way of informing the background thread of a termination request.
#include <chrono>
#include <functional>
#include <iostream>
#include <stop_token>
#include <thread>

int main()
{
    using namespace std::chrono_literals;

    int stackInt = 0;
    int* heapInt = new int(0);

    // every second each thread increments them, 0.5 s apart from each other
    std::jthread thread1{ [=](std::stop_token stopToken, int& stackInt)
    {
        using namespace std::chrono_literals;
        while (!stopToken.stop_requested())
        {
            std::this_thread::sleep_for(1s);
            std::cout << "thread #1:" << std::endl;
            stackInt += 1;
            std::cout << "stack int: " << stackInt << " [" << &stackInt << "]" << std::endl;
            *heapInt += 1;
            std::cout << "heap int: " << *heapInt << " [" << heapInt << "]" << std::endl;
        }
    }, std::ref(stackInt) }; // thread started immediately

    std::this_thread::sleep_for(10s);
    thread1.request_stop();
    thread1.join();
}
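If you do want to keep the hand-rolled pthread wrapper, one way to avoid the dangling reference (a sketch of one option, not the only fix) is to store the closure by value, so the Thread object owns it for as long as the thread runs:

#include <utility>   // std::move

template <typename C>
class Thread
{
private:
    C m_closure;          // owned copy: stays alive as long as the Thread object does
    pthread_t m_thread;

public:
    explicit Thread(C &&closure)
        : m_closure(std::move(closure)),
          m_thread()
    {}

    // start(), join() and threadFunction() can stay exactly as in the question.
};

This fixes the lifetime of the closure itself; anything the lambda captures by reference (stackInt, heapInt) still has to outlive the thread, which in your example it does because main joins both threads.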

OpenCL Child Kernel Error

Aloha,
I'm struggling with OpenCL child kernel feature.
Kernel SRC (Minimal example):
kernel void launcher()
{
    ndrange_t ndrange = ndrange_1D(1);
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
        ^{
            size_t id = get_global_id(0);
        }
    );
}
stdafx.h:
#pragma once
#define __CL_ENABLE_EXCEPTIONS
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#include "targetver.h"
#include <CL/cl2.hpp>
#include <iostream>
#include <string>
Full SRC (Minimal):
#include "stdafx.h"
std::string kernel2_source(
"kernel void launcher() ""\n"
"{ ""\n"
" ndrange_t ndrange = ndrange_1D(1);""\n"
" enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,""\n"
" ^{""\n"
" size_t id = get_global_id(0);""\n"
" }""\n"
" );""\n"
"}""\n");
//Number of Input Elements
constexpr int numTriangles = 10;
cl_int errorcode = CL_BUILD_ERROR; //Has to be set to build error, because errorcode isn't set when exception occurs
//Move variable definitions out of main for test purposes;
//Numerous definitions
cl::Program program;
std::vector<cl::Device> devices;
std::vector<cl::Platform> platforms;
cl::CommandQueue queue;
cl::Program::Sources source{ kernel2_source };
int main() {
try {
// Query for platforms
cl::Platform::get(&platforms);
std::cout << "Num Platforms: " << platforms.size() << std::endl;
// Get a list of devices on this platform
platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
std::cout << "Using platform: " << platforms[0].getInfo<CL_PLATFORM_NAME>() << std::endl;
std::cout << "Num Devices: " << devices.size() << std::endl;
// Create a context for the devices
std::cout << "Using device: " << devices[0].getInfo<CL_DEVICE_NAME>() << std::endl;
//Create a context for the first device
//cl::Context context({ devices[0]});
cl::Context context({ devices[0] });
// Create a command−queue for the first device
queue = cl::CommandQueue(context, devices[0]);
cl::DeviceCommandQueue deviceQueue;
deviceQueue = cl::DeviceCommandQueue(context, devices[0]);
// Create the program from the source code
program = cl::Program(context, source);
std::cout << "Building Program" << std::endl;
// Build the program for the devices
errorcode = program.build("-cl-std=CL2.0 -g");
std::cout << "Success!" << std::endl;
cl::Kernel kernel = cl::Kernel(program, "launcher");
cl::NDRange global = numTriangles;
cl::NDRange local = 1;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);
std::cout << "finished" << std::endl;
std::cin.get();
}
catch (cl::Error error)
{
std::cout << "Error!" << std::endl;
std::cout << error.what() << "(" << error.err() << ")" << std::endl;
std::cout << "Errorcode: " << errorcode << std::endl;
if (errorcode != CL_SUCCESS) { //...
std::cout << "Build Status: " << program.getBuildInfo<CL_PROGRAM_BUILD_STATUS>(devices[0]) << std::endl;
//std::cout << "Build Status: " << program.getBuildInfo<CL_PROGRAM_BUILD_STATUS>(devices[1]) << std::endl;
std::cout << "Build Options:" << program.getBuildInfo<CL_PROGRAM_BUILD_OPTIONS>(devices[0]) << std::endl;
//std::cout << "Build Options:" << program.getBuildInfo<CL_PROGRAM_BUILD_OPTIONS>(devices[1]) << std::endl;
std::cout << "Build Log:" << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl;
//std::cout << "Build Log:" << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[1]) << std::endl;
}
}
std::cin.get();
return 0;
}
Output:
Num Platforms: 1
Using platform: AMD Accelerated Parallel Processing
Num Devices: 2
Using device: Hawaii
Building Program
=> Exception.
An uncaught exception appears, which is strange, because all build errors should be caught.
The ndrange_1D(1) is just for testing purposes (and to produce an acceptable amount of dummy output).
The device (AMD R9 390X) is OpenCL 2.0 capable.
Any ideas how to fix this?
EDIT:
Even when not using exceptions and relying on error codes instead, this still throws an exception!
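For reference, here is the extra diagnostic output I would add to double-check what the runtime actually reports for the device before building with -cl-std=CL2.0. This is only a sketch of getInfo queries, not a confirmed cause of the build failure:

// Sketch: print what the chosen device reports; device-side enqueue needs an
// OpenCL 2.x runtime and support for on-device queues.
std::cout << "Device version:   " << devices[0].getInfo<CL_DEVICE_VERSION>() << std::endl;
std::cout << "OpenCL C version: " << devices[0].getInfo<CL_DEVICE_OPENCL_C_VERSION>() << std::endl;
std::cout << "Max on-device queues: "
          << devices[0].getInfo<CL_DEVICE_MAX_ON_DEVICE_QUEUES>() << std::endl;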

Accessing member variable of a class containing a thread

//function creating my class and thread
extractor = new FeatureExtractor(receiveBufferCurrent);

HANDLE hth1;
unsigned uiThread1ID;
hth1 = (HANDLE)_beginthreadex(NULL,
                              0,
                              FeatureExtractor::ThreadStaticEntryPoint,
                              extractor,
                              CREATE_SUSPENDED,
                              &uiThread1ID);

//Header file
class FeatureExtractor
{
private:
    float sensorData[200][10];

public:
    FeatureExtractor(float receiveBufferCurrent[][10]);
    ~FeatureExtractor();

    //Thread for parallel input and motion detection
    static unsigned int __stdcall ThreadStaticEntryPoint(void * pThis);
    void ThreadEntryPoint();
    void outputTest();
};

FeatureExtractor::FeatureExtractor(float receiveBufferCurrent[][10])
{
    memcpy(sensorData, receiveBufferCurrent, sizeof(sensorData));
}

unsigned __stdcall FeatureExtractor::ThreadStaticEntryPoint(void * pThis)
{
    FeatureExtractor * pthX = (FeatureExtractor*)pThis;
    pthX->ThreadEntryPoint();
    return 1;
}

void FeatureExtractor::ThreadEntryPoint()
{
    outputTest();
}

//output function
for (int i = 0; i < 200; i = i + 50)
{
    std::cout << "-------------------------------------------------------------------------" << std::endl;
    std::cout << "AccelX=" << sensorData[i][1] << ", AccelY=" << sensorData[i][2] << ", AccelZ=" << sensorData[i][3] << std::endl;
    std::cout << std::endl;
    std::cout << "MagX=" << sensorData[i][4] << ", MagY=" << sensorData[i][5] << ", MagZ=" << sensorData[i][6] << std::endl;
    std::cout << std::endl;
    std::cout << "GyroX=" << sensorData[i][7] << ", GyroY=" << sensorData[i][8] << ", GyroZ=" << sensorData[i][9] << std::endl;
    std::cout << std::endl;
    std::cout << "-------------------------------------------------------------------------" << std::endl;
}
I have a problem accessing the float array "sensorData" from inside a thread.
If I output the sensorData array inside the constructor, everything is fine, but if I access the array from inside my thread, it just contains -1.58839967e+038, which I guess means that I cannot access my array in this way from a thread.
What am I doing wrong?
I got the thread code from a tutorial which accesses class member variables in the same way, although just integers, not multi-dimensional arrays.
I tried to minimize the length of my code snippets while keeping the important parts; I'm thankful to anybody taking the time to analyze my code!
I found the solution myself now.
WaitForSingleObject(hth1, INFINITE);
Once I waited for my thread, the issue was resolved.
The issue occurred because I deleted my class before the thread could finish execution.
It also worked to simply remove the delete statement.
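For completeness, a minimal sketch of the cleanup order this implies (hth1 and extractor are the names from my snippet above; the ResumeThread call is an assumption on my side, since the thread is created with CREATE_SUSPENDED and has to be resumed somewhere before it can run to completion):

// Sketch: make sure the worker thread has finished before destroying the
// object it operates on.
ResumeThread(hth1);                   // needed because of CREATE_SUSPENDED
WaitForSingleObject(hth1, INFINITE);  // block until ThreadStaticEntryPoint returns
CloseHandle(hth1);                    // release the thread handle
delete extractor;                     // now safe: no thread is touching sensorData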

How to create a multithread loop in C++?

I have an application that is run in a for loop:
// initialization
for (std::vector< VerifObj >::const_iterator itVOV = verifObjVector.begin(); itVOV != verifObjVector.end(); itVOV++)
{
    // run my application for itVOV
    std::cout << "\b\b\b\b" << std::setw(3) << static_cast< int >(100.f * ++photoCntr / verifObjVecSz) << "%"
              << std::flush;
    std::this_thread::sleep_for(std::chrono::milliseconds(30));
}
std::cout << "\b\b\b\b" << std::setw(3) << "100%" << std::endl;
Because each iteration takes several minutes, I thought I would make it multithreaded so it runs faster. I am a beginner in multithreading, so I am asking how to do it.
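One common approach, as a minimal sketch (assuming the iterations are independent of each other; processVerifObj is a hypothetical stand-in for "run my application for itVOV"):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct VerifObj { /* fields as in the question */ };

// Hypothetical per-item work; stands in for "run my application for itVOV".
void processVerifObj(const VerifObj& obj) { /* ... */ }

void runAll(const std::vector<VerifObj>& verifObjVector)
{
    const unsigned numThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    // Each worker handles every numThreads-th element (a simple static split).
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < verifObjVector.size(); i += numThreads)
                processVerifObj(verifObjVector[i]);
        });
    }

    for (auto& w : workers)
        w.join();   // wait for all workers before printing "100%"
}

Note that the progress output from the original loop would need its own handling (for example an atomic counter printed from one place), since interleaved writes to std::cout from several threads will garble the \b-based display.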