I'll warn you in advance that my written English is not good, so please be patient with me, because I'm going to make a lot of mistakes.
I need to expose the graphics card to OpenCL in order to run some benchmarks of parallel algorithms for finite element analysis. I downloaded the Intel SDK from this link: https://software.intel.com/en-us/intel-opencl .
I am using Ubuntu 16.10, so I followed all the instructions explained in this post: https://streamcomputing.eu/blog/2011-06-24/install-opencl-on-debianubuntu-orderly/ .
When I run a simple program which enumerates all the devices, it only recognizes the CPU and fails to find the graphics card. The same program works fine on a Mac (because OpenCL is part of the system stack there, of course).
// Includes needed to build this example (the OpenCL C++ wrapper plus standard headers).
#include <CL/cl.hpp>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, const char* argv[])
{
    // See what standard OpenCL sees
    std::vector<cl::Platform> platforms;
    // Get the available platforms
    cl::Platform::get(&platforms);
    // Temp string meant for the device version (never filled in, so it prints empty below)
    std::string s;
    // Where the GPU lies
    cl::Device gpudevice;
    // Found a GPU
    bool gpufound = false;
    std::cout << "**** OPENCL ****" << std::endl;
    // See if we have a GPU
    for (auto p : platforms)
    {
        std::vector<cl::Device> devices;
        p.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (auto d : devices)
        {
            // Query the device type bitfield
            cl_device_type i = 0;
            d.getInfo(CL_DEVICE_TYPE, &i);
            std::cout << "> Device type " <<
                (i & CL_DEVICE_TYPE_CPU ? "CPU" : "") <<
                (i & CL_DEVICE_TYPE_GPU ? "GPU" : "") <<
                (i & CL_DEVICE_TYPE_ACCELERATOR ? "ACCELERATOR" : "");
            if (i & CL_DEVICE_TYPE_GPU)
            {
                gpudevice = d;
                gpufound = true;
            }
            std::cout << " Version " << s << std::endl;
        }
    }
    if (!gpufound)
    {
        std::cout << "NO GPU FOUND. ABORTING." << std::endl;
        return 1;
    }
    // Do other things...
    return 0;
}
The output is:
/home/andrea/Dropbox/fem/SiNDy/clfem/cmake-build-debug/vector_sycl
**** OPENCL ****
> Device type CPU Version
NO GPU FOUND. ABORTING.
Process finished with exit code 1
I tried adding the current user to the video group, and I also tried to install Intel Media Server Studio following the instructions that come with the package, but I could not build the kernel because of some compile errors.
I also updated all the drivers with Ubuntu's automatic software updater, but the graphics card is still not found.
Maybe you want to try Beignet, which is an OpenCL implementation for Ivy Bridge and newer Intel iGPUs. There are Beignet packages for Ubuntu 16.10; to be more precise, I think you are looking for the packages beignet-dev and beignet-opencl-icd. Test it yourself, since I have no Ubuntu installation currently available. (Beignet itself works pretty well on my Intel HD Graphics 520 under Antergos/Arch Linux.)
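If you want a quick way to verify that the Beignet platform actually shows up after installing those packages, a minimal check like the sketch below should list it next to the CPU runtime (the platform name string is an assumption; Beignet usually reports something like "Intel Gen OCL Driver"):
#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (auto& p : platforms)
    {
        // Print every platform and the devices it exposes; after installing
        // beignet-opencl-icd a GPU device should appear under its platform.
        std::cout << "Platform: " << p.getInfo<CL_PLATFORM_NAME>() << std::endl;
        std::vector<cl::Device> devices;
        p.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (auto& d : devices)
        {
            std::cout << "  Device: " << d.getInfo<CL_DEVICE_NAME>() << std::endl;
        }
    }
    return 0;
}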
Related
We have a working library that uses LibTorch 1.5.0, built with CUDA 10.0, which runs as expected.
We are working on upgrading to CUDA 10.2 for various non-PyTorch-related reasons. We noticed that when we run LibTorch inference with the newly compiled LibTorch (compiled exactly the same, except for changing to CUDA 10.2), the runtime is about 20x slower.
We also checked it using the precompiled binaries.
This was tested on 3 different machines using 3 different GPUs (Tesla T4, GTX 980 & P1000), and all give a consistent ~20x slowdown on CUDA 10.2
(both on Windows 10 & Ubuntu 16.04), all with the latest drivers and on 3 different torch scripts (of the same architecture).
I've simplified the code to be extremely minimal, without external dependencies other than Torch:
// Includes and declarations assumed for this snippet (omitted in the original post):
#include <torch/script.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <memory>
#include <vector>

// Assumed alias from the original project.
using JitModule = torch::jit::Module;
// prepareInput is defined elsewhere in the original project (see the comment in infer below).
bool prepareInput(const uint8_t* inputData, int width, int height,
                  std::vector<torch::jit::IValue>& tensorInput, at::cuda::CUDAStream& stream);
bool infer(std::shared_ptr<JitModule>& jitModule, at::cuda::CUDAStream& stream,
           const uint8_t* inputData, int width, int height);

int main(int argc, char** argv)
{
// Initialize CUDA device 0
cudaSetDevice(0);
std::string networkPath = DEFAULT_TORCH_SCRIPT;
if (argc > 1)
{
networkPath = argv[1];
}
auto jitModule = std::make_shared<torch::jit::Module>(torch::jit::load(networkPath, torch::kCUDA));
if (jitModule == nullptr)
{
std::cerr << "Failed creating module" << std::endl;
return EXIT_FAILURE;
}
// Meaningless data, just something to pass to the module to run on
// PATCH_HEIGHT & WIDTH are defined as 256
uint8_t* data = new uint8_t[PATCH_HEIGHT * PATCH_WIDTH * 3];
memset(data, 0, PATCH_HEIGHT * PATCH_WIDTH * 3);
auto stream = at::cuda::getStreamFromPool(true, 0);
bool res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);
std::cout << "Warmed up" << std::endl;
res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);
delete[] data;
return 0;
}
// Inference function
bool infer(std::shared_ptr<JitModule>& jitModule, at::cuda::CUDAStream& stream, const uint8_t* inputData, int width, int height)
{
std::vector<torch::jit::IValue> tensorInput;
// This function simply uses cudaMemcpy to copy the data to the device and creates a torch::Tensor from it.
// I can paste it if it's relevant, but I left it out to keep this as clean as possible.
if (!prepareInput(inputData, width, height, tensorInput, stream))
{
return false;
}
// Reduce memory usage, without gradients
torch::NoGradGuard noGrad;
{
at::cuda::CUDAStreamGuard streamGuard(stream);
auto totalTimeStart = std::chrono::high_resolution_clock::now();
jitModule->forward(tensorInput);
// The synchronize here is just for timing's sake, not used in production
cudaStreamSynchronize(stream.stream());
auto totalTimeStop = std::chrono::high_resolution_clock::now();
printf("forward sync time = %.3f milliseconds\n",
std::chrono::duration<double, std::milli>(totalTimeStop - totalTimeStart).count());
}
return true;
}
When compiling this against Torch built with CUDA 10.0 we get a runtime of 18 ms, and when we run it against Torch built with CUDA 10.2 we get a runtime of 430 ms.
Any thoughts on that?
This issue was also posted on PyTorch Forums.
Issue on GitHub
UPDATE
I profiled this small program with both CUDA versions.
It turns out they use very different kernels.
96.5% of the compute time in the 10.2 run is spent in conv2d_grouped_direct_kernel, which takes ~60-100 ms on my P1000,
whereas the top kernels in the 10.0 run are:
47.1% - cudnn::detail::implicit_convolve_sgemm (~1.5 ms)
23.1% - maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt (~0.4 ms)
8.5% - maxwell_scudnn_128x32_relu_small_nn (~0.4ms)
So it's easy to see where the time difference comes from. Now the question is: why?
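One thing worth checking, as a sketch rather than something I have verified (it assumes the model's convolutions are dispatched to cuDNN at all): enabling cuDNN benchmark mode before the warm-up pass and re-profiling, to see whether cuDNN's algorithm selection is what differs between the two builds.
#include <ATen/Context.h>

// Sketch: turn on cuDNN autotuning so cuDNN benchmarks the available
// convolution algorithms for these input shapes and picks the fastest one.
// Call once, before the first (warm-up) call to infer().
void enableCudnnAutotune()
{
    at::globalContext().setBenchmarkCuDNN(true);
}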
I cloned https://github.com/codeplaysoftware/computecpp-sdk.git and modified the computecpp-sdk/samples/accessors/accessors.cpp file.
I just added std::cout << "SYCL exception caught: " << e.get_cl_code() << '\n';.
See the fully modified code:
/***************************************************************************
*
* Copyright (C) 2016 Codeplay Software Limited
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* For your convenience, a copy of the License has been included in this
* repository.
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Codeplay's ComputeCpp SDK
*
* accessor.cpp
*
* Description:
* Sample code that illustrates how to make data available on a device
* using accessors in SYCL.
*
**************************************************************************/
#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;
int main() {
/* We define the data to be passed to the device. */
int data = 5;
/* The scope we create here defines the lifetime of the buffer object, in SYCL
* the lifetime of the buffer object dictates synchronization using RAII. */
try {
/* We can also create a queue that uses the default selector in
* the queue's default constructor. */
queue myQueue;
/* We define a buffer in order to maintain data across the host and one or
* more devices. We construct this buffer with the address of the data
* defined above and a range specifying a single element. */
buffer<int, 1> buf(&data, range<1>(1));
myQueue.submit([&](handler& cgh) {
/* We define accessors for requiring access to a buffer on the host or on
* a device. Accessors are like pointers to data we can use in
* kernels to access the data. When constructing the accessor you must
* specify the access target and mode. SYCL also provides the
* get_access() as a buffer member function, which only requires an
* access mode - in this case access::mode::read_write.
* (make_access<>() has a second template argument which defaults
* to access::mode::global) */
auto ptr = buf.get_access<access::mode::read_write>(cgh);
cgh.single_task<class multiply>([=]() {
/* We use the subscript operator of the accessor constructed above to
* read the value, multiply it by itself and then write it back to the
* accessor again. */
ptr[0] = ptr[0] * ptr[0];
});
});
/* queue::wait() will block until kernel execution finishes,
* successfully or otherwise. */
myQueue.wait();
} catch (exception const& e) {
std::cout << "SYCL exception caught: " << e.what() << '\n';
std::cout << "SYCL exception caught: " << e.get_cl_code() << '\n';
return 2;
}
/* We check that the result is correct. */
if (data == 25) {
std::cout << "Hurray! 5 * 5 is " << data << '\n';
return 0;
} else {
std::cout << "Oops! Something went wrong... 5 * 5 is not " << data << "!\n";
return 1;
}
}
After building I executed the binary and got the following error output:
$ ./accessors
./accessors: /usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available (required by /usr/local/computecpp/lib/libComputeCpp.so)
./accessors: /usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available (required by /usr/local/computecpp/lib/libComputeCpp.so)
./accessors: /usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available (required by /usr/local/computecpp/lib/libComputeCpp.so)
SYCL exception caught: Error: [ComputeCpp:RT0101] Failed to create kernel ((Kernel Name: SYCL_class_multiply))
SYCL exception caught: -45
SYCL Runtime closed with the following errors:
SYCL objects are still alive while the runtime is shutting down
This probably indicates that a SYCL object was created but not properly destroyed.
terminate called without an active exception
Aborted (core dumped)
Hardware configuration is given below:
$ /usr/local/computecpp/bin/computecpp_info
/usr/local/computecpp/bin/computecpp_info: /usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available (required by /usr/local/computecpp/bin/computecpp_info)
/usr/local/computecpp/bin/computecpp_info: /usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available (required by /usr/local/computecpp/bin/computecpp_info)
********************************************************************************
ComputeCpp Info (CE 0.7.0)
********************************************************************************
Toolchain information:
GLIBC version: 2.19
GLIBCXX: 20150426
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 3 devices matching:
platform : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:
Device is supported : NO - Device does not support SPIR
CL_DEVICE_NAME : GeForce GTX 750 Ti
CL_DEVICE_VENDOR : NVIDIA Corporation
CL_DRIVER_VERSION : 384.111
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 1:
Device is supported : UNTESTED - Device not tested on this OS
CL_DEVICE_NAME : Intel(R) HD Graphics
CL_DEVICE_VENDOR : Intel(R) Corporation
CL_DRIVER_VERSION : r5.0.63503
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 2:
Device is supported : YES - Tested internally by Codeplay Software Ltd.
CL_DEVICE_NAME : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
CL_DEVICE_VENDOR : Intel(R) Corporation
CL_DRIVER_VERSION : 1.2.0.475
CL_DEVICE_TYPE : CL_DEVICE_TYPE_CPU
If you encounter problems when using any of these OpenCL devices, please consult this website for known issues: https://computecpp.codeplay.com/releases/v0.7.0/platform-support-notes
********************************************************************************
Please help me understand the error and how to solve it. Let me know if any more information is needed.
I would like to run this sample code on my NVidia GPU.
ComputeCpp, an implementation of the open standard SYCL, outputs SPIR instructions by default, but the NVIDIA OpenCL implementation is not able to consume SPIR instructions.
Instead you will need to use ComputeCpp to output PTX instructions that can be understood by the NVidia hardware.
To do this, add the argument "-DCOMPUTECPP_BITCODE=ptx64" when making your cmake call using the sample code project from GitHub.
The FindComputeCpp.cmake file in this project takes this flag and gives the compiler instructions to output PTX. If you would like to do this with your own project you can take the relevant section from the FindComputeCpp.cmake file.
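If you also want to guard against this at runtime, a custom device selector can restrict execution to devices that advertise the cl_khr_spir extension. This is only a sketch (the selector class and its scoring are my own, not part of the SDK sample):
#include <CL/sycl.hpp>

/* Hypothetical selector: only accept devices that advertise cl_khr_spir, so a
 * SPIR binary is never handed to a device (such as the NVIDIA GPU here) that
 * cannot consume it. A negative score rejects a device outright. */
class spir_capable_selector : public cl::sycl::device_selector {
 public:
  int operator()(const cl::sycl::device& dev) const override {
    if (!dev.has_extension("cl_khr_spir")) {
      return -1;  // never selected
    }
    return dev.is_gpu() ? 2 : 1;  // prefer GPUs among SPIR-capable devices
  }
};

/* Usage: cl::sycl::queue myQueue{spir_capable_selector{}}; */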
I'm trying to write an OpenCL wrapper in C++.
Yesterday I was working on my Windows 10 machine (NVIDIA GTX 970, latest NVIDIA GeForce drivers I believe) and my code worked flawlessly.
Today, I'm trying it out on my laptop (Arch Linux, AMD Radeon R7 M265, Mesa 17.3.3) and I get a segfault when trying to create a command queue.
Here's the GDB backtrace:
#0 0x00007f361119db80 in ?? () from /usr/lib/libMesaOpenCL.so.1
#1 0x00007f36125dacb1 in clCreateCommandQueueWithProperties () from /usr/lib/libOpenCL.so.1
#2 0x0000557b2877dfec in OpenCL::createCommandQueue (ctx=..., dev=..., outOfOrderExec=false, profiling=false) at /home/***/OpenCL/Util.cpp:296
#3 0x0000557b2876f0cf in main (argc=1, argv=0x7ffd04fcdac8) at /home/***/main.cpp:27
#4 0x00007f361194cf4a in __libc_start_main () from /usr/lib/libc.so.6
#5 0x0000557b2876ecfa in _start ()
(I've censored part of the paths)
Here's the code that's producing the error:
CommandQueue createCommandQueue(Context ctx, Device dev, bool outOfOrderExec, bool profiling) noexcept
{
cl_command_queue_properties props [3]= {CL_QUEUE_PROPERTIES, 0, 0};
if (outOfOrderExec)
{
props[1] |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
}
if (profiling)
{
props[1] |= CL_QUEUE_PROFILING_ENABLE;
}
int error = CL_SUCCESS;
cl_command_queue queue = clCreateCommandQueueWithProperties(ctx.get(), dev.get(), props, &error);
if (error != CL_SUCCESS)
{
std::cerr << "Error while creating command queue: " << OpenCL::getErrorString(error) << std::endl;
}
CommandQueue commQueue = CommandQueue(queue);
Session::get().registerQueue(commQueue);
return commQueue;
}
The line with clCreateCommandQueueWithProperties is where the segfault happens.
Context is a wrapper class for a cl_context; Context::get() returns the underlying cl_context:
class Context
{
private:
...
cl_context context;
public:
...
cl_context get() const noexcept;
...
};
Device is a wrapper for a cl_device_id; Device::get() returns the underlying cl_device_id:
class Device
{
private:
...
cl_device_type type;
cl_device_id id;
public:
...
cl_device_id get() const noexcept;
cl_device_type getType () const noexcept;
...
};
Here's the main function:
int main (int argc, char* argv [])
{
OpenCL::Session::get().init();
for (const std::string& deviceAddress : OpenCL::Session::get().getAddresses())
{
std::cout << "[" << deviceAddress << "]: " << OpenCL::Session::get().getDevice(deviceAddress);
}
OpenCL::Context ctx = OpenCL::getContext();
std::cout << "OpenCL version: " << ctx.getVersionString() << std::endl;
OpenCL::Kernel kernel = OpenCL::createKernel(OpenCL::createProgram("src/Kernels/Hello.cl", ctx), "SAXPY");
OpenCL::CommandQueue queue = OpenCL::createCommandQueue(ctx, OpenCL::Session::get().getDevice(ctx.getAssociatedDevices()[0]));
unsigned int testDataSize = 1 << 13;
std::vector <float> a = std::vector <float> (testDataSize);
std::vector <float> b = std::vector <float> (testDataSize);
for (int i = 0; i < testDataSize; i++)
{
a[i] = static_cast<float>(i);
b[i] = 0.0;
}
OpenCL::Buffer aBuffer = OpenCL::allocateBuffer(ctx, a.data(), sizeof(float), a.size());
OpenCL::Buffer bBuffer = OpenCL::allocateBuffer(ctx, b.data(), sizeof(float), b.size());
kernel.setArgument(0, aBuffer);
kernel.setArgument(1, bBuffer);
kernel.setArgument(2, 2.0f);
OpenCL::Event saxpy_event = queue.enqueue(kernel, {testDataSize});
OpenCL::Event read_event = queue.read(bBuffer, b.data(), bBuffer.size());
std::cout << "SAXPY kernel took " << saxpy_event.getRunTime() << "ns to complete." << std::endl;
std::cout << "Read took " << read_event.getRunTime() << "ns to complete." << std::endl;
OpenCL::Session::get().cleanup();
return 0;
}
(The profiling won't work because I've disabled it, thinking that was the cause of the problem; re-enabling profiling doesn't fix the issue, however.)
I'm using CLion, so here's a screenshot of my debugging window:
Finally here's the console output of the program:
/home/***/cmake-build-debug/Main
[gpu0:0]: AMD - AMD OLAND (DRM 2.50.0 / 4.14.15-1-ARCH, LLVM 5.0.1): 6 compute units @ 825MHz
OpenCL version: OpenCL 1.1 Mesa 17.3.3
Signal: SIGSEGV (Segmentation fault)
The context and device objects all seem to have been created without any issues so I really have no idea what's causing the segfault.
Is it possible that I've found a bug in the Mesa driver, or am I missing something obvious?
Edit: This person seems to have had a similar problem; unfortunately, his problem was just a C-style forgot-to-allocate-memory problem.
2nd Edit: I may have found a possible cause of this problem: CMake is finding, using and linking against OpenCL 2.0, while my GPU only supports OpenCL 1.1. I'll look into this.
I haven't found a way to roll back to OpenCL 1.1 on Arch Linux, but clinfo seems to be working fine and so is Blender (which depends on OpenCL), so I don't think this is the problem.
Here's the output from clinfo:
Number of platforms 1
Platform Name Clover
Platform Vendor Mesa
Platform Version OpenCL 1.1 Mesa 17.3.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd
Platform Extensions function suffix MESA
Platform Name Clover
Number of devices 1
Device Name AMD OLAND (DRM 2.50.0 / 4.14.15-1-ARCH, LLVM 5.0.1)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 17.3.3
Driver Version 17.3.3
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Available Yes
Device Profile FULL_PROFILE
Max compute units 6
Max clock frequency 825MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Compiler Available Yes
Preferred work group size multiple 64
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1503238553 (1.4GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support No
Local memory type Local
Local memory size 32768 (32KiB)
Max constant buffer size 1503238553 (1.4GiB)
Max number of constant args 16
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Profiling timer resolution 0ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_fp16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Clover
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [MESA]
clCreateContext(NULL, ...) [default] Success [MESA]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Clover
Device Name AMD OLAND (DRM 2.50.0 / 4.14.15-1-ARCH, LLVM 5.0.1)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Clover
Device Name AMD OLAND (DRM 2.50.0 / 4.14.15-1-ARCH, LLVM 5.0.1)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Clover
Device Name AMD OLAND (DRM 2.50.0 / 4.14.15-1-ARCH, LLVM 5.0.1)
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.12
ICD loader Profile OpenCL 2.2
3rd Edit: I've just run the code on my NVIDIA machine and it works without issue; this is what the console shows:
[gpu0:0]: NVIDIA Corporation - GeForce GTX 970: 13 compute units @ 1253MHz
OpenCL version: OpenCL 1.2 CUDA 9.1.75
SAXPY kernel took 2368149686ns to complete.
Read took 2368158390ns to complete.
I've also fixed the two things Andreas mentioned.
clCreateCommandQueueWithProperties was added in OpenCL 2.0. You should not use it with platforms and devices that are below version 2.0 (such as the 1.1 and 1.2 shown in your logs).
clCreateCommandQueue is what you should call on 1.x platforms; it was only deprecated as of OpenCL 2.0.
It takes the queue properties directly as a bitfield, so you can pass the out-of-order and profiling flags there (or 0 for none).
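For illustration only, here is a sketch of a version-checked helper (createCommandQueuePortable and its error handling are my own names, not a drop-in replacement for your wrapper) that falls back to the 1.x entry point on platforms like Clover:
// Target 2.0 so the clCreateCommandQueueWithProperties declaration is visible.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <cstdio>

// Hypothetical helper: query the device's OpenCL version and use
// clCreateCommandQueueWithProperties only on 2.0+ devices, falling back to
// clCreateCommandQueue on 1.1/1.2 implementations such as Mesa Clover.
static cl_command_queue createCommandQueuePortable(cl_context ctx,
                                                   cl_device_id dev,
                                                   cl_command_queue_properties props,
                                                   cl_int* err)
{
    char version[64] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_VERSION, sizeof(version), version, nullptr);
    // CL_DEVICE_VERSION looks like "OpenCL 1.1 Mesa 17.3.3"
    int major = 0, minor = 0;
    std::sscanf(version, "OpenCL %d.%d", &major, &minor);
    if (major >= 2)
    {
        cl_queue_properties qprops[] = {CL_QUEUE_PROPERTIES, props, 0};
        return clCreateCommandQueueWithProperties(ctx, dev, qprops, err);
    }
    // Deprecated as of 2.0, but still the correct call for 1.x platforms.
    return clCreateCommandQueue(ctx, dev, props, err);
}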
I have an Nvidia GTX 970M GPU and I am trying to run a face detection algorithm in C++ that runs on the GPU using OpenCL.
The function where this error occurs is :
ocl::OclCascadeClassifier::detectMultiScale()
The error I get is :
OpenCV Error: Assertion failed (localThreads[0] * localThreads[1] * localThreads[2] <= kernelWorkGroupSize) in cv::ocl::openCLVerifyKernel
I know that this problem is related to the GPU of the device but I do not know how to fix this. I have tried using OpenCV versions 2 and 3 but both give the same problem.
The problem was that it was trying to use the Intel HD Graphics GPU instead of the Nvidia GPU. I solved this by choosing the Nvidia GPU as the OpenCL Device.
The code I used was:
// Requires the OpenCV ocl module header.
#include <opencv2/ocl/ocl.hpp>
#include <iostream>

cv::ocl::DevicesInfo devInfo;
int res = cv::ocl::getOpenCLDevices(devInfo);
if (res == 0)
{
    std::cerr << "There is no OPENCL Here !" << std::endl;
}
else
{
    for (unsigned int i = 0; i < devInfo.size(); ++i)
    {
        std::cout << "Device : " << devInfo[i]->deviceName << " is present" << std::endl;
    }
}
// Index 1 happened to be the Nvidia GPU on my machine.
cv::ocl::setDevice(devInfo[1]);
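Note that devInfo[1] just happened to be the Nvidia GPU on my machine. If you would rather not rely on the index, a small sketch like this (the substring check and the helper name are my own) selects the device by name instead:
#include <opencv2/ocl/ocl.hpp>
#include <string>

// Sketch: select the first OpenCL device whose name mentions NVIDIA/GeForce,
// falling back to device 0 if no such device is found.
static void selectNvidiaDevice(const cv::ocl::DevicesInfo& devInfo)
{
    for (size_t i = 0; i < devInfo.size(); ++i)
    {
        const std::string& name = devInfo[i]->deviceName;
        if (name.find("NVIDIA") != std::string::npos ||
            name.find("GeForce") != std::string::npos)
        {
            cv::ocl::setDevice(devInfo[i]);
            return;
        }
    }
    if (!devInfo.empty())
        cv::ocl::setDevice(devInfo[0]);
}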
How do I specify the context (platform/device information) when using OpenCL calls in place of OpenCV calls with the OpenCL (ocl) module of OpenCV 2.4.8 in C++?
I could do it for OpenCV version 2.4.6, but I could not work it out for OpenCV version 2.4.8.
Here's what I did for ver. 2.4.6:
// Assumes #include <opencv2/ocl/ocl.hpp> plus using namespace cv and std.
std::vector<ocl::Info> oclinfo;
int ocld = ocl::getDevice(oclinfo);
cout << ocld;
for (size_t i = 0; i < oclinfo.size(); i++)
{
    cout << "OpenCL Device" << i << ":" << oclinfo[i].DeviceName[0] << endl;
}
ocl::setDevice(oclinfo[0], 0);
I've not used version 2.4.8, but for version 2.4.9 you can use this link.
You can also set the environment variable OPENCV_OPENCL_DEVICE to choose the default device.
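As an illustration (the platform and device values below are my own examples; use whatever your machine actually reports), OPENCV_OPENCL_DEVICE takes a "<Platform>:<CPU|GPU|ACCELERATOR>:<DeviceName or ID>" string and can also be set from code before OpenCV's ocl module is initialized:
#include <cstdlib>

int main()
{
    // Hypothetical example: select the first GPU of the "NVIDIA CUDA" platform.
#ifdef _WIN32
    _putenv_s("OPENCV_OPENCL_DEVICE", "NVIDIA CUDA:GPU:0");
#else
    setenv("OPENCV_OPENCL_DEVICE", "NVIDIA CUDA:GPU:0", 1);
#endif
    // ... call cv::ocl::getOpenCLDevices / run your OpenCV code afterwards ...
    return 0;
}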