Halide with GPU (OpenGL) as Target - benchmarking and using HalideRuntimeOpenGL.h - c++

I am new to Halide. I have been playing around with the tutorials to get a feel for the language. Now, I am writing a small demo app to run from command line on OSX.
My goal is to perform a pixel-by-pixel operation on an image, schedule it on the GPU and measure the performance. I have tried a couple things which I want to share here and have a few questions about the next steps.
First approach:
I scheduled the algorithm on the GPU with the target set to OpenGL. Because I could not access the GPU memory to write it to a file, I copied the output back to the CPU inside the Halide routine by creating a Func cpu_out, similar to the glsl sample app in the Halide repo.
pixel_operation_cpu_out.cpp
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
const int _number_of_channels = 4;
int main(int argc, char** argv)
{
ImageParam input8(UInt(8), 3);
input8
.set_stride(0, _number_of_channels) // stride in dimension 0 (x) is the number of channels
.set_stride(2, 1); // stride in dimension 2 (c) is one
Var x("x"), y("y"), c("c");
// algorithm
Func input;
input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
clamp(y, input8.top(), input8.bottom()),
clamp(c, 0, _number_of_channels))) / 255.0f;
Func pixel_operation;
// calculate the corresponding value for input(x, y, c) after doing a
// pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
// This operation is not location-dependent, e.g. brighten
Func out;
out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
out.output_buffer().set_bounds(2, 0, _number_of_channels);
// schedule
out.compute_root();
out.reorder(c, x, y)
.bound(c, 0, _number_of_channels)
.unroll(c);
// Schedule for GLSL
out.glsl(x, y, c);
Target target = get_target_from_environment();
target.set_feature(Target::OpenGL);
std::vector<Argument> args = {input8};
// create a cpu_out Func to copy the data in Func out from the GPU to the CPU
Func cpu_out;
cpu_out(x, y, c) = out(x, y, c);
cpu_out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
cpu_out.output_buffer().set_bounds(2, 0, _number_of_channels);
cpu_out.compile_to_file("pixel_operation_cpu_out", args, target);
return 0;
}
Since I compile this ahead of time (AOT), I call the generated function from my main(), which resides in another file.
main_file.cpp
Note: the Image class used here is the same as the one in this Halide sample app
#include "pixel_operation_cpu_out.h" // header generated by the AOT compile above
int main()
{
char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);
Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
input.buf.host = &pixelsRGBA[0];
unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
output.buf.host = &outputPixelsRGBA[0];
double best = benchmark(100, 10, [&]() {
pixel_operation_cpu_out(&input.buf, &output.buf);
});
char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
}
This works just fine and gives me the output I expect. From what I understand, cpu_out makes the values in out available in CPU memory, which is why I am able to access them through output.buf.host in main_file.cpp.
Second approach:
The second thing I tried was to skip the device-to-host copy in the Halide schedule (no Func cpu_out) and instead call halide_copy_to_host from main_file.cpp.
pixel_operation_gpu_out.cpp
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
const int _number_of_channels = 4;
int main(int argc, char** argv)
{
ImageParam input8(UInt(8), 3);
input8
.set_stride(0, _number_of_channels) // stride in dimension 0 (x) is the number of channels
.set_stride(2, 1); // stride in dimension 2 (c) is one
Var x("x"), y("y"), c("c");
// algorithm
Func input;
input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
clamp(y, input8.top(), input8.bottom()),
clamp(c, 0, _number_of_channels))) / 255.0f;
Func pixel_operation;
// calculate the corresponding value for input(x, y, c) after doing a
// pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
// This operation is not location-dependent, e.g. brighten
Func out;
out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
out.output_buffer().set_bounds(2, 0, _number_of_channels);
// schedule
out.compute_root();
out.reorder(c, x, y)
.bound(c, 0, _number_of_channels)
.unroll(c);
// Schedule for GLSL
out.glsl(x, y, c);
Target target = get_target_from_environment();
target.set_feature(Target::OpenGL);
std::vector<Argument> args = {input8};
out.compile_to_file("pixel_operation_gpu_out", args, target);
return 0;
}
main_file.cpp
#include "pixel_operation_gpu_out.h"
#include "runtime/HalideRuntime.h"
int main()
{
char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);
Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
input.buf.host = &pixelsRGBA[0];
unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
output.buf.host = &outputPixelsRGBA[0];
double best = benchmark(100, 10, [&]() {
pixel_operation_gpu_out(&input.buf, &output.buf);
});
int status = halide_copy_to_host(NULL, &output.buf);
char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
return 0;
}
So, what I think is happening is that pixel_operation_gpu_out keeps output.buf on the GPU, and the data is only copied over to the CPU when I call halide_copy_to_host. This program gives me the expected output as well.
Questions:
The second approach is much slower than the first approach, though the slow part is not in the benchmarked section. For example, for the first approach I get 17 ms as the benchmarked time for a 4K image. For the same image, in the second approach the benchmarked time is 22 µs, but the halide_copy_to_host call takes 10 s. I'm not sure if this behavior is expected, since approaches 1 and 2 are essentially doing the same thing.
The next thing I tried was to use HalideRuntimeOpenGL.h and link textures to the input and output buffers so that I could draw directly to an OpenGL context from main_file.cpp instead of saving to a JPEG file. However, I could find no examples showing how to use the functions in HalideRuntimeOpenGL.h, and whatever I tried on my own always gave me runtime errors that I could not figure out how to solve. If anyone has any resources they can point me to, that would be great.
Also, any feedback on the code I have above is welcome too. I know it works and does what I want, but it could be the completely wrong way of doing it and I wouldn't know any better.

Most likely the reason the copy back takes 10 s is that the GPU API has queued all the kernel invocations and only waits for them to finish when halide_copy_to_host is called. You can call halide_device_sync inside the benchmark timing, after running all the compute calls, to get the compute time inside the loop without the copy-back time.
I cannot tell from this code how many times the kernel is being run. (My guess is 100, but it may be that those arguments to benchmark set up some sort of parameterization where it tries to run the pipeline as many times as needed to get statistical significance. If so, that is a problem, because the queuing call is really fast while the compute is of course asynchronous. In that case, you can do things like queue ten calls and then call halide_device_sync, and play with the number "10" to get a real picture of how long the compute takes.)
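For illustration, here is a minimal sketch of that suggestion applied to the second approach's main_file.cpp (same Image wrapper and buffers as above; halide_device_sync is declared in HalideRuntime.h, which that file already includes):
double best = benchmark(100, 10, [&]() {
    pixel_operation_gpu_out(&input.buf, &output.buf);
    halide_device_sync(NULL, &output.buf); // wait for the queued GPU work inside the timed loop
});
int status = halide_copy_to_host(NULL, &output.buf); // device-to-host copy stays outside the timing
This way the benchmarked number includes the GPU compute but not the copy back, which is what the question is trying to separate.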


Using halide with HDR images represented as float array

This is my first post here, so sorry if I do something wrong :). I will try to do my best.
I am currently working on my HDR image processing program, and I want to implement some basic TMOs (tone mapping operators) using Halide. The problem is that all my images are represented as float arrays (with an interleaved order like: b1, g1, r1, a1, b2, g2, r2, a2, ...). Processing an image with Halide requires the Halide::Image class, and I don't know how to pass this data into it.
Can anyone help, or has anyone had the same problem and knows the answer?
Edit:
Finally got it! I needed to set the strides on the input and output buffers in the generator. Thanks all for the help :-)
Edit:
I tried two different ways:
int halideOperations(float data[], int size, int width, int height)
{
buffer_t input_buf = { 0 };
input_buf.host = &data[0];
}
or:
int halideOperations(float data[], int size, int width, int height)
{
Halide::Image(Halide::Type::Float, x, y, 0, 0, data);
}
I was thinking about editing the Halide.h file and changing uint8_t * host to float_t * host, but I don't think that's a good idea.
Edit:
I tried using the code below with my float image (RGBA):
AOT function generation:
int main(int arg, char ** argv)
{
Halide::ImageParam img(Halide::type_of<float>(), 3);
Halide::Func f;
Halide::Var x, y, c;
f(x, y, c) = Halide::pow(img(x,y,c), 2.f);
std::vector<Halide::Argument> arguments = { img };
f.compile_to_file("function", arguments);
return 0;
}
The calling code:
int halideOperations(float data[], int size, int width, int height)
{
buffer_t output_buf = { 0 };
buffer_t buf = { 0 };
buf.host = (uint8_t *)data;
float * output = new float[width * height * 4];
output_buf.host = (uint8_t*)(output);
output_buf.extent[0] = buf.extent[0] = width;
output_buf.extent[1] = buf.extent[1] = height;
output_buf.extent[2] = buf.extent[2] = 4;
output_buf.stride[0] = buf.stride[0] = 4;
output_buf.stride[1] = buf.stride[1] = width * 4;
output_buf.elem_size = buf.elem_size = sizeof(float);
function(&buf, &output_buf);
delete[] output;
return 1;
}
Unfortunately I got a crash with the message:
Error: Constraint violated: f0.stride.0 (4) == 1 (1)
I think something is wrong with this line: output_buf.stride[0] = buf.stride[0] = 4, but I'm not sure what I should change. Any tips?
If you are using buffer_t directly, you must cast the pointer assigned to host to a uint8_t * :
buf.host = (uint8_t *)&data[0]; // Often, can be just "(uint8_t *)data"
This is what you want to do if you are using Ahead-Of-Time (AOT) compilation and the data is not being allocated as part of the code which directly calls Halide. (Other methods discussed below control the storage allocation so they cannot take a pointer that is passed to them.)
If you are using either Halide::Image or Halide::Tools::Image, then the type casting is handled internally. The constructor used above for Halide::Image doesn't exist, as Halide::Image is a template class where the underlying data type is a template parameter:
Halide::Image<float> image_storage(width, height, channels);
Note this will store the data in planar layout. Halide::Tools::Image is similar but has an option for interleaved layout. (Personally, I try not to use either of these outside of small test programs. There is a long-term plan to rationalize all of this, which will proceed after the arbitrary-dimension buffer_t branch is merged. Note also that Halide::Image requires linking libHalide.a, whereas Halide::Tools::Image does not and is header-only via including common/halide_image.h.)
There is also the Halide::Buffer class, which is a wrapper on buffer_t that is useful in Just-In-Time (JIT) compilation. It can reference passed-in storage and is not templated. However, my guess is that you want to use buffer_t directly and simply need the type cast to assign host. Also be sure to set the elem_size field of buffer_t to sizeof(float).
For an interleaved float buffer, you'll end up with something like:
buffer_t buf = {0};
buf.host = (uint8_t *)float_data; // Might also need const_cast
// If the buffer doesn't start at (0, 0), then assign mins
buf.extent[0] = width; // In elements, not bytes
buf.extent[1] = height; // In elements, not bytes
buf.extent[2] = 3; // Assuming RGB
// No need to assign additional extents as they were init'ed to zero above
buf.stride[0] = 3; // RGB interleaved
buf.stride[1] = width * 3; // Assuming no line padding
buf.stride[2] = 1; // Channel interleaved
buf.elem_size = sizeof(float);
You will also need to pay attention to the bounds in the Halide code itself. It is probably best to look at the set_stride and bound calls in tutorial/lesson_16_rgb_generate.cpp for information on that.
In addition to Zalman's answer above, you also have to specify the strides for the input and output when defining your Halide function, like below:
int main(int arg, char ** argv)
{
Halide::ImageParam img(Halide::type_of<float>(), 3);
Halide::Func f;
Halide::Var x, y, c;
f(x, y, c) = Halide::pow(img(x,y,c), 2.f);
// You need the following
f.output_buffer().set_stride(0, f.output_buffer().extent(2));
f.output_buffer().set_stride(1, f.output_buffer().extent(0) * f.output_buffer().extent(2));
img.set_stride(0, img.extent(2));
img.set_stride(1, img.extent(2) * img.extent(0));
// <- up to here
std::vector<Halide::Argument> arguments = { img };
f.compile_to_file("function", arguments);
return 0;
}
then your code should run.

C++: Simple CUDA volume reconstruction code crashing

I am currently working on a more comprehensive project involving CUDA. Over the last few days I have been encountering errors that I have been desperately trying to fix. I couldn't figure them out, so I made a minimal example that shows the same behaviour. I should say I am fairly new to CUDA. I am using Visual Studio 2015 and the CUDA Toolkit 7.5.
The program creates a 3D volume in GPU memory and then calculates values and writes them to the volume. I have tried to make the code as simple as possible.
First is the main.cpp file:
#include "cuda_test.h"
int main() {
size_t const xDimension = 500;
size_t const yDimension = 500;
size_t const zDimension = 1000;
//allocate volume part memory on gpu
cudaPitchedPtr volume = ct::cuda::create3dVolumeOnGPU(xDimension, yDimension, zDimension);
//start reconstruction
ct::cuda::startReconstruction(volume,
xDimension,
yDimension,
zDimension);
return 0;
}
Then the cuda_test.h that is the header file for the actual .cu file:
#ifndef CT_CUDA
#define CT_CUDA
#include <cstdlib>
#include <stdio.h>
#include <cmath>
//CUDA
#include <cuda_runtime.h>
namespace ct {
namespace cuda {
cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize);
void startReconstruction(cudaPitchedPtr volume,
size_t xSize,
size_t ySize,
size_t zSize);
}
}
#endif
And then the cuda_test.cu file that contains the actual function implementations:
#include "cuda_test.h"
namespace ct {
namespace cuda {
cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize) {
cudaExtent extent = make_cudaExtent(xSize * sizeof(float), ySize, zSize);
cudaPitchedPtr ptr;
cudaMalloc3D(&ptr, extent);
printf("malloc3D: %s\n", cudaGetErrorString(cudaGetLastError()));
cudaMemset3D(ptr, 0, extent);
printf("memset: %s\n", cudaGetErrorString(cudaGetLastError()));
return ptr;
}
__device__ void addToVolumeElement(cudaPitchedPtr volumePtr, size_t ySize, size_t xCoord, size_t yCoord, size_t zCoord, float value) {
char* devicePtr = (char*)(volumePtr.ptr);
//z * xSize * ySize + y * xSize + x
size_t pitch = volumePtr.pitch;
size_t slicePitch = pitch * ySize;
char* slice = devicePtr + zCoord*slicePitch;
float* row = (float*)(slice + yCoord * pitch);
row[xCoord] += value;
}
__global__ void reconstructionKernel(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) {
size_t xIndex = blockIdx.x;
size_t yIndex = blockIdx.y;
size_t zIndex = blockIdx.z;
if (xIndex == 0 && yIndex == 0 && zIndex == 0) {
printf("kernel start\n");
}
//just make sure we're inside the volume bounds
if (xIndex < xSize && yIndex < ySize && zIndex < zSize) {
//float value = z;
float value = sqrt(sqrt(sqrt(5.3))) * sqrt(sqrt(sqrt(1.2))) * sqrt(sqrt(sqrt(10.8))) + 501 * 0.125 * 0.786 / 5.3;
addToVolumeElement(volumePtr, ySize, xIndex, yIndex, zIndex, value);
}
if (xIndex == 0 && yIndex == 0 && zIndex == 0) {
printf("kernel end\n");
}
}
void startReconstruction(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) {
dim3 blocks(xSize, ySize, zSize);
reconstructionKernel <<< blocks, 1 >>>(volumePtr,
xSize,
ySize,
zSize);
printf("Kernel launch: %s\n", cudaGetErrorString(cudaGetLastError()));
cudaDeviceSynchronize();
printf("Device synchronise: %s\n", cudaGetErrorString(cudaGetLastError()));
}
}
}
The function create3dVolumeOnGPU allocates a 3-dimensional "volume" in GPU memory and returns a pointer to it. It is a host function. The second host function is startReconstruction. The only thing it does is launch the actual kernel with as many blocks as there are voxels in the volume. The kernel function is reconstructionKernel. It just calculates an arbitrary value from some constants and then calls addToVolumeElement (a device function) to write (add) the result to the corresponding voxel.
Now, the problem is that it crashes. If I launch it with the debugger (Nsight), Nsight breaks with the error message:
CUDA grid launch failed: CUcontext: 2358451327088 CUmodule: 2358541519888 Function: _ZN2ct4cuda20reconstructionKernelE14cudaPitchedPtryyy
The console outputs:
malloc3D: no error
memset: no error
kernel started
kernel end
If I launch in release mode the whole machine resets.
However, if I change the volume dimensions to be smaller it works, for example:
size_t const xDimension = 100;
size_t const yDimension = 100;
size_t const zDimension = 100;
However, the amount of free GPU memory should not be the problem (card has 4GB VRAM).
It would be nice if someone could have a look at it and maybe give me a tip what could cause the problem.
Now, the problem is that it crashes
It would be nice if someone could have a look at it and maybe give me a tip what could cause the problem.
I think it's likely you are running into a WDDM TDR issue. On Windows, any time a kernel running on a WDDM GPU takes more than about 2 seconds to execute, you may run into the WDDM TDR watchdog (assuming you haven't made any changes to the watchdog settings).
Furthermore, launching kernels like this:
reconstructionKernel <<< blocks, 1 >>>(...);
where the threads-per-block number is 1, means that only one thread in each warp (and in each block) is active. But the GPU likes to have 32 active threads per warp. So the net effect is inefficient utilization of the GPU's resources; perhaps as much as 97% of the GPU's horsepower sits idle when you run kernels this way (31 of the 32 lanes in each warp do nothing).
So if your code is flexible enough to allow this:
reconstructionKernel <<< blocks, 1 >>>(...);
or equivalently this:
reconstructionKernel <<< blocks/256, 256 >>>(...);
(this is just a representative example; I realize you have a multidimensional grid, and the above probably isn't exactly relevant for your case)
then the second invocation method will almost certainly be more efficient, leading to a shorter execution time for the same work.
So I believe when you tested your code with multiple threads per block, you did something like the above, and it reduced the execution time below the TDR limit.
That's a perfectly fine solution, but if you end up adding more work to your kernel (more total threads, or more work per thread) then you may run into the limit again. In that case, the linked article explains a possible work-around.
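For the 3D grid in the question, a hedged sketch of one possible launch configuration with more threads per block (the 8x8x4 block shape here is illustrative, not tuned) could look like:
// 8*8*4 = 256 threads per block; grid rounded up to cover the whole volume
dim3 block(8, 8, 4);
dim3 grid((xSize + block.x - 1) / block.x,
          (ySize + block.y - 1) / block.y,
          (zSize + block.z - 1) / block.z);
reconstructionKernel<<<grid, block>>>(volumePtr, xSize, ySize, zSize);
The kernel would then need to compute its indices as blockIdx.x * blockDim.x + threadIdx.x (and likewise for y and z); the existing bounds check already keeps the extra threads from writing outside the volume.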
As an aside, kernel launch configurations like this:
kernel<<<1, ?>>>(...);
or this:
kernel<<<?, 1>>>(...);
are never recommended for high performance code on the GPU.

CUDA Optimization

I implemented pincushion distortion correction using CUDA to support real-time processing - more than 40 fps for 3680*2456 image sequences.
But it takes 130 ms with CUDA on an NVIDIA GeForce GT 610 (2 GB DDR3),
while it takes only 60 ms with the CPU and OpenMP (Core i7 3.4 GHz, quad-core).
Please tell me what to do to speed it up.
Thanks.
The full source can be downloaded here:
https://drive.google.com/file/d/0B9SEJgsu0G6QX2FpMnRja0o5STA/view?usp=sharing
https://drive.google.com/file/d/0B9SEJgsu0G6QOGNPMmVQLWpSb2c/view?usp=sharing
The code is as follows.
__global__
void undistort(int N, float k, int width, int height, int depth, int pitch, float R, float L, unsigned char* in_bits, unsigned char* out_bits)
{
// Get the Index of the Array from GPU Grid/Block/Thread Index and Dimension.
int i, j;
i = blockIdx.y * blockDim.y + threadIdx.y;
j = blockIdx.x * blockDim.x + threadIdx.x;
// If Out of Array
if (i >= height || j >= width)
{
return;
}
// Calculating Undistortion Equation.
// On the CPU we used fast approximations of atan and sqrt - it made it 2x faster.
// On the GPU there is no need for the approximations, as it is faster without them.
int cx = width * 0.5;
int cy = height * 0.5;
int xt = j - cx;
int yt = i - cy;
float distance = sqrt((float)(xt*xt + yt*yt));
float r = distance*k / R;
float theta = 1;
if (r == 0)
theta = 1;
else
theta = atan(r)/r;
theta = theta*L;
float tx = theta*xt + cx;
float ty = theta*yt + cy;
// When we correct the frame, its size will be greater than the original,
// so we should crop it.
if (tx < 0)
tx = 0;
if (tx >= width)
tx = width - 1;
if (ty < 0)
ty = 0;
if (ty >= height)
ty = height - 1;
// Output the Result.
int ux = (int)(tx);
int uy = (int)(ty);
tx = tx - ux;
ty = ty - uy;
unsigned char *p = (unsigned char*)out_bits + i*pitch + j*depth;
unsigned char *q00 = (unsigned char*)in_bits + uy*pitch + ux*depth;
unsigned char *q01 = q00 + depth;
unsigned char *q10 = q00 + pitch;
unsigned char *q11 = q10 + depth;
unsigned char newVal[4] = {0};
for (int k = 0; k < depth; k++)
{
newVal[k] = (q00[k]*(1-tx)*(1-ty) + q01[k]*tx*(1-ty) + q10[k]*(1-tx)*ty + q11[k]*tx*ty);
memcpy(p + k, &newVal[k], 1);
}
}
void wideframe_correction(char* bits, int width, int height, int depth)
{
// Find the device.
// Initialize the nVIDIA Device.
cudaSetDevice(0);
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 0);
// This works for Calculating GPU Time.
cudaProfilerStart();
// This works for Measuring Total Time
long int dwTime = clock();
// Setting Distortion Parameters
// Note that multiplying by 0.5 is faster than dividing by 2.
int cx = (int)(width * 0.5);
int cy = (int)(height * 0.5);
float k = -0.73f;
float R = sqrt((float)(cx*cx + cy*cy));
// Set the Radius of the Result.
float L = (float)(width<height ? width:height);
L = L/2.0f;
L = L/R;
L = L*L*L*0.3333f;
L = 1.0f/(1-L);
// Create the GPU Memory Pointers.
unsigned char* d_img_in = NULL;
unsigned char* d_img_out = NULL;
// Allocate the GPU Memory2D with pitch for fast performance.
size_t pitch;
cudaMallocPitch( (void**) &d_img_in, &pitch, width*depth, height );
cudaMallocPitch( (void**) &d_img_out, &pitch, width*depth, height );
_tprintf(_T("\nPitch : %d\n"), pitch);
// Copy RAM data to VRAM.
cudaMemcpy2D( d_img_in, pitch,
bits, width*depth, width*depth, height,
cudaMemcpyHostToDevice );
cudaMemcpy2D( d_img_out, pitch,
bits, width*depth, width*depth, height,
cudaMemcpyHostToDevice );
// Create Variables for Timing
cudaEvent_t startEvent, stopEvent;
cudaError_t err = cudaEventCreate(&startEvent, 0);
assert( err == cudaSuccess );
err = cudaEventCreate(&stopEvent, 0);
assert( err == cudaSuccess );
// Execution of the version using global memory
float elapsedTime;
cudaEventRecord(startEvent);
// Process image
dim3 dGrid(width / BLOCK_WIDTH + 1, height / BLOCK_HEIGHT + 1);
dim3 dBlock(BLOCK_WIDTH, BLOCK_HEIGHT);
undistort<<< dGrid, dBlock >>> (width*height, k, width, height, depth, pitch, R, L, d_img_in, d_img_out);
cudaThreadSynchronize();
cudaEventRecord(stopEvent);
cudaEventSynchronize( stopEvent );
// Estimate the GPU Time.
cudaEventElapsedTime( &elapsedTime, startEvent, stopEvent);
// Calculate the Total Time.
dwTime = clock() - dwTime;
// Save Image data from VRAM to RAM
cudaMemcpy2D( bits, width*depth,
d_img_out, pitch, width*depth, height,
cudaMemcpyDeviceToHost );
_tprintf(_T("GPU Processing Time(ms) : %d\n"), (int)elapsedTime);
_tprintf(_T("VRAM Memory Read/Write Time(ms) : %d\n"), dwTime - (int)elapsedTime);
_tprintf(_T("Total Time(ms) : %d\n"), dwTime );
// Free GPU Memory
cudaFree(d_img_in);
cudaFree(d_img_out);
cudaProfilerStop();
cudaDeviceReset();
}
I've not read the source code, but there are some things you can't get around.
Your GPU has nearly the same performance as your CPU. Adapt the following figures to your real GPU/CPU model:
Specification | GPU        | CPU
--------------|------------|----------
Bandwidth     | 14.4 GB/s  | 25.6 GB/s
Flops         | 155 (FMA)  | 135
We can conclude that for memory-bound kernels your GPU will never be faster than your CPU.
GPU information found here:
http://www.nvidia.fr/object/geforce-gt-610-fr.html#pdpContent=2
CPU information found here: http://ark.intel.com/products/75123/Intel-Core-i7-4770K-Processor-8M-Cache-up-to-3_90-GHz?q=Intel%20Core%20i7%204770K
and here: http://www.ocaholic.ch/modules/smartsection/item.php?page=6&itemid=1005
One does not simply optimize code just by looking at the source. First of all, you should use the NVIDIA Visual Profiler (https://developer.nvidia.com/nvidia-visual-profiler) and see which part of your GPU code is taking too much time. You might wish to write a unit test first, however, just to be sure that only the part of your project under investigation is tested.
Additionally, you can use Callgrind (http://valgrind.org/docs/manual/cl-manual.html) to test your CPU code's performance.
In general, it is not very surprising that your GPU-"optimized" code is slower than the "unoptimized" one. CUDA cores are usually several times slower than CPU cores, and you have to introduce a lot of parallelism to see a significant speed-up.
EDIT, response to your comment:
As a unit testing framework I strongly recommend GoogleTest. Here you can learn how to use it. Apart from its obvious functionality (code testing), it allows you to run only specific methods from your class interfaces for performance analysis.
In general, the NVIDIA profiler is just a tool that runs your code and tells you how much time each of your kernels consumes. Please look at their documentation.
By "a lot of parallelism" I meant: on your processor you can run 8 threads at 3.4 GHz, while your GPU has one SM (streaming multiprocessor) with an 810 MHz clock and, let's say, 1024 threads per SM (I do not have the exact data, but you can run the deviceQuery NVIDIA sample to get the exact parameters). Therefore, if your GPU code can only run (3.4*8)/0.81 = 33 computations in parallel, you will achieve exactly nothing: the execution time of your CPU and GPU code will be the same (neglecting GPU memory copies, which are expensive). Conclusion: your GPU code should be able to compute at least ~40 operations in parallel to give any speed-up. On the other hand, suppose you are able to fully use your GPU's potential and keep all 1024 threads per SM busy all the time. In that case your code will run only (0.81*1024)/(8*3.4) = 30 times faster (approximately, and still neglecting the memory copies), which is impossible in most cases, because usually you are not able to parallelize your serial code with such efficiency.
Wish you good luck with your research!
Yes, put nvprof to good use, it's a great tool.
What I could see from your code:
1. Consider using linear thread blocks instead of flat blocks; it could save some integer operations.
2. Manual correction of image borders and/or thread indices leads to massive divergence and/or hurts coalescing. Consider using texture fetches and/or pre-padding the data.
3. memcpy'ing a single value from inside the kernel is generally a bad idea (see the sketch below).
4. Try to minimize type conversions.
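Regarding point 3, here is a hedged sketch of the kernel's inner loop with the single-byte memcpy replaced by a plain store (same interpolation math as the original; the loop variable is renamed to ch so it no longer shadows the float k parameter):
// write the interpolated value directly instead of memcpy'ing one byte at a time
for (int ch = 0; ch < depth; ch++)
{
    p[ch] = (unsigned char)(q00[ch]*(1-tx)*(1-ty) + q01[ch]*tx*(1-ty)
                          + q10[ch]*(1-tx)*ty    + q11[ch]*tx*ty);
}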

JOGL glTexSubImage2D eats up to 100% cpu and takes ages

I've run into an issue where a single gl.glTexSubImage2D() call takes 0.1-0.2 s and eats 100% of the CPU when running on Linux. On Mac it is all fine.
The call arguments are the following:
gl.glTexSubImage2D(GL.GL_TEXTURE_2D, 0, 0, 0, 1920, 1080, GL2.GL_RED, GL2.GL_UNSIGNED_SHORT, data);
The texture setup is the following:
void glCreateClearTex(GL gl, int target, int fmt, int format, int type, int filter, int w, int h, int val) {
float fval = 0;
int stride;
if (w == 0)
w = 1;
if (h == 0)
h = 1;
stride = 2/*2048*/ * 2;
ByteBuffer init = ByteBuffer.allocateDirect(stride * h/*2048*/);
glAdjustAlignment(gl, stride);
gl.glPixelStorei(GL2.GL_UNPACK_ROW_LENGTH, w);
gl.glTexImage2D(target, 0, fmt, w, h, 0, format, type, init);
gl.glTexParameterf(target, GL2.GL_TEXTURE_PRIORITY, 1.0f);
gl.glTexParameteri(target, GL2.GL_TEXTURE_MIN_FILTER, GL2.GL_LINEAR);
gl.glTexParameteri(target, GL2.GL_TEXTURE_MAG_FILTER, GL2.GL_LINEAR);
gl.glTexParameteri(target, GL2.GL_TEXTURE_WRAP_S, GL2.GL_CLAMP_TO_EDGE);
gl.glTexParameteri(target, GL2.GL_TEXTURE_WRAP_T, GL2.GL_CLAMP_TO_EDGE);
gl.glTexParameterfv(target, GL2.GL_TEXTURE_BORDER_COLOR, FloatBuffer.wrap(new float[] { fval, fval, fval, fval }));
}
MPlayer doing the same work natively runs just fine. glxgears runs OK but also takes 100% CPU. This may be a sign of OpenGL setup issues, but glxinfo and other tools report that it is hardware rendering. The graphics card is an ATI FirePro.
I found the issue. JOGL has two variants of gl.glTexSubImage2D(): one takes a data pointer and uploads it to a PBO and later to the GPU, the other takes an offset inside an already-prepared PBO. My mistake was that I uploaded the data twice, and this somehow caused a major slowdown on Linux.
So the fix is to upload the data to the PBO and then upload it to the GPU with gl.glTexSubImage2D() using the offset inside the PBO.

glReadPixels store x, y values

I'm trying to store pixel data using glReadPixels, but so far I've only managed to store it one pixel at a time. I'm not sure if this is the way to go. I currently have this:
unsigned char pixels[3];
glReadPixels(50,50, 1, 1, GL_RGB, GL_UNSIGNED_BYTE, pixels);
What would be a good way to store it in an array, so that I can get the values like this:
pixels[20][50][0]; // x=20 y=50 -> R value
pixels[20][50][1]; // x=20 y=50 -> G value
pixels[20][50][2]; // x=20 y=50 -> B value
I guess I could simply put it in a loop:
for ( all pixels on Y axis )
{
for ( all pixels in X axis )
{
unsigned char pixels[width][height][3];
glReadPixels(x,y, 1, 1, GL_RGB, GL_UNSIGNED_BYTE, pixels[x][y]);
}
}
But I have the feeling that there must be a much better way to do this. I do, however, need my array to be laid out as described above the code. So would the for-loop idea work, or is there a better way?
glReadPixels simply returns bytes in the order R, G, B, R, G, B, ... (based on your setting of GL_RGB) from the bottom left of the screen going up to the top right. From the OpenGL documentation:
glReadPixels returns pixel data from the frame buffer, starting with the pixel whose lower left corner is at location (x, y), into client memory starting at location data. Several parameters control the processing of the pixel data before it is placed into client memory. These parameters are set with three commands: glPixelStore, glPixelTransfer, and glPixelMap. This reference page describes the effects on glReadPixels of most, but not all of the parameters specified by these three commands.
The overhead of calling glReadPixels thousands of times will most likely take a noticeable amount of time (depends on the window size, I wouldn't be surprised if the loop took 1-2 seconds).
It is recommended that you call glReadPixels only once and store the result in a byte array of size (width - x) * (height - y) * 3. From there you can reference a pixel's components with data[(py * width + px) * 3 + component], where px and py are the pixel location you want to look up and component selects the R, G, or B value of that pixel.
If you absolutely must have it in a 3-dimensional array, you can write some code to rearrange the 1d array after the glReadPixels call.
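For illustration, a minimal sketch of that single-call approach (width, height, px and py are placeholder names, not from the question; the GL_PACK_ALIGNMENT call is a precaution in case width * 3 is not a multiple of 4):
// read the whole region once, then index into the flat buffer
std::vector<unsigned char> data(width * height * 3); // needs <vector>
glPixelStorei(GL_PACK_ALIGNMENT, 1); // avoid row padding for GL_RGB
glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, data.data());
unsigned char r = data[(py * width + px) * 3 + 0]; // R at (px, py), py = 0 is the bottom row
unsigned char g = data[(py * width + px) * 3 + 1]; // G
unsigned char b = data[(py * width + px) * 3 + 2]; // B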
If you define the pixel array like this:
unsigned char pixels[MAX_Y][MAX_X][3];
and access it like this:
pixels[y][x][0] = r;
pixels[y][x][1] = g;
pixels[y][x][2] = b;
then you'll be able to read the pixels with one glReadPixels call:
glReadPixels(left, top, MAX_X, MAX_Y, GL_RGB, GL_UNSIGNED_BYTE, pixels);
What you can do is declare a simple one-dimensional array in a struct and use operator overloading for convenient subscript notation:
struct Pixel2d
{
static const int SIZE = 50;
unsigned char& operator()( int nCol, int nRow, int RGB)
{
return pixels[ ( nCol* SIZE + nRow) * 3 + RGB];
}
unsigned char pixels[SIZE * SIZE * 3 ];
};
int main()
{
Pixel2d p2darray;
glReadPixels(50, 50, Pixel2d::SIZE, Pixel2d::SIZE, GL_RGB, GL_UNSIGNED_BYTE, p2darray.pixels);
for( int i = 0; i < Pixel2d::SIZE ; ++i )
{
for( int j = 0; j < Pixel2d::SIZE ; ++j )
{
unsigned char rpixel = p2darray(i , j , 0);
unsigned char gpixel = p2darray(i , j , 1);
unsigned char bpixel = p2darray(i , j , 2);
}
}
}
Here you are reading a 50*50 block of pixels in one shot, and the operator()(int nCol, int nRow, int RGB) overload provides the needed convenience. For performance reasons you don't want to make too many glReadPixels calls.