Adapting an OpenCL kernel to OpenGL-only - opengl

I would like to figure out if there's a way to adapt the following OpenCL approach to OpenGL only. What this OpenCL kernel does is go through several buffers generated by the host and copied to the device memory to know what graphics to render to the OpenGL texture srgb. The CPU generates those buffers for each frame to render, enqueues the copying of those buffers to the GPU and enqueues the execution of the following kernel, which writes to the OpenGL texture while it is temporarily owned by OpenCL, and then naturally when this is all done OpenGL displays the texture on the screen. An iteration of this OpenCL kernel focuses entirely on fully generating and writing a single pixel in one pass only, it very much operates on a per-pixel basis and for every pixel of the texture in the same way with the same parameters and data to work with.
kernel void draw_queue_srgb_kernel(global float *paramlist, global int *poslist, global int *entrylist, global uchar *data_cl, write_only image2d_t srgb, const int sector_w, const int sector_size)
{
const int2 p = (int2) (get_global_id(0), get_global_id(1));
float4 pv; // pixel value (linear)
// this computes the pixel value
pv = draw_queue(paramlist, poslist, entrylist, data_cl, sector_w, sector_size);
// this writes the pixel value to the texture
write_imagef(srgb, p, linear_to_srgb(pv));
}
What is the best way to have a similar approach work only with OpenGL? I need the flexibility to provide arrays that are not textures in any way (they're more similar to lists of indices and parameters used for drawing) and to do a lot of math for each pixel (I use sqrt(), sin(), exp() and erf() a lot for instance), on the other hand every pixel of the texture is treated the same way once, and I'm writing to an OpenGL texture in the usual 8-bit/channel RGB(A) format, so nothing very fancy.
In case you're wondering why I want to ditch OpenCL it's for two reasons, graphics drivers tend to handle running kernels with pauses at regular intervals (like 60 FPS) in ways that are wasteful of the CPU (using as much CPU for the same load no matter if you run at 400 FPS or slow it down to 25 because of wasteful polling in the driver threads), but more importantly because I'm using emscripten to make JavaScript builds and since WebCL support is pretty much nonexistent and not foreseeably happening it would be best to stick strictly to WebGL. And it just seems like maybe what I want to do would be better done with OpenGL.

Related

Post-processing individual MSAA samples in CPU

I'm interested in sub-pixel sampling my OpenGL renders around the edge silhouettes of my meshes for a computer vision task. I'm thinking of using MSAA to do it efficiently (but the application is not for anti-aliasing). The problem I find with multisampling is that in order to read the samples from the GPU I can only blit the framebuffers into a non multisampling one, thus I cannot recover individual sample information. My questions are:
Is there a way to impelement a fragment shader that stores the results of a per-sample (GL_SAMPLE_SHADING) computation such that I can read those samples back to CPU? I've thought of using glSampleID to index the output to different out buffers but don't know if that's possible at all. Perhaps a method like the linked-list structures used for OIT (i.e. http://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf)? However, there they perform all computations on GPU so I'm not sure if I can read the linked list data from the CPU in any way.
Maybe MSAA is the wrong approach and there are other methods to do so. I guess my last resort is to super sample the render x times and thus recover individual samples, but that seems to be a very inefficient solution.
You can write a compute shader which reads the samples and writes each sample's data via imageLoad, and then writes it to an SSBOs (FS outputs and image load/store would not be appropriate for the output). You'll need the usual memory barrier synchronization when it comes time to read it, but this way, you can write directly to a buffer object, rather than having to use a PBO to read from a texture.
The hardest part will be converting gl_GlobalInvocationID and the other compute shader inputs into the index in the SSBO array as well as the texture coordinate and sample index for your imageLoad operation.

Why is glReadPixels so slow and are there any alternative?

I need to take sceenshots at every frame and I need very high performance (I'm using freeGlut). What I figured out is that it can be done like this inside glutIdleFunc(thisCallbackFunction)
GLubyte *data = (GLubyte *)malloc(3 * m_screenWidth * m_screenHeight);
glReadPixels(0, 0, m_screenWidth, m_screenHeight, GL_RGB, GL_UNSIGNED_BYTE, data);
// and I can access pixel values like this: data[3*(x*512 + y) + color] or whatever
It does work indeed but I have a huge issue with it, it's really slow. When my window is 512x512 it runs no faster than 90 frames per second when only cube is being rendered, without these two lines it runs at 6500 FPS! If we compare it to irrlicht graphics engine, there I can do this
// irrlicht code
video::IImage *screenShot = driver->createScreenShot();
const uint8_t *data = (uint8_t*)screenShot->lock();
// I can access pixel values from data in a similar manner here
and 512x512 window runs at 400 FPS even with a huge mesh (Quake 3 Map) loaded! Take into account that I'm using openGL as driver inside irrlicht. To my inexperienced eye it seems like glReadPixels is copying every pixel data from one place to another while (uint8_t*)screenShot->lock() is just copying a pointer to already existent array. Can I do something similar to latter using freeGlut? I expect it to be faster than irrlicht.
Note that irrlicht uses openGL too (well it offers directX and other options as well but in the example I gave above I used openGL and by the way it was the fastest compared to other options)
OpenGL methods are used to manage the rendering pipeline. In its nature, while the graphics card is showing image to the viewer, computations of the next frame are being done. When you call glReadPixels; graphics card wait for the current frame to be done, reads the pixels and then starts computing the next frame. Therefore pipeline becomes stalled and becomes sequential.
If you can hold two buffers and tell to the graphics card to read data into these buffers interchanging each frame; then you can read-back from your buffer 1-frame late but without stalling the pipeline. This is called double buffering. You can also do triple buffering with 2 frame late read-back and so on.
There is a relatively old web page describing the phenomenon and implementation here: http://www.songho.ca/opengl/gl_pbo.html
Also there are a lot of tutorials about framebuffers and rendering into a texture on the web. One of them is here: http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/

What is the difference between clearing the framebuffer using glClear and simply drawing a rectangle to clear the framebuffer?

I think at least some old graphics drivers used to crash if glClear wasn't used and that glClear is probably faster in a lot of cases but why? How are 3-d graphics drivers usually implemented such that these uses would have different results?
On a high level, it can be faster because the OpenGL implementation knows ahead of time that the whole buffer needs to be set to the same color/value. The more you know about what exactly needs to be done, the more you can take advantage of possible accelerations.
Let's say setting a whole buffer to the same value is more efficient than setting the same pixels to variable values. With a glClear(), you know already that all pixels will have the same value. If you draw a screen sized quad with a fragment shader that emits a constant color, the driver would either have to recognize that situation by analyzing the shaders, or the system would have to compare the values coming out of the shader, to know that all pixels have the same value.
The reason why setting everything to the same value can be more efficient has to do with framebuffer compression and related technologies. GPUs often don't actually write each pixel out to the framebuffer, but use various kinds of compression schemes to reduce the memory bandwidth needed for framebuffer writes. If you imagine almost any kind of compression, all pixels having the same value is very favorable.
To give you some ideas about the published vendor specific technologies, here are a few sources. You can probably find more with a search.
Article talking about new framebuffer compression method in relatively recent AMD cards: http://techreport.com/review/26997/amd-radeon-r9-285-graphics-card-reviewed/2.
NVIDIA patent on zero bandwidth clears: http://www.google.com/patents/US8330766.
Blurb on ARM web site about Mali framebuffer compression: http://www.arm.com/products/multimedia/mali-technologies/arm-frame-buffer-compression.php.
Why is it faster? Because it is a function that bypasses most calculations that other types of drawings have to go through.
Alpha function, blend function, logical operation, stenciling, texture mapping, and depth-buffering are ignored by glClear
Source
Why do some drivers crash without it? It's hard to say, but it should have something to do with the implementation details of OpenGL. The functions does what it's supposed to do, but might do more that you don't know about.
OpenGL might infer from this function call other tasks that it needs to perform.

GPU programming for image processing

I'm working on a project aimed to control a bipad humanoid robot. Unfortunately we have a very limited set of hardware resources (a RB110 board and its mini PCI graphic card). I'm planning to port image processing tasks from CPU to graphic card processor of possible but never done it before... I'm advised to use OpenCV but seems to impossible because our graphic card processor (Volari Z9s) is not supported by framework. Then I found an interesting post on Linux Journal. Author have used OpenGL to process frames retrieved from a v4l device.
I'm a little confused about the relationship between hardware API and OpenGL/OpenCV. In order to utilize a GPU, do the hardware need to be sopported by graphic programming frameworks (OpenGL/OpenCV)? Where can I find such an API?
I googled a lot about my hardware, unfortunately the vendor (XGI Technology) seems to be somehow extinct...
In order to utilize a GPU, do the hardware need to be sopported by graphic programming frameworks (OpenGL/OpenCV)? Where can I find such an API?
OpenCL and OpenGL are both translated to hardware instructions by the GPU driver, so you need a driver for your operating system that supports these frameworks. Most GPU drivers support some version of OpenGL so that should work.
The OpenGL standard is maintained by the Khronos Group and you can find some tutorials at nehe.
How OpenGL works
OpenGL accepts triangles as input and draws them according to the state it has when the draw is issued. Most OpenGL functions are there to change the operations performed by manipulating this state. Image manipulation can be done by loading the input image as a texture and drawing several vertices with the texture active, resulting in a new Image (or more generic a new 2D grid of data).
From version > 2 (or with the right ARB extensions) the operations performed on the image can be controlled with GLSL programs called vertex and fragment shaders (there are more shaders, but these are the oldest). A vertex shader will be called once per vertex, the results of this are interpolated and forwarded to the fragment shader. A fragment shader will be called every time a new fragment(pixel) is written to the result.
Now this is all about reading and writing images, how to use it for object detection?
Use Vertices to span the input texture over the whole viewport. Instead of computing rgb colors and storing them in the result you can write a fragmentshader that computes grayscale images / gradient images and then checks these textures for each pixel if the pixel is in the center of a cycle with a specific size, part of a line or just has a relatively high gradient compared to its surrounding (good feature) or really anithing else you can find a good parallel algorithm for. (haven't done this myself)
The end result has to be read back to the cpu (sometimes you can use shaders to scale the data down before doing this). OpenCL gives it a less Graphics like feel and gives a lot more freedom but is less supported.
First of all You need shader support (GLSL or asm)
Usual way will be rendering full screen quad with your image (texture) and applying fragment shader. It's called Post-Processing And limited with instruction set and another limitations that your hardware has. On basic lvl it allows you to apply simple (single function) on large data set in parallel way that will produce another data set. But branching (if it is supported) is first performance enemy because GPU consist from couple SIMD blocks

What is the most efficient process to push YUV texture data onto a GPU in OpenGL?

Does anyone know of an efficient way to push 2vuy non-planar data onto a GPU in a way that doesn't require swizzling?
I am grabbing the raw 2vuy data from an h264 video file and successfully loading it into a texture that I map to an an OpenGL object. I notice that my code spends a fair amount of time in glgProcessPixelsWithProcessor. My glTexImage2D call looks like the following:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_YCBCR_422_APPLE,
GL_UNSIGNED_SHORT_8_8_APPLE, data);
Apple says in its OpenGL guide that GL_YCBCR_422_APPLE, provides "acceptable" performance (p103), but that
Note: If your data needs only to be swizzled, glgProcessPixels performs the swizzling reasonably fast although not as fast as if the data didn't need swizzling. But non-native data formats are converted one byte at a time and incurs a performance cost that is best to avoid.
I assume that there is some kind of internal format conversion going on the CPU. I noticed in another thread that glgProcessPixels is running a block method as well.
Is my path the most efficient? If not, what is?
Your code, as it stands right now depends on extensions of Apple. I can't tell what's happening inside.
However what I suggest is, that you create three 2D textures, each with exactly one channel, where each texture receives one of the color planes; using independent textures makes supporting chroma subsampling (that 422) simpler.
In a shader you'd then perform the colorspace conversion. When writing down the math I suggest you do this via a contact color space, like XYZ, as this allows you, to take the color profile of the output device into account; ICC profiles provide the conversion data from XYZ color space coordinates to device color space (RGB) coordinates.