I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
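A minimal sketch of the kernel side, assuming the image lives in an OpenCL buffer of width*height floats (a single channel, for brevity) and is processed in place; the host setup (context, queue, buffer upload, enqueue with a 1-D global size equal to the number of rows) is omitted:

// Hypothetical OpenCL kernel source, embedded as a C++ string.
const char* kRowKernelSrc = R"CLC(
// One work-item per row: per-row state lives in private variables,
// reset at the start of the row and carried across the columns.
__kernel void process_rows(__global float* img, int width, int height)
{
    int row = get_global_id(0);
    if (row >= height) return;

    float state = 0.0f;                    // reset the per-row state here

    for (int col = 0; col < width; ++col) {
        int idx = row * width + col;
        float px = img[idx];

        state = 0.5f * state + px;         // placeholder update rule
        img[idx] = state;                  // in-place write
    }
}
)CLC";

Each work-item only ever touches its own row, so the in-place writes don't race with each other; the parallelism is exactly the per-row parallelism described in the question.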
Yes, you can do it. No, you don't need 4.x hardware for that; you need fragment shaders (with flow control), framebuffer objects, and floating-point texture support.
You need to encode your data into a 2D texture.
Store the "state variable" in the first pixel of each row, and encode the rest of the data into the remaining pixels. It goes without saying that a floating-point texture format is recommended.
Use two framebuffers and render them onto each other in a loop, using a fragment shader that updates the "state variable" in the first column and performs whatever operation you need on the "current" column. To reduce the amount of wasted work you can limit rendering to the columns you want to process. The NVIDIA OpenGL SDK examples had "Game of Life", "GPGPU fluid", and "GPU particles" demos that work in a similar fashion: by encoding data into a texture and then using shaders to update it.
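Very roughly, the host-side ping-pong could look like the sketch below. This is only an outline under the assumptions above: context/FBO/texture creation and the column-stepping fragment shader are omitted, and uCurrentColumn and drawColumnQuad are placeholder names, not real API.

#include <utility>   // std::swap

void drawColumnQuad(int col);   // your own helper: draws a 1-pixel-wide quad over column `col`

// Hypothetical ping-pong driver. fbo[i] has tex[i] (e.g. an RGBA32F texture) as its
// color attachment, and the processing shader program is assumed to be bound.
void processRows(GLuint fbo[2], GLuint tex[2], int width, GLint uCurrentColumn)
{
    int src = 0, dst = 1;
    for (int col = 0; col < width; ++col) {
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[dst]);
        glBindTexture(GL_TEXTURE_2D, tex[src]);   // previous pass's result as input

        glUniform1i(uCurrentColumn, col);         // which column the shader should update
        drawColumnQuad(col);

        // Note: untouched columns must also carry over to the destination
        // (copy them, or render the full width instead of a single column).
        std::swap(src, dst);                      // next pass reads what we just wrote
    }
}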
However, just because you can do it doesn't mean you should, and it doesn't mean it's guaranteed to be fast. Some GPUs have very high texture memory read speed but relatively slow computation speed (or vice versa), and not all GPUs have many pipelines for processing things in parallel.
Also, depending on your app, CUDA or OpenCL might be more suitable.
I'm interested in sub-pixel sampling my OpenGL renders around the edge silhouettes of my meshes for a computer vision task. I'm thinking of using MSAA to do it efficiently (but the application is not anti-aliasing). The problem I find with multisampling is that, in order to read the samples from the GPU, I can only blit the framebuffer into a non-multisampled one, so I cannot recover individual sample information. My questions are:
Is there a way to implement a fragment shader that stores the results of a per-sample (GL_SAMPLE_SHADING) computation such that I can read those samples back on the CPU? I've thought of using gl_SampleID to index the output into different out buffers, but I don't know if that's possible at all. Perhaps a method like the linked-list structures used for OIT (i.e. http://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf)? However, there they perform all computations on the GPU, so I'm not sure whether I can read the linked-list data from the CPU in any way.
Maybe MSAA is the wrong approach and there are other methods to do this. I guess my last resort is to supersample the render x times and thus recover individual samples, but that seems to be a very inefficient solution.
You can write a compute shader which reads the samples via imageLoad and writes each sample's data to an SSBO (fragment shader outputs and image load/store would not be appropriate for the output). You'll need the usual memory barrier synchronization when it comes time to read it, but this way you can write directly to a buffer object, rather than having to use a PBO to read from a texture.
The hardest part will be converting gl_GlobalInvocationID and the other compute shader inputs into the index in the SSBO array as well as the texture coordinate and sample index for your imageLoad operation.
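A minimal sketch of such a shader, assuming the MSAA color image is bound to an image unit and the size and sample count are passed as uniforms (uMsaaImage, uSize, uSampleCount, and the binding points are placeholder choices):

// Hypothetical compute shader: copies every sample of an MSAA color image
// into a tightly packed SSBO, one vec4 per sample.
const char* kResolveSamplesSrc = R"GLSL(
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

layout(rgba8, binding = 0) uniform readonly image2DMS uMsaaImage;

layout(std430, binding = 1) buffer Samples {
    vec4 samples[];             // width * height * sampleCount entries
};

uniform ivec2 uSize;            // image dimensions
uniform int   uSampleCount;

void main()
{
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(p, uSize))) return;

    for (int s = 0; s < uSampleCount; ++s) {
        int idx = (p.y * uSize.x + p.x) * uSampleCount + s;
        samples[idx] = imageLoad(uMsaaImage, p, s);
    }
}
)GLSL";

After glDispatchCompute you would issue glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT) before mapping the SSBO or reading it back with glGetBufferSubData.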
I'm wondering how, e.g., graphics (/game) engines do their job with lots of heterogeneous data, while a customized simple rendering loop turns into a nightmare when you have some small changes.
Example:
First, let's say we have some blocks in our scene.
Graphic-Engine: create cubes and move them
Customized: create a cube template for vertices, normals, etc., copy and translate them to their positions, and copy them e.g. into a VBO. One glDraw* call does the job.
Second, some weird logic: we want blocks 1, 4, 7, ... to rotate on the x-axis, 2, 5, 8, ... on the y-axis, and 3, 6, 9, ... on the z-axis, with a rotation speed linear in the camera distance.
Graphic-Engine: manipulate the object's matrix and it works.
Customized: (I think) a per-object glDraw* call with a changing model-matrix uniform is not a good idea, so the translation matrix should be something like an attribute? I have to update them every frame.
Third, a block should disappear if its distance to the camera is lower than some constant value Q.
Graphic-Engine: if (object.distance(camera) < Q) scene.drop(object);
Customized: (I think) our VBO is invalid and we have to recreate it?
Back to the very first sentence: it feels like engines do those manipulations for free, while we have to rethink how to provide and update data. And while we do so, the engine might (but I actually don't know) say: 'update whatever you want, I'm going to send all matrices anyway'.
Another example: what about a voxel-based world (e.g. Minecraft) where we only draw the visible surface, and we are able to throw a bomb and destroy many voxels? If the world's view data is in one huge buffer, we only have one glDraw* call but have to recreate the buffer every time. If there are smaller chunks, we have many glDraw* calls and still have to manipulate buffers, just smaller ones.
So is it a good deal to send, let's say, 10 MB of buffer update data instead of 2 gl* calls with 1 MB? How many updates are okay? Should a rendering loop deal with lazy updates?
I'm searching for a guide on what a 60 fps application should be able to update/draw per frame, to get a feeling for what is possible. In my tests, every attempted optimization becomes another bottleneck.
And I don't want those tutorials which say: hey, there is a cool new gl*Instance call which is super fast, but you have to check whether your GPU supports it. Besides, I'd rather treat that as an optimization than as a meaningful first implementation.
Do you have any ideas, sources, best practices, or rules of thumb for how rendering and updating routines best play together?
My questions are all nearly the same:
How many updates per frame are okay on today's hardware?
Can I lazy-load data so that it is available after a few frames, without freezing my application?
Do I have to do small updates and profile my loop to see if there are some microseconds left until the next render?
Maybe I should implement a real-time profiler which builds up a sense over time of how expensive updates are and can determine the number of updates per frame?
Thank you.
It's unclear how any of your questions relate to your "graphics engine" vs. "customized" examples. All the updates you do with a "graphics engine" are translated into those OpenGL calls in the end.
In brief:
How many updates per frame are okay on today's hardware?
Today's PCIe bandwidth is huge (it can go as high as 30 GB/s). However, to utilize it fully you have to reduce the number of transactions by consolidating OpenGL calls. The exact number of updates depends entirely on the hardware, the drivers, and the way you use them, and graphics hardware is diverse.
This is the kind of answer you didn't want to hear, but unfortunately you have to face the truth: to reduce the number of OpenGL calls you have to use the newer API versions. E.g. instead of setting each uniform individually, you're better off submitting a bunch of them through uniform buffer objects. Instead of submitting each model's MVP individually, it's better to use instanced rendering. And so on.
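As a rough illustration of both ideas (uniform buffer objects plus instancing), here is a sketch; the block size of 256 matrices and all the names are arbitrary choices, not anything prescribed, and 256 mat4s happens to match the 16 KB minimum UBO size OpenGL guarantees.

// Hypothetical vertex shader: per-instance matrices come from a UBO and are
// indexed with gl_InstanceID, so one instanced draw replaces N separate draws.
const char* kInstancedVS = R"GLSL(
#version 330 core
layout(location = 0) in vec3 aPos;

layout(std140) uniform PerInstance {
    mat4 uModel[256];            // one matrix per instance
};
uniform mat4 uViewProj;

void main()
{
    gl_Position = uViewProj * uModel[gl_InstanceID] * vec4(aPos, 1.0);
}
)GLSL";

// Host side: upload all matrices in one call, then draw all instances at once.
//   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, 256 * 16 * sizeof(float), matrices);
//   glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr, 256);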
An even more radical approach would be to move to a lower-level (and newer) API, i.e. Vulkan, which aims to solve exactly this problem: the cost of submitting work to the GPU.
Can I lazy-load data so that it is available after a few frames, without freezing my application?
Yes, you can upload buffer objects asynchronously. See Buffer Object Streaming for details.
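For example, a common streaming pattern is to "orphan" the buffer's old storage so the driver doesn't have to stall on frames the GPU is still reading; a minimal sketch (the function name is made up):

// Hypothetical per-frame update using buffer orphaning: re-specifying the
// data store lets the driver hand back fresh memory instead of blocking
// until the GPU has finished with the old contents.
void streamVertices(GLuint vbo, const void* data, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW); // orphan old storage
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);              // fill the new one
}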
Do I have to do small updates and profile my loop to see if there are some microseconds left until the next render?
Maybe I should implement a real-time profiler which builds up a sense over time of how expensive updates are and can determine the number of updates per frame?
You don't need any of these if you do it asynchronously.
I am not a graphics programmer, I use C++ and C mainly, and every time I try to get into OpenGL, every book and every resource starts like this:
GLfloat Vertices[] = {
some, numbers, here,
some, more, numbers,
numbers, numbers, numbers
};
Or they may even be vec4.
But then you do something like this:
for (int i = 0; i < 10000; i++)
    for (int j = 0; j < 10000; j++)
        make_vertex();
And you get a problem. That loop is going to take a significant amount of time to finish, and if the make_vertex() function is anything like a saxpy or something of the sort, it is not just a problem... it is a big problem. For example, let us assume I wish to create fractal terrain. For any modern graphics card this would be trivial.
I understand the paradigm goes like this: Write the vertices manually -> Send them over to the GPU -> GPU does vertex processing, geometry, rasterization all the good stuff. I am sure it all makes sense. But why do I have to do the entire 'Send it over' step? Is there no way to skip that entire intermediary step, and just create vertices on the GPU, and draw them, without the obvious bottleneck?
I would very much appreciate at least a point in the right direction.
I also wonder if there is a possible solution without delving into compute shaders or CUDA. Does OpenGL or GLSL not provide a suitable random function which can be executed in parallel?
I think what you're asking for could work by generating height maps with a compute shader and mapping them onto a grid with fixed spacing, which can be generated trivially. That's a possible solution off the top of my head. You can use GL compute shaders, OpenCL, or CUDA. Details can be generated with geometry and tessellation shaders.
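A minimal sketch of that idea, assuming the height map is bound as an r32f image; since GLSL has no built-in random(), a small hash (here the common fract(sin(dot(...))) trick) stands in for proper fractal noise:

// Hypothetical compute shader that fills an r32f height map on the GPU.
const char* kHeightmapCS = R"GLSL(
#version 430
layout(local_size_x = 16, local_size_y = 16) in;
layout(r32f, binding = 0) uniform writeonly image2D uHeights;

float hash(vec2 p)
{
    return fract(sin(dot(p, vec2(12.9898, 78.233))) * 43758.5453);
}

void main()
{
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    // Replace this with proper fractal noise (fBm, Perlin, ...) as needed.
    float h = hash(vec2(p) * 0.01);
    imageStore(uHeights, p, vec4(h, 0.0, 0.0, 0.0));
}
)GLSL";

The vertex shader drawing the fixed grid would then just read its height from this texture.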
As for preventing the camera from clipping, you'd probably have to use transform feedback and do a check per frame to see if the direction you're moving in will intersect the geometry.
Your entire question seems to be built on a huge misconception, that vertices are the only things which need to be "crunched" by the GPU.
First, you should understand that GPUs are far superior to CPUs when it comes to parallelism (heck, GPUs sacrifice conditional branching for the sake of parallelism). Second, shaders and the buffers you create are all stored on the GPU after being uploaded by the CPU. The reason you don't just create all vertices on the GPU? It's the same reason why you load an image from the hard drive instead of creating a raw 2D array and filling it up with your pixel data inline. Even then, your image would be stored in the executable program file, which sits on the hard disk and is only loaded into memory when you run it. In an actual application, you'll want to load your graphics from assets stored somewhere (usually the hard drive). Why not let the GPU load the assets from the hard drive by itself? The GPU isn't connected to the machine's storage directly, but only to the system's main memory via a bus. To access any storage directly, the GPU would have to deal with the file system, which is managed by the OS. That's one of the things the CPU is faster at doing, since we're dealing with serialized data.
Now, what shaders deal with is the data you upload to the GPU (vertices, texture coordinates, textures, etc.). In ancient OpenGL, no one had to write any shaders. Graphics drivers came with a built-in pipeline which handled regular rendering requests for you. You'd provide it with 4 vertices, 4 texture coordinates, and a texture, among other things (transformation matrices, etc.), and it'd draw your graphics for you on the screen. You could go a bit further and add some lights to your scene and maybe customize a few things about it, but things were still pretty tight. Newer OpenGL specifications gave more freedom to developers by allowing them to rewrite parts of the pipeline with shaders. The developer becomes responsible for transforming vertices into place and doing all sorts of other calculations related to lighting, etc.
I would very much appreciate at least a point in the right direction.
I am guessing it has something to do with uniforms, but really, with me skipping pages, I really cannot understand how a shader program runs or what the lifetime of the variables is.
Uniforms are variables you can send to the shaders from the CPU every frame, before you use them to render graphics. When you use the saturation slider in Photoshop or GIMP, it (probably) sends the saturation factor value to the shader as a uniform of type float. Uniforms are what you use to communicate little settings like these to your shaders from your application.
To use a shader program, you first have to set it up. A shader program consists of at least two types of shaders linked together: a fragment shader and a vertex shader. You use some OpenGL functions to upload your shader sources to the GPU, issue an order of compilation followed by linking, and it'll give you the program's ID. To use this program, you simply glUseProgram(programId) and everything following this call will use it for drawing. The vertex shader is the code that runs on the vertices you send, to position them on the screen correctly. This is where you can do transformations on your geometry like scaling, rotation, etc. A fragment shader runs at some stage afterwards, using interpolated values output from the vertex shader to define the color and depth of every fragment of what you're drawing. This is where you can do post-processing effects on your pixels.
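As a minimal sketch of that setup (error checking via glGetShaderiv/glGetProgramiv is omitted, and the uniform name is just an example):

// Compile two shaders, link them into a program, then select it and feed it a uniform.
GLuint compile(GLenum type, const char* src)
{
    GLuint s = glCreateShader(type);
    glShaderSource(s, 1, &src, nullptr);
    glCompileShader(s);
    return s;
}

GLuint makeProgram(const char* vsSrc, const char* fsSrc)
{
    GLuint vs = compile(GL_VERTEX_SHADER, vsSrc);
    GLuint fs = compile(GL_FRAGMENT_SHADER, fsSrc);
    GLuint prog = glCreateProgram();
    glAttachShader(prog, vs);
    glAttachShader(prog, fs);
    glLinkProgram(prog);
    glDeleteShader(vs);        // the program keeps the compiled code
    glDeleteShader(fs);
    return prog;
}

// Per frame:
//   glUseProgram(prog);
//   glUniform1f(glGetUniformLocation(prog, "uSaturation"), 0.8f);
//   ...draw...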
Anyway, I hope I've helped make a few things clearer to you, but I can only tell you that there are no shortcuts. OpenGL has quite a steep learning curve, but it all connects and things start to make sense after a while. If you're getting bored of books and such, then consider taking the code snippets from every lesson, compiling them, and messing around with them while trying to reason about what they do as you go. You'll have to resort to written documents eventually, but hopefully things will fit into your head more easily once you have some experience with the implementation components. Good luck.
Edit:
If you're trying to generate vertices on the fly using some algorithm, then try looking into Geometry Shaders. They may give you what you want.
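For instance, a geometry shader can expand each input point into more geometry entirely on the GPU; here is a hedged sketch (the size uniform is made up) that turns every point into a small screen-space quad:

// Hypothetical geometry shader: each input point becomes a quad (triangle strip).
const char* kPointToQuadGS = R"GLSL(
#version 330 core
layout(points) in;
layout(triangle_strip, max_vertices = 4) out;

uniform float uSize;

void main()
{
    vec4 c = gl_in[0].gl_Position;
    gl_Position = c + vec4(-uSize, -uSize, 0.0, 0.0); EmitVertex();
    gl_Position = c + vec4( uSize, -uSize, 0.0, 0.0); EmitVertex();
    gl_Position = c + vec4(-uSize,  uSize, 0.0, 0.0); EmitVertex();
    gl_Position = c + vec4( uSize,  uSize, 0.0, 0.0); EmitVertex();
    EndPrimitive();
}
)GLSL";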
You probably want to use CUDA for the things you are used to doing in C or C++, and let OpenGL access the rasterizer and other graphics stuff.
OpenGL and CUDA interact quite nicely. A good entry point for customizing the contents of a buffer object is the cudaGraphicsGLRegisterBuffer method: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html#group__CUDART__OPENGL_1g0fd33bea77ca7b1e69d1619caf44214b
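A sketch of the usual register/map/write/unmap cycle (in practice you would register the buffer once at startup rather than every call; the kernel itself is left as a comment):

#include <cuda_gl_interop.h>

// Hypothetical interop helper: register an existing GL buffer with CUDA,
// map it so a kernel can write into it, then unmap so GL can draw from it.
void updateVboWithCuda(GLuint vbo, int vertexCount)
{
    cudaGraphicsResource* res = nullptr;
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsWriteDiscard);

    cudaGraphicsMapResources(1, &res, 0);
    void*  devPtr = nullptr;
    size_t bytes  = 0;
    cudaGraphicsResourceGetMappedPointer(&devPtr, &bytes, res);

    // Launch your CUDA kernel here with devPtr as the output, e.g.
    //   fillVertices<<<blocks, threads>>>((float4*)devPtr, vertexCount);

    cudaGraphicsUnmapResources(1, &res, 0);
    cudaGraphicsUnregisterResource(res);
}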
You may also want to have a look at the nbody sample from the NVIDIA GPU SDK samples that come with current CUDA installs.
I think at least some old graphics drivers used to crash if glClear wasn't used, and glClear is probably faster in a lot of cases, but why? How are 3D graphics drivers usually implemented such that these uses would have different results?
On a high level, it can be faster because the OpenGL implementation knows ahead of time that the whole buffer needs to be set to the same color/value. The more you know about what exactly needs to be done, the more you can take advantage of possible accelerations.
Let's say setting a whole buffer to the same value is more efficient than setting the same pixels to variable values. With a glClear(), you know already that all pixels will have the same value. If you draw a screen sized quad with a fragment shader that emits a constant color, the driver would either have to recognize that situation by analyzing the shaders, or the system would have to compare the values coming out of the shader, to know that all pixels have the same value.
The reason why setting everything to the same value can be more efficient has to do with framebuffer compression and related technologies. GPUs often don't actually write each pixel out to the framebuffer, but use various kinds of compression schemes to reduce the memory bandwidth needed for framebuffer writes. If you imagine almost any kind of compression, all pixels having the same value is very favorable.
To give you some ideas about the published vendor specific technologies, here are a few sources. You can probably find more with a search.
Article talking about new framebuffer compression method in relatively recent AMD cards: http://techreport.com/review/26997/amd-radeon-r9-285-graphics-card-reviewed/2.
NVIDIA patent on zero bandwidth clears: http://www.google.com/patents/US8330766.
Blurb on ARM web site about Mali framebuffer compression: http://www.arm.com/products/multimedia/mali-technologies/arm-frame-buffer-compression.php.
Why is it faster? Because it is a function that bypasses most of the calculations that other types of drawing have to go through.
Alpha function, blend function, logical operation, stenciling, texture mapping, and depth-buffering are ignored by glClear
Source
Why do some drivers crash without it? It's hard to say, but it probably has something to do with the implementation details of OpenGL. The function does what it's supposed to do, but might do more than you know about.
OpenGL might infer from this function call other tasks that it needs to perform.
The purpose here isn't rendering, but GPGPU; it's for image blurring:
given an image, I need to blur it with a fixed given separable kernel (see e.g. Separable 2D Blur Kernel).
For GPU processing, a good popular method would be to first filter the rows, then filter the columns, using the vertex shader and the fragment shader to do so (*).
However, if I have a fixed-size kernel, I think I can use a quickly-calculated mipmap that is close to the level I want, and then upsample it (as was suggested here).
The question is therefore: will an opengl-created mipmap be faster than a mipmap I create myself using the method of (*)?
Put another way: is the mipmap creation optimized on the GPU itself? Will it always outperform (speed-wise) user-created GLSL code? Or does it depend on the graphics card?
Edit:
Thanks for the replies (Kahler, Jean-Simon Brochu). However, I still haven't seen any resources that explicitly say whether mipmap generation by the GPU is faster than any user-created mipmaps because of specific mipmap-generation GPU hardware...
OpenGL does not care how the functions are implemented.
OpenGL is a set of specifications, among them is the glGenerateMipmap.
Anyone can write a software renderer or develop a video card compliant with the specification. If it passes the tests, it's ~OpenGL certified~.
That means no function is required to be performed on the CPU or GPU, or anywhere in particular; they just have to produce the results OpenGL expects.
Now for the practical side:
Nowadays, you can just assume the mipmap generation is done by the video card, because the major vendors have adopted this approach.
If you really want to know, you will have to check specifically for the video card you are programming for.
As for performance, assume you can't beat the video card.
Even if you come up with some highly optimized code running on some high-tech, feature-packed CPU, you will have to upload the mipmaps you generated to the GPU, and that operation alone will probably take more time than letting the GPU do the work after you've uploaded the full-resolution texture.
And if you implement the mipmapping as a shader, it's still unlikely to beat the hard-coded (maybe even hard-wired) built-in function (and that's considering the code alone, not counting the fact that it may be scheduled better, run separately, etc.).
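For reference, the mipmap-based blur path the question describes boils down to a couple of calls plus an explicit-LOD fetch in the shader; uTex, uv, and uLevel below are placeholder names:

// tex: your already-uploaded full-resolution texture.
// Let the driver build the mipmap chain, then sample a coarser level.
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glGenerateMipmap(GL_TEXTURE_2D);   // typically executed on the GPU by the driver

// In the fragment shader, pick the level whose footprint matches the blur kernel:
//   vec4 blurred = textureLod(uTex, uv, uLevel);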
This site explains the glGenerateMipmap history better =))