Get results of GPU calculations back to the CPU program in OpenGL - opengl

Is there a way to get results from a shader running on a GPU back to the program running on the CPU?
I want to generate a polygon mesh from simple voxel data based on a computational costly algorithm on the GPU but I need the result on the CPU for physics calculations.

Define "the results"?
In general, if you're doing GPGPU-style computations with OpenGL, you are going to need to structure your shaders around the needs of a rendering system. Rendering systems are designed to be one-way: data goes into them and an image is produced. Going backwards, having the rendering system produce data, is not generally how rendering systems are structured.
That doesn't mean you can't do it, of course. But you need to architect everything around the limitations of OpenGL.
OpenGL offers a number of hooks where you can write data from certain shader stages. Most of these require specialized hardware
Fragment shader outputs
Any hardware capable of fragment shaders will obviously allow you to write to the current framebuffer you're rendering. Through the use of framebuffer objects and textures with floating-point or integer image formats, you can write pretty much any data you want to a variety of images. Once in a texture, you can simply call glGetTexImage to get the rendered pixel data. Or you can just do glReadPixels to get it if the FBO is still bound. Either way works.
The primary limitations of this method are:
The number of images you can attach to the framebuffer; this limits the amount of data you can write. On pre-GL 3.x hardware, FBOs were typically limited to only 4 images plus a depth/stencil buffer. In 3.x and better hardware, you can expect a minimum of 8 images.
The fact that you're rendering. This means that you need to set up your vertex data to position a triangle exactly where you want it to modify data. This is not a trivial undertaking. It's also difficult to get useful input data, since you typically want each texel to be fairly independent of the other. Structuring your fragment shader around these limitations is difficult. Not impossible, but non-trivial in many cases.
Transform Feedback
This OpenGL 3.0 feature allows the output from the Vertex Processing stage of OpenGL (vertex shader and optional geometry shader) to be captured in one or more buffer objects.
This is much more natural for capturing vertex data that you want to play with or render again. In your case, you'll need to read it back after rendering it, perhaps with a glGetBufferSubData call, or by using glMapBufferRange for reading.
The limitations here are that you generally only can capture 4 output values, where each value is a vec4. There are also some strict layout restrictions. Some OpenGL 3.x and 4.x hardware offers the ability to write data to multiple feedback streams, which can all be written into different buffers.
Image Load/Store
This GL 4.2 feature represents the pinnacle of writing: you can bind an image (a buffer texture, if you want to write to a buffer), and just write to it. There are memory ordering constraints that you need to work within.
It's very flexible, but very complex. Besides the difficulty in using it properly, there are a number of limitations. The number of images you can write to will be fairly limited, perhaps 8 or so. And implementations may have total write limits, so that 8 images to write to may have to be shared by the fragment shader's outputs.
What's more, image outputs are only guaranteed for the fragment shader (and 4.3's compute shaders). That is, hardware is allowed to forbid you from using image load/store on non-FS/CS shader stages.

Related

Can GLSL handle buffers with arbitrary length?

I have an art application I'm dabbling with that uses OpenGL for accelerated graphics rendering. I'd like to be able to add the ability to draw arbitrary piecewise curves - pretty much the same sort of shapes that can be defined by the SVG 'path' element.
Rather than tessellating my paths into polygons on the CPU, I thought it might be better to pass an array of values in a buffer to my shader defining the pieces of my curve and then using an in/out test to check which pixels were actually inside. In other words, I'd be iterating through a potentially large array of data describing each segment in my path.
From what I remember back when I learned shader programming years ago, GPUs handle if statements by evaluating both branches and then throwing away the branch that wasn't used. This would effectively mean that it would end up silently running through my entire buffer even if I only used a small part of it (i.e., my buffer has the capacity to handle 1024 curve segments, but the simple rectangle I'm drawing only uses the first four of them).
How do I write my code to deal with this variable data? Can modern GPUs handle conditional code like this well?
GPUs can handle arbitrary-length buffers and conditionals (or fake it convincingly). The problem is that a vertex and geometry shaders cannot generate arbitrary number of triangles from a short description.
OpenGL 4.0 added two new types of shaders: Tessellation Control shaders and Tessellation Evaluation shaders. These shaders give you the ability to tessellate curves and surfaces on the GPU.
I found this tutorial to be quite useful in showing how to tessellate Bezier curves on the GPU.

Writing to depth buffer from opengl compute shader

Generally on modern desktop OpenGL hardware what is the best way to fill a depth buffer from a compute shader and then use that depth buffer for graphics pipeline rendering with triangles etc?
Specifically I am wondering about concerns regards HiZ. Also I wonder if it's better to do compute shader modifications to the depth buffer before or after the graphics rendering?
If the compute shader is run after the graphics rendering I assume the depth buffer will typically be decompressed behind the scenes. But I worry done the other way around the depth buffer may be in a decompressed/non-optimal state for the graphics pipeline?
As far as i know, you cannot bind textures with any of the depth formats as images, and thus cannot write to depth format textures in compute shaders. See glBindImageTexture documentation, it lists the formats that your texture format must be compatible to. Depth formats are not among them and the specification says the depth formats are not compatible to the normal formats.
Texture copying functions have the same compatibility restrictions, so you can't even e.g. write to a normal texture in the compute shader and then copy to a depth texture. glCopyImageSubData does not explicitly have that restriction but i haven't tried it and it's not part of the core profile anymore.
What might work is writing to a normal texture, then rendering a fullscreen triangle and setting gl_FragDepth to values read from the texture, but that's an additional fullscreen pass.
I don't quite understand your second question - if your compute shader stuff modifies the depth buffer, the result will most likely be different depending on whether you do it before or after regular rendering because different parts will be visible or occluded.
But maybe that question is moot since it seems you cannot manually write into depth buffers at all - which might also answer your third question - by not writing into depth buffers you cannot mess with the compression of it :)
Please note that i'm no expert in this, i had a similar problem and looked at the docs/spec myself, so this all might be wrong :) Please let me know if you manage to write to depth buffers with compute shaders!

OpenGL what does glTexImage2D do?

What does gl.glTexImage2D do? The docs say it "uploads texture data". But does this mean the whole image is in GPU memory? I'd like to use one large image file for texture mapping. Further: can I simply use a VBO for uv and position coordinates to draw the texture?
Right, I am using words the wrong way here. What I meant was carrying a 2D array of UV coordinates and a 2D array of model to subsample a larger PNG image (in texture memory) onto individual tile models. My confusion here lies in not knowing how fast these fetches can take. Lets say I have a 5000x5000 pixel image. I load it as a texture. Then I create my own algorithm for fetching portions of it to draw. Where do I save myself the bandwidth for drawing these tiles? If I implement an LOD algorithm to determine which tiles are close, which are far and which are out of the camera frustum how do manage each these tiles in memory? Loaded question I know but I am struggling to find the best implementation to get started. I am developing for mobile devices with OpenGL ES 2.0.
What exactly happens when you call glTexImage2D() is system dependent, and there's no way for you to know, unless you have developer tools that allow you to track GPU and memory usage.
The only thing guaranteed is that the data you pass to the call has been consumed by the time the call returns (since the API definition allows you to modify/free the data after the call), and that the data is accessible to the GPU when it's used for rendering. Between that, anything is fair game. Keep in mind that OpenGL is a very asynchronous API. When you make API calls, the corresponding work is mostly queued up for later execution by the GPU, and is generally not completed by the time the calls return. This can include calls for uploading data.
Also, not all GPUs have "GPU memory". In fact, if you look at them by quantity, very few of them do. Mobile GPUs have caches, but mostly not VRAM in the sense of traditional discrete GPUs. How VRAM and caches are managed is highly system dependent.
With all the caveats above, and picturing a GPU that has VRAM: While it's possible that they can load the data into VRAM in the glTexImage2D() call, I would be surprised if that was commonly done. It just wouldn't make much sense to me. When a texture is loaded, you have no idea how soon it will be used for rendering. Since you don't know if all textures will fit in VRAM (and they often will not), you might have to evict it from VRAM before it was ever used. Which would obviously be very wasteful. As a general strategy, I think it will be much more efficient to load the texture data into VRAM only when you have a draw call that uses it.
Things would be somewhat different if the driver could be very confident that all texture data will fit in VRAM. But with OpenGL, there's really no reasonable way to know this ahead of time. And things get even more complicated since at least on desktop computers, you can have multiple applications running at the same time, while VRAM is a shared resource.
You are correct.
glteximage2d is the function that actually moves the texture data across to the gpu.
you will need to create the texture object first using glGenTextures() and then bind it using glBindTexture().
there is a good example of this process in the opengl redbook
example
you can then use this texture with a VBO. There are many ways to accomplish this, but interleaving your vertex coordinates, texture coordinates, and vertex normals and then telling the GPU how to unpack them with several calls to glVertexAttribPointer is the best bet as far as performance.
you are on the right track with VBOs, the old fixed pipeline GL stuff is depricated so you should just learn VBO from the outset.
this book is not 100% up to date, but it is complete and free and should serve as a great place to start learning VBO Open GL Book

Is it possible to write a bunch of pixels in gl_FragColor?

Has anyone familiar with some sort of OpenGL magic to get rid of calculating bunch of pixels in fragment shader instead of only 1? Especially this issue is hot for OpenGL ES in fact meanwile flaws mobile platforms and necessary of doing things in more accurate (in performance meaning) way on it.
Are any conclusions or ideas out there?
P.S. it's known shader due to GPU architecture organisation is run in parallel for each texture monad. But maybe there techniques to raise it from one pixel to a group of ones or to implement your own glTexture organisation. A lot of work could be done faster this way within GPU.
OpenGL does not support writing to multiple fragments (meaning with distinct coordinates) in a shader, for good reason, it would obstruct the GPUs ability to compute each fragment in parallel, which is its greatest strength.
The structure of shaders may appear weird at first because an entire program is written for only one vertex or fragment. You might wonder why can't you "see" what is going on in neighboring parts?
The reason is an instance of the shader program runs for each output fragment, on each core/thread simultaneously, so they must all be independent of one another.
Parallel, independent, processing allows GPUs to render quickly, because the total time to process a batch of pixels is only as long as the single most intensive pixel.
Adding outputs with differing coordinates greatly complicates this.
Suppose a single fragment was written to by two or more instances of a shader.
To ensure correct results, the GPU can either assign one to be an authority and ignore the other (how does it know which will write?)
Or you can add a mutex, and have one wait around for the other to finish.
The other option is to allow a race condition regarding whichever one finishes first.
Either way this would immensely slows down the process, make the shaders ugly, and introduces incorrect and unpredictable behaviour.
Well firstly you can calculate multiple outputs from a single fragment shader in OpenGL 3 and up. A framebuffer object can have more than one RGBA surfaces (Renderbuffer Objects) attached and generate an RGBA for each of them by using gl_FragData[n] instead of gl_FragColor. See chapter 8 of the 5th edition OpenGL SuperBible.
However, the multiple outputs can only be generated for the same X,Y pixel coordinates in each buffer. This is for the same reason that an older style fragment shader can only generate one output, and can't change gl_FragCoord. OpenGL guarantees that in rendering any primitive, one and only one fragment shader will write to any X,Y pixel in the destination framebuffer(s).
If a fragment shader could generate multiple pixel values at different X,Y coords, it might try to write to the same destination pixel as another execution of the same fragment shader. Same if the fragment shader could change the pixel X or Y. This is the classic multiple threads trying to update shared memory problem.
One way to solve it would be to say "if this happens, the results are unpredictable" which sucks from the programmer point of view because it's completely out of your control. Or fragment shaders would have to lock the pixels they are updating, which would make GPUs far more complicated and expensive, and the performance would suck. Or fragment shaders would execute in some defined order (eg top left to bottom right) instead of in parallel, which wouldn't need locks but the performance would suck even more.

Is it possible to reuse glsl vertex shader output later?

I have a huge mesh(100k triangles) that needs to be drawn a few times and blend together every frame. Is it possible to reuse the vertex shader output of the first pass of mesh, and skip the vertex stage on later passes? I am hoping to save some cost on the vertex pipeline and rasterization.
Targeted OpenGL 3.0, can use features like transform feedback.
I'll answer your basic question first, then answer your real question.
Yes, you can store the output of vertex transformation for later use. This is called Transform Feedback. It requires OpenGL 3.x-class hardware or better (aka: DX10-hardware).
The way it works is in two stages. First, you have to set your program up to have feedback-based varyings. You do this with glTransformFeedbackVaryings. This must be done before linking the program, in a similar way to things like glBindAttribLocation.
Once that's done, you need to bind buffers (given how you set up your transform feedback varyings) to GL_TRANSFORM_FEEDBACK_BUFFER with glBindBufferRange, thus setting up which buffers the data are written into. Then you start your feedback operation with glBeginTransformFeedback and proceed as normal. You can use a primitive query object to get the number of primitives written (so that you can draw it later with glDrawArrays), or if you have 4.x-class hardware (or AMD 3.x hardware, all of which supports ARB_transform_feedback2), you can render without querying the number of primitives. That would save time.
Now for your actual question: it's probably not going to help buy you any real performance.
You're drawing terrain. And terrain doesn't really get any transformation. Typically you have a matrix multiplication or two, possibly with normals (though if you're rendering for shadow maps, you don't even have that). That's it.
Odds are very good that if you shove 100,000 vertices down the GPU with such a simple shader, you've probably saturated the GPU's ability to render them all. You'll likely bottleneck on primitive assembly/setup, and that's not getting any faster.
So you're probably not going to get much out of this. Feedback is generally used for either generating triangle data for later use (effectively pseudo-compute shaders), or for preserving the results from complex transformations like matrix palette skinning with dual-quaternions and so forth. A simple matrix multiply-and-go will barely be a blip on the radar.
You can try it if you like. But odds are you won't have any problems. Generally, the best solution is to employ some form of deferred rendering, so that you only have to render an object once + X for every shadow it casts (where X is determined by the shadow mapping algorithm). And since shadow maps require different transforms, you wouldn't gain anything from feedback anyway.