Post-processing individual MSAA samples in CPU - opengl

I'm interested in sub-pixel sampling my OpenGL renders around the edge silhouettes of my meshes for a computer vision task. I'm thinking of using MSAA to do it efficiently (but the application is not for anti-aliasing). The problem I find with multisampling is that in order to read the samples from the GPU I can only blit the framebuffers into a non multisampling one, thus I cannot recover individual sample information. My questions are:
Is there a way to impelement a fragment shader that stores the results of a per-sample (GL_SAMPLE_SHADING) computation such that I can read those samples back to CPU? I've thought of using glSampleID to index the output to different out buffers but don't know if that's possible at all. Perhaps a method like the linked-list structures used for OIT (i.e. http://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf)? However, there they perform all computations on GPU so I'm not sure if I can read the linked list data from the CPU in any way.
Maybe MSAA is the wrong approach and there are other methods to do so. I guess my last resort is to super sample the render x times and thus recover individual samples, but that seems to be a very inefficient solution.

You can write a compute shader which reads the samples and writes each sample's data via imageLoad, and then writes it to an SSBOs (FS outputs and image load/store would not be appropriate for the output). You'll need the usual memory barrier synchronization when it comes time to read it, but this way, you can write directly to a buffer object, rather than having to use a PBO to read from a texture.
The hardest part will be converting gl_GlobalInvocationID and the other compute shader inputs into the index in the SSBO array as well as the texture coordinate and sample index for your imageLoad operation.

Related

How to sample Renderbuffer depth information and process it in CPU code, without causing an impact on performance?

I am trying to sample a few fragments' depth data that I need to use in my client code (that runs on CPU).
I tried a glReadPixel() on my FrameBuffer Object, but turns out it stalls the render pipeline as it transfers data from Video Memory to Main Memory through the CPU, thus causes unbearable lag (please, correct me if I am wrong).
I read about Pixel Buffer objects, that we can use them as copies of other buffers, and very importantly, perform glReadPixel() operation without stalling the performance, but not without compromising to use outdated information. (That's OK for me.)
But, I am unable to understand about how to use Pixel Buffers.
What I've learnt is we need to sample data from a texture to store it in a PixelBuffer. But I am trying to sample from a Renderbuffer, which I've read is not possible.
So here's my problem - I want to sample the depth information stored in my Render Buffer, store it in RAM, process it and do other stuff, without causing any issues to the Rendering Pipeline. If I use a depth texture instead of a renderbuffer, i don't know how to use it for depth testing.
Is it possible to copy the entire Renderbuffer to the Pixelbuffer and perform read operations on it?
Is there any other way to achieve what I am trying to do?
Thanks!
glReadPixels can also transfer from a framebuffer to a standard GPU side buffer object. If you generate a buffer and bind it to the GL_PIXEL_PACK_BUFFER target, the data pointer argument to glReadPixels is instead an offset into the buffer object. (So probably should be 0 unless you are doing something clever.)
Once you've copied the pixels you need into a buffer object, you can transfer or map or whatever back to the CPU at a time convenient for you.

Read and Write in one Texture (OpenGL)

I want to store and update informations in a texture. So the idea is, that I create a new texture with current informations. While storing it in the render process I actually want to read the informations out of the same pixel and store a weighted average of both values. So the value that was rendered to that pixel and the value that was already on that pixel.
Now I read very often that I can not read and write on the same texture. Now my questions is, may it maybe be possible? and if not should I copy the texture information, before the rendering step and pass the copy to the shader? If so, how can I copy the texture? or should I do a extra rendering step for copying?
I see two possible options here, depending on the mix equation
Alpha Blending: If the equation used can be mapped to one of the glBlendFunc functions, then this is the way to go. If you want to use linear factors for the stored and the new value this should be possible. This is also the option where I would expect the best performance.
Image Load Store: With this method one can read and write to the same texture at the same time (see here). The performance will usually be very bad here and you will have to use the image atomic operations to ensure that multiple fragments at the same location always read the correct value.
Copying the texture would, in my opinion, only work if you render an image and then perform one weighted average computation on it afterwards (otherwise you would have to copy the texture after each store operation). But if this is the case, one could simple render the result of the average computation to a different texture and completely avoid all the trouble of copying the input data.
If resorting to an extension is an option, you can use NV_texture_barrier which allows writing and reading from the same texture.

OpenGL constructing and using data on the GPU

I am not a graphics programmer, I use C++ and C mainly, and every time I try to go into OpenGL, every book, and every resource starts like this:
GLfloat Vertices[] = {
some, numbers, here,
some, more, numbers,
numbers, numbers, numbers
};
Or they may even be vec4.
But then you do something like this:
for(int i = 0; i < 10000; i++)
for(int j = 0; j < 10000; j++)
make_vertex();
And you get a problem. That loop is going to take a significant amount of time to finish- and if the make_vertex() function is anything like a saxpy or something of the sort, it is not just a problem... it is a big problem. For example, let us assume I wish to create fractal terrain. For any modern graphic card this would be trivial.
I understand the paradigm goes like this: Write the vertices manually -> Send them over to the GPU -> GPU does vertex processing, geometry, rasterization all the good stuff. I am sure it all makes sense. But why do I have to do the entire 'Send it over' step? Is there no way to skip that entire intermediary step, and just create vertices on the GPU, and draw them, without the obvious bottleneck?
I would very much appreciate at least a point in the right direction.
I also wonder if there is a possible solution without delving into compute shaders or CUDA? Does openGL or GLSL not provide a suitable random function which can be executed in parallel?
I think what you're asking for could work by generating height maps with a compute shader, and mapping that onto a grid with fixed spacing which can be generated trivially. That's a possible solution off the top of my head. You can use GL Compute shaders, OpenCL, or CUDA. Details can be generated with geometry and tessellation shaders.
As for preventing the camera from clipping, you'd probably have to use transform feedback and do a check per frame to see if the direction you're moving in will intersect the geometry.
Your entire question seems to be built on a huge misconception, that vertices are the only things which need to be "crunched" by the GPU.
First, you should understand that GPUs are far more superior than CPUs when it comes to parallelism (heck, GPUs sacrifice conditional control jumping for the sake of parallelism). Second, shaders and these buffers you make are all stored on the GPU after being uploaded by the CPU. The reason you don't just create all vertices on the GPU? It's the same reason for why you load an image from the hard drive instead of creating a raw 2D array and start filling it up with your pixel data inline. Even then, your image would be stored in the executable program file, which is stored on the hard disk and only loaded to memory when you run it. In an actual application, you'll want to load your graphics off assets stored somewhere (usually the hard drive). Why not let the GPU load the assets from the hard drive by itself? The GPU isn't connected to a hardware's storage directly, but barely to the system's main memory via some BUS. That's because to connect to any storage directly, the GPU will have to deal with the file system which is managed by the OS. That's one of the things the CPU would be faster at doing since we're dealing with serialized data.
Now what shaders deal with is this data you upload to the GPU (vertices, texture coordinates, textures..etc). In ancient OpenGL, no one had to write any shaders. Graphics drivers came with a builtin pipeline which handles regular rendering requests for you. You'd provide it with 4 vertices, 4 texture coordinates and a texture among other things (transformation matrices..etc), and it'd draw your graphics for you on the screen. You could go a bit farther and add some lights to your scene and maybe customize a few things about it, but things were still pretty tight. New OpenGL specifications gave more freedom to the developer by allowing them to rewrite parts of the pipeline with shaders. The developer becomes responsible for transforming vertices into place and doing all sort of other calculations related to lighting etc.
I would very much appreciate at least a point in the right direction.
I am guessing it has something to do with uniforms, but really, with
me skipping pages, I really cannot understand how a shader program
runs or what the lifetime of the variables is.
uniforms are variables you can send to the shaders from the CPU every frame before you use it to render graphics. When you use the saturation slider in Photoshop or Gimp, it (probably) sends the saturation factor value to the shader as a uniform of type float. uniforms are what you use to communicate little settings like these to your shaders from your application.
To use a shader program, you first have to set it up. A shader program consists of at least 2 types of shaders linked together, a fragment shader and a vertex shader. You use some OpenGL functions to upload your shader sources to the GPU, issue an order of compilation followed by linking, and it'll give you the program's ID. To use this program, you simply glUseProgram(programId) and everything following this call will use it for drawing. The vertex shader is the code that runs on the vertices you send to position them on the screen correctly. This is where you can do transformations on your geometry like scaling, rotation etc. A fragment shader runs at some stage afterwards using interpolated (transitioned) values outputted from the vertex shader to define the color and the depth of every unit fragment on what you're drawing. This is where you can do post-processing effects on your pixels.
Anyway, I hope I've helped making a few things clearer to you, but I can only tell you that there are no shortcuts. OpenGL has quite a steep learning curve, but it all connects and things start to make sense after a while. If you're getting so bored of books and such, then consider maybe taking code snippets of every lesson, compile them, and start messing around with them while trying to rationalize as you go. You'll have to resort to written documents eventually, but hopefully then things will fit easier into your head when you have some experience with the implementation components. Good luck.
Edit:
If you're trying to generate vertices on the fly using some algorithm, then try looking into Geometry Shaders. They may give you what you want.
You probably want to use CUDA for the things you are used to do in C or C++, and let OpenGL access the rasterizer and other graphics stuff.
OpenGL an CUDA interact somehow nicely. A good entry point to customize the contents of a buffer object is here: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html#group__CUDART__OPENGL_1g0fd33bea77ca7b1e69d1619caf44214b , with cudaGraphicsGLRegisterBuffer method.
You may also want to have a look at the nbody sample from NVIDIA GPU SDK samples the come with current CUDA installs.

Per-line texture processing accelerated with OpenGL/OpenCL

I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
Yes, you can do it. No, you don't need 4.X hardware for that, you need fragment shaders (with flow control), framebuffer objects and floating point texture support.
You need to encode your data into 2D texture.
Store "state variable" in 1st pixel for each row, and encode the rest of the data into the rest of the pixels. It goes without saying that it is recommended to use floating point texture format.
Use two framebuffers, and render them onto each other in a loop using fragment shader that updates "state variable" at the first column, and performs whatever operation you need on another column, which is "current". To reduce amount of wasted resources you can limit rendering to columns you want to process. NVidia OpenGL SDK examples had "game of life", "GDGPU fluid", "GPU partciles" demos that work in similar fashion - by encoding data into texture and then using shaders to update it.
However, because you can do it, it doesn't mean you should do it and it doesn't mean that it is guaranteed to be fast. Some GPUs might have a very high memory texture memory read speed, but relatively slow computation speed (and vice versa) and not all GPUs have many conveyors for processing things in parallel.
Also, depending on your app, CUDA or OpenCL might be more suitable.

Read Framebuffer-texture like an 1D array

I am doing some gpgpu calculations with GL and want to read my results from the framebuffer.
My framebuffer-texture is logically an 1D array, but I made it 2D to have a bigger area. Now I want to read from any arbitrary pixel in the framebuffer-texture with any given length.
That means all calculations are already done on GPU side and I only need to pass certain data to the cpu that could be aligned over the border of the texture.
Is this possible? If yes is it slower/faster than glReadPixels on the whole image and then cutting out what I need?
EDIT
Of course I know about OpenCL/CUDA but they are not desired because I want my program to run out of the box on (almost) any platform.
Also I know that glReadPixels is very slow and one reason might be that it offers some functionality that I do not need (Operating in 2D). Therefore I asked for a more basic function that might be faster.
Reading the whole framebuffer with glReadPixels just to discard it all except for a few pixels/lines would be grossly inefficient. But glReadPixels lets you specify a rect within the framebuffer, so why not just restrict it to fetching the few rows of interest ? So you maybe end up fetching some extra data at the start and end of the first and last lines fetched, but I suspect the overhead of that is minimal compared with making multiple calls.
Possibly writing your data to the framebuffer in tiles and/or using Morton order might help structure it so a tighter bounding box can be be found and the extra data retrieved minimised.
You can use a pixel buffer object (PBO) to transfer pixel data from the framebuffer to the PBO, then use glMapBufferARB to read the data directly:
http://www.songho.ca/opengl/gl_pbo.html