Calculate sum of pixels written by shader in Unity3D - opengl

For each pixel written by a shader in Unity3D, I would like to add its value to a global variable so that I can read the total back later. So, if the shader program iterates over 1000 pixels writing the color (1,0.5,0), the result I want back in my program would be (1000,500,0).
The actual calculation is much more complex of course, and very time consuming as it has to be done millions of times, doing calculations using multiple textures. So, I need to take advantage of the parallel computing ability of the GPU.
I have read a little about compute shaders, which allow shaders to write numbers that can be read back, but I am having a hard time finding a relevant example.
Any pointers would be useful.
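A common way to get one total out of many parallel writes is a pairwise (tree) reduction. The following is only a CPU-side C++ sketch of that idea, not Unity or compute shader code; the Rgb type and the function names are hypothetical. On a GPU, each iteration of the inner loop would be one thread and each pass would be one dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical RGB pixel type used only for this sketch.
struct Rgb {
    float r, g, b;
};

// One reduction pass: element i absorbs element i + half.
// Returns the number of elements that still hold partial sums.
std::size_t reducePass(std::vector<Rgb>& pixels, std::size_t active) {
    assert(active <= pixels.size());
    const std::size_t half = (active + 1) / 2;  // round up so odd counts work
    for (std::size_t i = 0; i + half < active; ++i) {
        pixels[i].r += pixels[i + half].r;
        pixels[i].g += pixels[i + half].g;
        pixels[i].b += pixels[i + half].b;
    }
    return half;
}

// Repeats passes until a single element holds the total.
Rgb sumPixels(std::vector<Rgb> pixels) {
    if (pixels.empty()) return Rgb{0.0f, 0.0f, 0.0f};
    std::size_t active = pixels.size();
    while (active > 1) active = reducePass(pixels, active);
    return pixels[0];
}
```

Calling sumPixels on 1000 copies of (1, 0.5, 0) gives (1000, 500, 0), matching the example in the question. The number of passes grows with the logarithm of the pixel count, which is what makes the pattern a good fit for parallel hardware.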


Metal Texture updating in a very weird way

I am in the process of trying to write a Monte Carlo Path tracer using Metal. I have the entire pipeline working (almost) correctly. I have some weird banding issues, but that seems like it has more to do with my path tracing logic than with Metal.
For people who may not have experience with path tracers: a path tracer generates rays from the camera, bounces them around the scene in random directions up to a given depth (in my case, 8), and at every intersection/bounce shades the ray with that material's color. The end goal is for it to "converge" to a very nice and clean image by averaging many iterations. In my code, my Metal compute pipeline is run over and over again, with each run of the pipeline representing one iteration.
The way I've structured my compute pipeline is using the following stages:
Generate Rays
Looping 8 times (i.e. bounce the ray around 8 times) :
1 - Compute Ray Intersections and Bounce it off that intersection (recalculate direction)
2 - "Shade" the ray's color based on that intersection
Get the average of all iterations by getting the current texture buffer's color, multiplying it by iteration and then add the ray's color to it, then divide by iteration+1. Then store that new combined_color in the same exact buffer location.
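The averaging step described above can be written down as a small CPU-side C++ sketch. The Color type and the function names are hypothetical, and the real version would run per pixel inside the Metal kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical color type used only for this sketch.
struct Color {
    float r, g, b;
};

// Folds one new sample into the running mean stored in the buffer.
// `iteration` is the number of samples already averaged into `stored`.
Color accumulate(Color stored, Color sample, std::size_t iteration) {
    const float n = static_cast<float>(iteration);
    return Color{(stored.r * n + sample.r) / (n + 1.0f),
                 (stored.g * n + sample.g) / (n + 1.0f),
                 (stored.b * n + sample.b) / (n + 1.0f)};
}

// Applies accumulate() once per iteration, as the render loop would.
Color averageAll(const std::vector<Color>& samples) {
    assert(!samples.empty());
    Color stored{0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < samples.size(); ++i) {
        stored = accumulate(stored, samples[i], i);
    }
    return stored;
}
```

The update only yields the true mean if `stored` really is the result of the previous iteration, so the buffer that is read must be the same one that was written last time.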
So on an even higher level what my entire Renderer does is:
1 - Do a bunch of ray calculations in a compute shader
2 - Update Buffer (which is the MTKView's drawable)
The problem is that, for some reason, my texture cycles between 3 different levels of color accumulation and keeps glitching between different colors, almost as if there are three different programs trying to write to the same buffer. This can't be due to race conditions, because each thread reads from and writes to the same buffer location, right? How can this be happening?
Here is my system trace of the first few iterations:
As you can see, for the first few iterations, it doesn't render anything for some reason, and they're super fast. I have no clue why that is. Then later, the iterations are very slow. Here is a close-up of that first part:
I've tried outputting just a single iteration's color every time, and that seems perfectly fine, except that the picture then doesn't converge to a clean image (convergence only happens when multiple iterations are averaged).
I've tried using semaphores to synchronize things, but all I end up with is a stalled program because I keep waiting for the command buffer and for some reason it is never ready. I think I may just not be getting semaphores. I've tried looking things up and I seem to be doing it right.
Help.. I've been working on this bug for two days. I can't fix it. I've tried everything. I just do not know enough to even begin to discern the problem. Here is a link to the project. The system trace file can be found under System Traces/NaiveIntegrator.trace. I hate to just paste in my code, and I understand this is not recommended at SO, but the problem is I just have no idea where the bug could be. I promise that as soon as the problem is fixed I'll paste the relevant code snippets here.
If you run the code you will see a simple cornell box with a yellow sphere in the middle. If you leave the code running for a while you'll see that the third image in the cycle eventually converges to a decent image (ignoring the banding effect on the floor that is probably irrelevant). But the problem is that the image keeps flickering between 3 different images.
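One possible explanation, offered as an assumption that the post itself does not confirm: a Metal layer hands out drawables from a small pool (three by default), so accumulating directly into the drawable's texture would maintain three separate running averages, each seeing only every third frame. The C++ sketch below simulates that on the CPU with one scalar per texture; all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Outcome of the simulation: what each of the three rotating textures
// holds, and what a single persistent accumulation buffer would hold.
struct Result {
    std::array<float, 3> drawables;
    float persistent;
};

// Same running-mean update the post describes, for one channel.
float accumulate(float stored, float sample, std::size_t iteration) {
    const float n = static_cast<float>(iteration);
    return (stored * n + sample) / (n + 1.0f);
}

Result simulate(const std::vector<float>& samples) {
    assert(!samples.empty());
    Result out{};  // all textures start at zero
    for (std::size_t frame = 0; frame < samples.size(); ++frame) {
        // The frame only ever sees the texture it was handed this time.
        float& target = out.drawables[frame % 3];
        target = accumulate(target, samples[frame], frame);
        // A buffer owned by the renderer sees every frame.
        out.persistent = accumulate(out.persistent, samples[frame], frame);
    }
    return out;
}
```

With samples 1 through 6, the persistent buffer ends at the true mean 3.5, while the three rotating textures end at three different, wrong values. If this is what is happening, accumulating into a texture the renderer owns and copying it to the drawable each frame would avoid the cycling.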

Post-processing individual MSAA samples in CPU

I'm interested in sub-pixel sampling my OpenGL renders around the edge silhouettes of my meshes for a computer vision task. I'm thinking of using MSAA to do it efficiently (but the application is not for anti-aliasing). The problem I find with multisampling is that, in order to read the samples from the GPU, I can only blit the framebuffer into a non-multisampled one, so I cannot recover individual sample information. My questions are:
Is there a way to implement a fragment shader that stores the results of a per-sample (GL_SAMPLE_SHADING) computation such that I can read those samples back to the CPU? I've thought of using gl_SampleID to index the output to different out buffers, but I don't know if that's possible at all. Perhaps a method like the linked-list structures used for OIT (order-independent transparency) would work. However, there all computations are performed on the GPU, so I'm not sure if I can read the linked-list data from the CPU in any way.
Maybe MSAA is the wrong approach and there are other methods to do so. I guess my last resort is to super sample the render x times and thus recover individual samples, but that seems to be a very inefficient solution.
You can write a compute shader which reads each sample's data via imageLoad and then writes it to an SSBO (FS outputs and image load/store would not be appropriate for the output). You'll need the usual memory barrier synchronization when it comes time to read it, but this way you can write directly to a buffer object, rather than having to use a PBO to read from a texture.
The hardest part will be converting gl_GlobalInvocationID and the other compute shader inputs into the index in the SSBO array as well as the texture coordinate and sample index for your imageLoad operation.
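The index conversion mentioned above might look like the following C++ sketch. It assumes a layout the answer does not specify: pixels stored row-major, with all samples of one pixel contiguous in the SSBO, and a dispatch where the invocation's x and y select the pixel and z selects the sample. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Pixel coordinate plus sample number, i.e. the arguments a multisample
// imageLoad would need (assumed to come from the invocation ID).
struct SampleCoord {
    std::size_t x, y, sample;
};

// Flattens (x, y, sample) into an index into the output array.
std::size_t ssboIndex(SampleCoord c, std::size_t width,
                      std::size_t samplesPerPixel) {
    assert(c.x < width && c.sample < samplesPerPixel);
    return (c.y * width + c.x) * samplesPerPixel + c.sample;
}

// Inverse mapping, useful on the CPU when reading the buffer back.
SampleCoord coordFromIndex(std::size_t index, std::size_t width,
                           std::size_t samplesPerPixel) {
    assert(width > 0 && samplesPerPixel > 0);
    SampleCoord c;
    c.sample = index % samplesPerPixel;
    const std::size_t pixel = index / samplesPerPixel;
    c.x = pixel % width;
    c.y = pixel / width;
    return c;
}
```

For an 8-pixel-wide image with 4 samples per pixel, sample 1 of pixel (3, 2) lands at index 77, and coordFromIndex(77, 8, 4) recovers the same coordinate.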

OpenGL constructing and using data on the GPU

I am not a graphics programmer, I use C++ and C mainly, and every time I try to go into OpenGL, every book, and every resource starts like this:
GLfloat Vertices[] = {
    some, numbers, here,
    some, more, numbers,
    numbers, numbers, numbers
};
Or they may even be vec4.
But then you do something like this:
for(int i = 0; i < 10000; i++)
    for(int j = 0; j < 10000; j++)
        make_vertex(); // compute one vertex per (i, j) grid point
And you get a problem. That loop is going to take a significant amount of time to finish, and if the make_vertex() function is anything like a saxpy or something of the sort, it is not just a problem... it is a big problem. For example, let us assume I wish to create fractal terrain. For any modern graphics card this would be trivial.
I understand the paradigm goes like this: Write the vertices manually -> Send them over to the GPU -> GPU does vertex processing, geometry, rasterization all the good stuff. I am sure it all makes sense. But why do I have to do the entire 'Send it over' step? Is there no way to skip that entire intermediary step, and just create vertices on the GPU, and draw them, without the obvious bottleneck?
I would very much appreciate at least a point in the right direction.
I also wonder if there is a possible solution without delving into compute shaders or CUDA? Does OpenGL or GLSL not provide a suitable random function which can be executed in parallel?
I think what you're asking for could work by generating height maps with a compute shader, and mapping that onto a grid with fixed spacing which can be generated trivially. That's a possible solution off the top of my head. You can use GL Compute shaders, OpenCL, or CUDA. Details can be generated with geometry and tessellation shaders.
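As a rough illustration of the fixed-spacing grid idea, here is a CPU-side C++ sketch in which every vertex is derived purely from its index. The heightAt function is a made-up placeholder, not real fractal noise, and all names are hypothetical. On the GPU the same index arithmetic could run once per vertex or per compute invocation, so no vertex list would need to be authored by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex {
    float x, y, z;
};

// Placeholder height function standing in for a real height map lookup.
float heightAt(std::size_t col, std::size_t row) {
    return static_cast<float>((col * 7 + row * 13) % 5);
}

// Builds a cols-by-rows grid with fixed spacing. x and z follow from the
// grid index alone; only y depends on the height data.
std::vector<Vertex> buildTerrain(std::size_t cols, std::size_t rows,
                                 float spacing) {
    assert(cols > 0 && rows > 0);
    std::vector<Vertex> vertices(cols * rows);
    for (std::size_t i = 0; i < vertices.size(); ++i) {
        const std::size_t col = i % cols;
        const std::size_t row = i / cols;
        vertices[i] = Vertex{static_cast<float>(col) * spacing,
                             heightAt(col, row),
                             static_cast<float>(row) * spacing};
    }
    return vertices;
}
```

Because each loop iteration reads nothing but its own index, the iterations are independent, which is the property that lets this kind of generation move to the GPU.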
As for preventing the camera from clipping, you'd probably have to use transform feedback and do a check per frame to see if the direction you're moving in will intersect the geometry.
Your entire question seems to be built on a huge misconception, that vertices are the only things which need to be "crunched" by the GPU.
First, you should understand that GPUs are far superior to CPUs when it comes to parallelism (heck, GPUs sacrifice efficient conditional branching for the sake of parallelism). Second, shaders and the buffers you make are all stored on the GPU after being uploaded by the CPU. The reason you don't just create all vertices on the GPU? It's the same reason you load an image from the hard drive instead of creating a raw 2D array and filling it with your pixel data inline. Even then, your image would be stored in the executable program file, which is stored on the hard disk and only loaded into memory when you run it. In an actual application, you'll want to load your graphics from assets stored somewhere (usually the hard drive). Why not let the GPU load the assets from the hard drive by itself? The GPU isn't connected to the machine's storage directly; it is connected only to the system's main memory, via a bus. To access storage directly, the GPU would have to deal with the file system, which is managed by the OS. That's one of the things the CPU is faster at doing, since we're dealing with serialized data.
Now, what shaders deal with is the data you upload to the GPU (vertices, texture coordinates, textures, etc.). In ancient OpenGL, no one had to write any shaders. Graphics drivers came with a built-in pipeline which handled regular rendering requests for you. You'd provide it with 4 vertices, 4 texture coordinates and a texture, among other things (transformation matrices, etc.), and it would draw your graphics on the screen. You could go a bit further and add some lights to your scene and maybe customize a few things about it, but things were still pretty tight. Newer OpenGL specifications gave more freedom to developers by allowing them to rewrite parts of the pipeline with shaders. The developer becomes responsible for transforming vertices into place and doing all sorts of other calculations related to lighting, etc.
"I would very much appreciate at least a point in the right direction. I am guessing it has something to do with uniforms, but really, with me skipping pages, I really cannot understand how a shader program runs or what the lifetime of the variables is."
Uniforms are variables you can send to the shaders from the CPU every frame before you use them to render graphics. When you use the saturation slider in Photoshop or Gimp, it (probably) sends the saturation factor value to the shader as a uniform of type float. Uniforms are what you use to communicate little settings like these to your shaders from your application.
To use a shader program, you first have to set it up. A shader program consists of at least 2 types of shaders linked together: a fragment shader and a vertex shader. You use some OpenGL functions to upload your shader sources to the GPU and issue an order of compilation followed by linking, and it'll give you the program's ID. To use this program, you simply call glUseProgram(programId), and everything following this call will use it for drawing. The vertex shader is the code that runs on the vertices you send, to position them on the screen correctly. This is where you can do transformations on your geometry like scaling, rotation, etc. A fragment shader runs at some stage afterwards, using interpolated (transitioned) values output by the vertex shader to define the color and the depth of every fragment of what you're drawing. This is where you can do post-processing effects on your pixels.
Anyway, I hope I've helped make a few things clearer to you, but I can only tell you that there are no shortcuts. OpenGL has quite a steep learning curve, but it all connects and things start to make sense after a while. If you're getting bored of books and such, then consider taking the code snippets from every lesson, compiling them, and messing around with them while trying to rationalize as you go. You'll have to resort to written documents eventually, but hopefully things will fit more easily into your head once you have some experience with the implementation components. Good luck.
If you're trying to generate vertices on the fly using some algorithm, then try looking into Geometry Shaders. They may give you what you want.
You probably want to use CUDA for the things you are used to doing in C or C++, and let OpenGL access the rasterizer and other graphics stuff.
OpenGL and CUDA interoperate fairly nicely. A good entry point for customizing the contents of a buffer object is the cudaGraphicsGLRegisterBuffer method.
You may also want to have a look at the nbody sample from the NVIDIA GPU SDK samples that come with current CUDA installs.

When using Direct 3D, what should be processed in code and what should be processed in HLSL?

I am very new to 3D programming, namely with DirectX. I have been trying to follow tutorials on how to do basic things, and I have been looking at the samples provided by Microsoft. One of the big questions I have had is how to tell what calculations should be done in the actual game code and what calculations should be done in HLSL. I have not been able to understand what should be done where, because it looks to me like you could have almost all code pertaining to calculations in your shader file, or you could have it all in the executable code and only send the bare minimum to the pixel and vertex shaders. How can one tell what code should go where? If you need an example, I'll try to find one.
"Code" - CPU code
"HLSL" - GPU code
Basically, you want everything that is pure graphics to happen on the GPU. That is, when the information about what you want to render has been sent to the GPU, it should take over and use that information to generate the final image.
You want the CPU to say to the GPU "this is what I want to render, and here is everything you need to make it happen" and then make sure to tell the GPU "this is how you render it".
Some examples (not a complete or final list in any way):
CPU:
Anything dealing with window opening/closing/resizing
User input from mouse, keyboard
Reading and setting configuration
Generating and updating view matrices
Application logic
Setting up and initializing rendering (textures, buffers etc)
Generating vertex data (position, texture coordinates etc)
Creating graphic entities (triangles, textures, colors etc)
Handling animation (timestepping, swapping buffers)
Sending updated data to the GPU for each frame
GPU:
Use the view matrices to put things in the right place on the screen
Interpolate from vertex data to fragment data
Shading (usually, this is the most complicated part)
Calculate and write final pixel color

Per-line texture processing accelerated with OpenGL/OpenCL

I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
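The per-row parallelism described in this answer can be sketched on the CPU in C++. Here processRow stands in for what a single work-item would do, with a running sum as a stand-in for the real per-row algorithm; both the operation and the names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work done by one hypothetical work-item: walk a single row left to
// right, carrying state that starts fresh for every row.
void processRow(std::vector<int>& image, std::size_t width,
                std::size_t row) {
    assert((row + 1) * width <= image.size());
    int state = 0;  // reset at the beginning of the row
    for (std::size_t col = 0; col < width; ++col) {
        int& pixel = image[row * width + col];
        state += pixel;  // state persists across the columns
        pixel = state;   // modify the pixel in place
    }
}

// Rows never touch each other's pixels, so these calls are independent
// and could each be handed to a separate work-item.
void processImage(std::vector<int>& image, std::size_t width) {
    assert(width > 0 && image.size() % width == 0);
    const std::size_t height = image.size() / width;
    for (std::size_t row = 0; row < height; ++row) {
        processRow(image, width, row);
    }
}
```

A 3-wide image holding 1 2 3 and 4 5 6 becomes 1 3 6 and 4 9 15. The inner loop stays sequential, so the available parallelism equals the number of rows, which is the limitation the answer points out.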
Yes, you can do it. No, you don't need 4.X hardware for that, you need fragment shaders (with flow control), framebuffer objects and floating point texture support.
You need to encode your data into a 2D texture.
Store "state variable" in 1st pixel for each row, and encode the rest of the data into the rest of the pixels. It goes without saying that it is recommended to use floating point texture format.
Use two framebuffers, and render them onto each other in a loop using a fragment shader that updates the "state variable" in the first column and performs whatever operation you need on another column, which is "current". To reduce the amount of wasted resources, you can limit rendering to the columns you want to process. The NVIDIA OpenGL SDK examples had "game of life", "GPGPU fluid" and "GPU particles" demos that work in a similar fashion, by encoding data into a texture and then using shaders to update it.
However, just because you can do it doesn't mean you should, and it doesn't mean that it is guaranteed to be fast. Some GPUs might have a very high texture memory read speed but relatively slow computation speed (and vice versa), and not all GPUs have many pipelines for processing things in parallel.
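The two-framebuffer loop can be simulated on the CPU as a rough C++ sketch. Column 0 of each row holds the state variable, each pass reads only from the source texture and writes only to the destination, and the two are swapped after every pass. A running maximum is used as a stand-in operation; it and all names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Texture = std::vector<float>;  // row-major, `width` texels per row

// One "render pass": every output texel is computed from `src` only, the
// way a fragment shader samples one texture while rendering into another.
void renderPass(const Texture& src, Texture& dst, std::size_t width,
                std::size_t current) {
    assert(src.size() == dst.size() && current > 0 && current < width);
    const std::size_t height = src.size() / width;
    for (std::size_t row = 0; row < height; ++row) {
        const float state = src[row * width];            // column 0
        const float value = src[row * width + current];  // current column
        for (std::size_t col = 0; col < width; ++col) {
            float out = src[row * width + col];  // default: copy through
            if (col == 0 || col == current) out = std::max(state, value);
            dst[row * width + col] = out;
        }
    }
}

// Runs one pass per data column, ping-ponging between two textures.
// The caller resets column 0 of every row before calling.
Texture runningMax(Texture a, std::size_t width) {
    assert(width > 1 && a.size() % width == 0);
    Texture b(a.size());
    for (std::size_t current = 1; current < width; ++current) {
        renderPass(a, b, width, current);
        std::swap(a, b);  // last output becomes next input
    }
    return a;
}
```

For a row holding state -1000 followed by 3, 1, 5, the result is 5, 3, 3, 5: the final state plus the running maximum at each column. Note that this takes one full pass per column, which is part of why the approach is not guaranteed to be fast.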
Also, depending on your app, CUDA or OpenCL might be more suitable.