GPU particle metaball-surface rendering - c++

I have a question about a very specific method for rendering surface particles. The method is explained very well in chapter 7 of Nvidia's GPU Gems 3, "Point-Based Visualization of Metaballs on a GPU".
The article is about rendering an implicit surface using points or splats that are evenly distributed over the surface. They say that the computation of these particles is done completely on the GPU. Only the data which defines the surface is sent from CPU to the GPU to keep the traffic as low as possible.
They also gave some pseudo code examples of fragment shader programs to compute the particle positions, velocity etc. and for me it looks like these programs should run once for every particle.
Now my question is, how do they store these particles? What kind of data structure is it?
It must be some kind of buffer or texture that can be accessed for reading as well as for writing operations on the GPU. But how do I render this buffer/texture again in the next rendering step?
My first idea was some kind of vertex-buffer-object which is sent to the GPU once at the beginning and continuously updated there at each rendering pass. Is that possible at all?
One requirement for me is that it must be implemented using OpenGL/GLSL, I hope that is possible.

Yes, you need some kind of VBO and repeated passes over the same data. The data structure can be an SoA (Struct of Arrays) or an AoS (Array of Structs), depending on how you prefer to code the access to the different properties, i.e.:
- SoA: a Positions array, a Speed array and a Normal array.
- AoS: just one array containing [Position, Speed, Normal] for each particle.
An AoS is the same as an interleaved array for rendering, where a single array holds all the properties of the mesh.
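To make the two layouts concrete, here is a minimal CPU-side sketch. The names `Vec3`, `ParticlesSoA`, `ParticleAoS` and `interleave` are made up for illustration and do not come from the GPU Gems chapter. The stride and offsets are the values an interleaved layout would typically pass to `glVertexAttribPointer`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3-component vector; a real project would use its math library's type.
struct Vec3 { float x, y, z; };

// SoA: one tightly packed array per property (one VBO or texture each).
struct ParticlesSoA {
    std::vector<Vec3> positions;
    std::vector<Vec3> speeds;
    std::vector<Vec3> normals;
};

// AoS: one interleaved array holding every property of a particle.
struct ParticleAoS {
    Vec3 position;
    Vec3 speed;
    Vec3 normal;
};

// Stride and byte offsets of the interleaved layout.
constexpr std::size_t kStride       = sizeof(ParticleAoS);
constexpr std::size_t kSpeedOffset  = offsetof(ParticleAoS, speed);
constexpr std::size_t kNormalOffset = offsetof(ParticleAoS, normal);

// Interleave an SoA set into AoS form; all three arrays must have equal length.
std::vector<ParticleAoS> interleave(const ParticlesSoA& soa) {
    assert(soa.speeds.size() == soa.positions.size());
    assert(soa.normals.size() == soa.positions.size());
    std::vector<ParticleAoS> out;
    out.reserve(soa.positions.size());
    for (std::size_t i = 0; i < soa.positions.size(); ++i)
        out.push_back({soa.positions[i], soa.speeds[i], soa.normals[i]});
    return out;
}
```

Both layouts carry the same data, so the choice mostly affects how the shader or attribute setup addresses each property.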
You could use either a VBO or a texture; the main difference is the way caching is done, since textures are optimized for 2D access.
The rendering is done in steps exactly like you are picturing it, so all you need to do is to "render" the physical stepping of the system using shaders that compute the properties you want and then bind the same structures to the true graphics rendering in a subsequent step.
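One common way to organize the "physics pass, then draw pass" loop is to keep two copies of the particle data and swap them each step (often called ping-ponging), because a pass generally cannot safely read the buffer it is writing. The sketch below is a CPU stand-in for that idea, not the chapter's implementation. `Particle`, `PingPong` and `step` are hypothetical names, and the simple Euler update is only a placeholder for whatever the shader would compute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle { float pos[3]; float vel[3]; };

// Two copies of the particle data: each step reads one and writes the other.
struct PingPong {
    std::vector<Particle> buffers[2];
    int readIndex = 0;

    explicit PingPong(const std::vector<Particle>& initial) {
        buffers[0] = initial;
        buffers[1] = initial;  // same size, so step() can write by index
    }

    // The buffer the draw pass would bind after the latest step.
    const std::vector<Particle>& front() const { return buffers[readIndex]; }

    // "Physics pass": stands in for the shader that updates every particle.
    void step(float dt) {
        const std::vector<Particle>& src = buffers[readIndex];
        std::vector<Particle>& dst = buffers[1 - readIndex];
        for (std::size_t i = 0; i < src.size(); ++i) {
            dst[i] = src[i];
            for (int k = 0; k < 3; ++k)
                dst[i].pos[k] = src[i].pos[k] + src[i].vel[k] * dt;
        }
        readIndex = 1 - readIndex;  // the freshly written buffer is drawn next
    }
};
```

On the GPU the two vectors would be two buffers or textures that never leave video memory; only the roles "source" and "destination" swap.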


Can GLSL handle buffers with arbitrary length?

I have an art application I'm dabbling with that uses OpenGL for accelerated graphics rendering. I'd like to be able to add the ability to draw arbitrary piecewise curves - pretty much the same sort of shapes that can be defined by the SVG 'path' element.
Rather than tessellating my paths into polygons on the CPU, I thought it might be better to pass an array of values in a buffer to my shader defining the pieces of my curve and then using an in/out test to check which pixels were actually inside. In other words, I'd be iterating through a potentially large array of data describing each segment in my path.
From what I remember back when I learned shader programming years ago, GPUs handle if statements by evaluating both branches and then throwing away the branch that wasn't used. This would effectively mean that it would end up silently running through my entire buffer even if I only used a small part of it (i.e., my buffer has the capacity to handle 1024 curve segments, but the simple rectangle I'm drawing only uses the first four of them).
How do I write my code to deal with this variable data? Can modern GPUs handle conditional code like this well?
GPUs can handle arbitrary-length buffers and conditionals (or fake it convincingly). The problem is that vertex and geometry shaders cannot generate an arbitrary number of triangles from a short description.
OpenGL 4.0 added two new types of shaders: Tessellation Control shaders and Tessellation Evaluation shaders. These shaders give you the ability to tessellate curves and surfaces on the GPU.
I found this tutorial to be quite useful in showing how to tessellate Bezier curves on the GPU.
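For the in/out test described in the question, the usual way to avoid walking the whole buffer is to pass the number of segments actually in use and loop only that far. The following is a CPU sketch of that idea in C++, not shader code. `Segment` and `insidePath` are hypothetical names, and curves are assumed to be already flattened to straight segments. In a shader the count would typically arrive as a uniform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Segment { float x0, y0, x1, y1; };

// Even-odd in/out test: count crossings of a ray going in +x from (px, py).
// Only the first `usedCount` segments are visited, not the buffer's capacity.
bool insidePath(const std::vector<Segment>& buffer, std::size_t usedCount,
                float px, float py) {
    assert(usedCount <= buffer.size());
    bool inside = false;
    for (std::size_t i = 0; i < usedCount; ++i) {
        const Segment& s = buffer[i];
        // Does the segment straddle the horizontal line y = py?
        if ((s.y0 > py) != (s.y1 > py)) {
            const float xCross =
                s.x0 + (py - s.y0) * (s.x1 - s.x0) / (s.y1 - s.y0);
            if (xCross > px) inside = !inside;  // crossing to the right of the point
        }
    }
    return inside;
}
```

With a buffer sized for 1024 segments and a rectangle that uses four of them, the loop body runs four times per pixel.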

Tiled deferred shading without compute shader

I'm building a deferred renderer, and since I want to support a large number of lights in the scene, I've had a look at tiled deferred shading.
The problem is that I have to target OpenGL 3.3 hardware and it doesn't support GLSL compute shaders.
Is there a possibility to implement tiled deferred shading with normal shaders?
Tiled deferred rendering does not strictly require a compute shader. What it requires is that, for each tile, you have a series of lights which it will process. A compute shader is merely one way to accomplish that.
An alternative is to build the light lists for each tile's frustum on the CPU, then upload that data to the GPU for its eventual use. Obviously this requires much more memory work than the CS version. But it's probably not that expensive, and it allows you to easily play with tile sizes to find the best one. More tiles means more CPU work and more data to be uploaded, but fewer lights per tile (generally speaking) and more efficient processing.
One way to do that for GL 3.3-class hardware is to make each tile a separate quad. The quad will be given, as part of its per-vertex parameters, the starting index for its part of the total light list and the total number of lights for that tile to process. The idea is that there is a globally accessible array, and each tile has a contiguous region of this array that it will process.
This array could be the actual lights themselves, or it could be indices into a second (much smaller) array of lights. You'll have to measure the difference to tell if it's worthwhile to have the additional indirection in the access.
The primary array should probably be a buffer texture, since it can get quite large, depending on the number of lights and tiles. If you go with the indirect route, then the array of actual light data will likely fit into a uniform block. But in either case, you're going to need to employ buffer streaming techniques when uploading to it.
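As a rough illustration of the CPU-side list building described above, the sketch below bins lights into tiles and produces one flat index array plus a (start, count) pair per tile. It follows the indirect variant, where the flat array holds indices into a separate light array. `LightBounds`, `TileRange`, `TileLightLists` and `buildTileLightLists` are hypothetical names, and the screen-space circle test is an assumed simplification; a real renderer would more likely test each light against the tile's frustum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical screen-space light bound: a circle in pixels.
struct LightBounds { float cx, cy, radius; };

// Per-tile range into the flat index list (what each tile's quad would carry).
struct TileRange { std::uint32_t start, count; };

struct TileLightLists {
    std::vector<TileRange> tiles;             // row-major, tilesX * tilesY entries
    std::vector<std::uint32_t> lightIndices;  // one contiguous region per tile
};

TileLightLists buildTileLightLists(const std::vector<LightBounds>& lights,
                                   int width, int height, int tileSize) {
    assert(tileSize > 0);
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    TileLightLists out;
    out.tiles.reserve(static_cast<std::size_t>(tilesX) * tilesY);
    for (int ty = 0; ty < tilesY; ++ty) {
        for (int tx = 0; tx < tilesX; ++tx) {
            const float minX = float(tx * tileSize), maxX = minX + tileSize;
            const float minY = float(ty * tileSize), maxY = minY + tileSize;
            TileRange range{static_cast<std::uint32_t>(out.lightIndices.size()), 0};
            for (std::uint32_t i = 0; i < lights.size(); ++i) {
                // Circle vs. rectangle overlap: clamp the centre to the tile.
                const float nx = std::max(minX, std::min(lights[i].cx, maxX));
                const float ny = std::max(minY, std::min(lights[i].cy, maxY));
                const float dx = lights[i].cx - nx, dy = lights[i].cy - ny;
                if (dx * dx + dy * dy <= lights[i].radius * lights[i].radius) {
                    out.lightIndices.push_back(i);
                    ++range.count;
                }
            }
            out.tiles.push_back(range);
        }
    }
    return out;
}
```

The `lightIndices` vector is what would be streamed into the buffer texture each frame, and each `TileRange` becomes the per-vertex data of that tile's quad.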

Issues about shaders and transformations in OpenGL

If I'm not wrong, shaders are programs that run on the GPU, right?
Do we send data to these programs using glUniformMatrix*?
I don't know if this is right, but if I send an MVP matrix to the shader, the vertices of the object I want to render will use the position calculated by the shader right before the render function is called.
If I want to render a lot of objects, I must send the MVP matrix and then render the object right after, so I will have code that does send to GPU -> render many times. However, if I'm not wrong again, this is not good practice, because sending information to the GPU is very expensive and costs performance. So a way to get better performance is to send all the information to the GPU and then render all the objects.
And the million-dollar question is: how can the shader program tell that an MVP matrix belongs to one object and not another?
If I'm not wrong, shaders are programs that run on the GPU, right?
Possibly. Many implementations of OpenGL have software renderers that they can fall back to if resources on the GPU are constrained. But usually, yes, they're run on the GPU.
Do we send data to these programs using glUniformMatrix*?
That's the usual way. You also set things like texture coordinates either via immediate mode methods like glTexCoord*() (in legacy OpenGL), or via buffer objects.
I don't know if this is right, but if I send an MVP matrix to the shader, the vertices of the object I want to render will use the position calculated by the shader right before the render function is called.
There are different types of shaders. A vertex shader is called once for each vertex. A fragment shader is called once per fragment (roughly once per output screen-space pixel that actually gets drawn). Generally you will probably want to send the model, view, and projection matrices separately to the vertex shader. (Or possibly in some combination that lifts some computations out of the shader.) Then you'll multiply each vertex by the appropriate matrix (or combo of matrices).
There are other types of shaders beyond those, but those two are the most common.
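To show what "multiply each vertex by the appropriate matrix" amounts to, here is a small CPU sketch of the same arithmetic a vertex shader would do. `Mat4`, `Vec4`, `multiply`, `transform` and `makeMvp` are hypothetical helpers written for this example (a real project would more likely use a math library), and the column-major layout is an assumption chosen to match what `glUniformMatrix4fv` expects when its transpose argument is `GL_FALSE`.

```cpp
#include <array>
#include <cassert>

// Column-major 4x4 matrix: element (row, col) lives at index col * 4 + row.
using Mat4 = std::array<float, 16>;
using Vec4 = std::array<float, 4>;

Mat4 identity() {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    return m;
}

Mat4 translation(float x, float y, float z) {
    Mat4 m = identity();
    m[12] = x; m[13] = y; m[14] = z;
    return m;
}

// r = a * b, so that (a * b) * v == a * (b * v).
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
    return r;
}

Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int row = 0; row < 4; ++row)
        for (int k = 0; k < 4; ++k)
            r[row] += m[k * 4 + row] * v[k];
    return r;
}

// Combining once on the CPU lifts two matrix products out of the per-vertex work.
Mat4 makeMvp(const Mat4& projection, const Mat4& view, const Mat4& model) {
    return multiply(projection, multiply(view, model));
}
```

Sending the three matrices separately and sending the precombined product give the same vertex positions; the precombined form just does the matrix-matrix products once per object instead of once per vertex.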
If I want to render a lot of objects, I must send the MVP matrix and then render the object right after, so I will have code that does send to GPU -> render many times. However, if I'm not wrong again, this is not good practice, because sending information to the GPU is very expensive and costs performance. So a way to get better performance is to send all the information to the GPU and then render all the objects.
I wouldn't get overly worried about performance until you have shaders working properly. Performance can depend on a lot of different factors. One is how often you send or receive data to or from the GPU and how much data you're transferring. Another is how many passes you do for each shader, and another is the size of your textures, geometry, and other resources.
And the million-dollar question is: how can the shader program tell that an MVP matrix belongs to one object and not another?
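It can't, and it doesn't need to. A uniform keeps its value until you change it, so each draw call simply uses whatever matrix was set most recently. The way I've done that in the past is to set the current shader program and uniforms via glUseProgram() and glUniform*(), then upload and draw my geometry for an object, and repeat as necessary for each object or set of objects.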
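The set-uniform-then-draw pattern can be sketched as a plain loop. So that it runs without a GL context, the GL calls are replaced by two callbacks: in a real renderer `setMvp` would wrap `glUniformMatrix4fv` and `draw` would wrap `glDrawArrays`. `RenderObject`, `Backend` and `renderAll` are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <vector>

using Mat4 = std::array<float, 16>;

// Hypothetical per-object record: its own MVP plus where its vertices live.
struct RenderObject {
    Mat4 mvp;
    int firstVertex;
    int vertexCount;
};

// Stand-ins for the GL calls so the sketch runs without a context.
struct Backend {
    std::function<void(const Mat4&)> setMvp;
    std::function<void(int first, int count)> draw;
};

void renderAll(const std::vector<RenderObject>& objects, const Backend& gl) {
    for (const RenderObject& obj : objects) {
        gl.setMvp(obj.mvp);                         // uniform now holds this object's matrix
        gl.draw(obj.firstVertex, obj.vertexCount);  // this draw sees that value
    }
}
```

The shader never needs to know which object it is drawing; the ordering of the two calls on the CPU side is what pairs each matrix with its geometry.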

Is it possible to reuse glsl vertex shader output later?

I have a huge mesh (100k triangles) that needs to be drawn a few times and blended together every frame. Is it possible to reuse the vertex shader output of the first pass over the mesh, and skip the vertex stage on later passes? I am hoping to save some cost on the vertex pipeline and rasterization.
I'm targeting OpenGL 3.0 and can use features like transform feedback.
I'll answer your basic question first, then answer your real question.
Yes, you can store the output of vertex transformation for later use. This is called Transform Feedback. It requires OpenGL 3.x-class hardware or better (aka: DX10-hardware).
The way it works is in two stages. First, you have to set your program up to have feedback-based varyings. You do this with glTransformFeedbackVaryings. This must be done before linking the program, in a similar way to things like glBindAttribLocation.
Once that's done, you need to bind buffers (given how you set up your transform feedback varyings) to GL_TRANSFORM_FEEDBACK_BUFFER with glBindBufferRange, thus setting up which buffers the data are written into. Then you start your feedback operation with glBeginTransformFeedback and proceed as normal. You can use a primitive query object to get the number of primitives written (so that you can draw it later with glDrawArrays), or if you have 4.x-class hardware (or AMD 3.x hardware, all of which supports ARB_transform_feedback2), you can render without querying the number of primitives. That would save time.
Now for your actual question: it's probably not going to buy you any real performance.
You're drawing terrain. And terrain doesn't really get any transformation. Typically you have a matrix multiplication or two, possibly with normals (though if you're rendering for shadow maps, you don't even have that). That's it.
Odds are very good that if you shove 100,000 vertices down the GPU with such a simple shader, you've probably saturated the GPU's ability to render them all. You'll likely bottleneck on primitive assembly/setup, and that's not getting any faster.
So you're probably not going to get much out of this. Feedback is generally used for either generating triangle data for later use (effectively pseudo-compute shaders), or for preserving the results from complex transformations like matrix palette skinning with dual-quaternions and so forth. A simple matrix multiply-and-go will barely be a blip on the radar.
You can try it if you like. But odds are you won't have a vertex-processing problem to begin with. Generally, the best solution is to employ some form of deferred rendering, so that you only have to render an object once + X for every shadow it casts (where X is determined by the shadow mapping algorithm). And since shadow maps require different transforms, you wouldn't gain anything from feedback anyway.

Draw a bunch of elements generated by CUDA/OpenCL?

I'm new to graphics programming, and need to add on a rendering backend for a demo we're creating. I'm hoping you guys can point me in the right direction.
Short version: Is there any way to send OpenGL an array of data for distinct elements, without having to issue a draw command for each element distinctly?
Long version: We have a CUDA program (will eventually be OpenCL) which calculates a bunch of data for a bunch of objects for us. We then need to render these objects using, e.g., OpenGL.
The CUDA kernel can generate our vertices, and using OpenGL interop, it can put these in an OpenGL VBO without having to transfer the data back to host memory. But the problem is that we have a lot of distinct objects (upwards of a million is our goal). It seems like our best bet here is to allocate one VBO and put every object's vertices into it. Then we can call glDrawArrays with the offset and length of each element inside that VBO.
However, each object may have a variable number of vertices (though the total vertices in the scene can be bounded.) I'd like to avoid having to transfer a list of start indices and lengths from CUDA -> CPU every frame, especially given that these draw commands are going right back to the GPU.
Is there any way to pack a buffer with data such that we can issue only one call to OpenGL to render the buffer, and it can render a number of distinct elements from that buffer?
(Hopefully I've also given enough info to avoid a XY problem here.)
One way would be to stop thinking of these as individual objects and treat them as a single large object drawn with a single draw call. The question is: what data distinguishes the objects from each other, i.e. what do you change between the individual calls to glDrawArrays/glDrawElements?
If it is something simple, like a color, it would probably be easier to supply it as an additional per-vertex attribute. This way you can render all objects as one single large object using a single draw call, with the individual sub-objects (which really only exist conceptually now) colored correctly. The memory cost of the additional attribute may be well worth it.
If it is something a little more complex (like a texture), you may still be able to index it using an additional per-vertex attribute, being either an index into a texture array (as texture arrays should be supported on CUDA/OpenCL-able hardware) or a texture coordinate into a particular subregion of a single large texture (a so-called texture atlas).
But if the difference between those objects is something more complex, such as a different shader, you may really need to render individual objects and make individual draw calls. Even then, you don't necessarily need to make a round trip to the CPU. With the ARB_draw_indirect extension (core since GL 4.0, I think; it may also be supported on GL 3 hardware, and thus on CUDA/CL-capable hardware, but I don't know) you can source the arguments of a glDrawArrays/glDrawElements call from an additional buffer, into which you can write with CUDA/CL like any other GL buffer. So you can assemble the offset and length information of each individual object on the GPU and store it in a single buffer. Then you do your glDrawArraysIndirect loop, offsetting into this single draw-indirect buffer (with the offset between the individual objects now being constant).
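The contents of such a draw-indirect buffer can be sketched as follows. The struct mirrors the field order OpenGL defines for `glDrawArraysIndirect`; the function `buildIndirectCommands` is a hypothetical CPU stand-in for what the CUDA/CL kernel would write, and it assumes the objects are packed back to back in the VBO.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field order defined by OpenGL for glDrawArraysIndirect.
struct DrawArraysIndirectCommand {
    std::uint32_t count;          // number of vertices of this object
    std::uint32_t instanceCount;  // 1 for a plain, non-instanced draw
    std::uint32_t first;          // index of the object's first vertex
    std::uint32_t baseInstance;   // must be zero before GL 4.2
};

// One command per object, with `first` computed as a running prefix sum.
std::vector<DrawArraysIndirectCommand>
buildIndirectCommands(const std::vector<std::uint32_t>& vertexCounts) {
    std::vector<DrawArraysIndirectCommand> cmds;
    cmds.reserve(vertexCounts.size());
    std::uint32_t first = 0;
    for (std::uint32_t n : vertexCounts) {
        cmds.push_back({n, 1u, first, 0u});
        first += n;  // next object starts right after this one
    }
    return cmds;
}
```

Since every command has the same size, object `i` sits at byte offset `i * sizeof(DrawArraysIndirectCommand)` in the buffer, which is the constant stride mentioned above.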
But if the only reason for issuing multiple draw calls is that you want to render the objects as individual GL_TRIANGLE_STRIPs or GL_TRIANGLE_FANs (or, god forbid, GL_POLYGONs), you may want to reconsider and just use a bunch of GL_TRIANGLES so that you can render all objects in a single draw call. The (possible) time and memory savings from using triangle strips are likely to be outweighed by the overhead of multiple draw calls, especially when rendering many small triangle strips. If you really want to use strips or fans, you may want to introduce degenerate triangles (by repeating vertices) to separate them from each other, even when drawn with a single draw call. Or you may look into the glPrimitiveRestartIndex function introduced with GL 3.1.
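If the data already exists as strips, converting them into one GL_TRIANGLES index list is straightforward. The sketch below shows one way to do it; `appendStripAsTriangles` is a hypothetical helper name, and it assumes the strips are given as index lists into a shared vertex buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append one triangle strip to a GL_TRIANGLES index list.
// Every second triangle swaps its first two indices so that all emitted
// triangles keep the winding order of the strip's first triangle.
void appendStripAsTriangles(const std::vector<std::uint32_t>& strip,
                            std::vector<std::uint32_t>& triangles) {
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        if (i % 2 == 0) {
            triangles.insert(triangles.end(),
                             {strip[i], strip[i + 1], strip[i + 2]});
        } else {
            triangles.insert(triangles.end(),
                             {strip[i + 1], strip[i], strip[i + 2]});
        }
    }
}
```

Calling it once per strip on the same output vector yields a single index list that one glDrawElements call can render, at the cost of three indices per triangle instead of roughly one.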
Probably not optimal, but you could make a single glDrawArrays call on your whole buffer.
If you use GL_TRIANGLES, you can fill your buffer with zeroes and write only the needed vertices in your kernel. This way "empty" regions of your buffer will be drawn as zero-area polygons (degenerate polygons, which are not drawn at all).
If you use GL_TRIANGLE_STRIP, you can do the same, but you'll have to duplicate your first vertex in order to make a fake triangle between (0,0,0) and your mesh.
This may seem like overkill, but:
- You'll have to be able to handle as many vertices anyway
- degenerate triangles use no fillrate, so they are almost free (the vertex shader is still computed, though)
A probably better solution would be to use glDrawElements instead: in your kernel, you also generate an index list for your whole buffer, which lets you skip the empty regions of the buffer completely.
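A minimal sketch of that index list, written as CPU code standing in for the kernel: it assumes a hypothetical layout in which each object owns a fixed-size slot in the vertex buffer but fills only part of it. `Slot` and `buildCompactIndices` are made-up names for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout: `base` is the slot's first vertex, `used` is how many
// vertices of the slot the kernel actually wrote this frame.
struct Slot { std::uint32_t base; std::uint32_t used; };

// Emit indices for the used vertices only, so the draw never touches the gaps.
std::vector<std::uint32_t> buildCompactIndices(const std::vector<Slot>& slots) {
    std::vector<std::uint32_t> indices;
    for (const Slot& s : slots)
        for (std::uint32_t v = 0; v < s.used; ++v)
            indices.push_back(s.base + v);
    return indices;
}
```

The resulting index count is what would be passed to glDrawElements, so unused vertices cost neither vertex shader invocations nor degenerate triangles.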