I am trying to make a particle system where, instead of using a texture, each quad is rendered by a fragment shader such as the one below.
uniform vec3 color;
uniform float radius;
uniform float edge;
uniform vec2 position;
uniform float alpha;
void main()
{
    float dist = distance(gl_FragCoord.xy, position);
    float intensity = smoothstep(dist-edge, dist+edge, radius);
    gl_FragColor = vec4(color, intensity*alpha);
}
Each particle is an object of a c++ class that wraps this shader and all the variables together and draws it. I use openFrameworks so the exact openGL calls are hidden from me.
I've read that usually particle systems are done with textures, however, I prefer to do it like this because this way I can add more functionality to the particles. The problem is that after only 30 particles, the framerate drops dramatically. Is there a more efficient way of doing this? I was thinking of maybe putting the variables for each particle into an array and sending these arrays into one fragment shader that then renders all particles in one go. But this would mean that the amount of particles would be fixed because the uniform arrays in the shader would have to be declared beforehand.
Are non-texture-based particle systems simply too inefficient to be realistic, or is there a way of designing this that I'm overlooking?
The reason textures are used is because you can move the particles using the GPU, which is very fast. You'd double buffer a texture which stores particle attributes (like position) per texel, and ping-pong data between them, using a framebuffer object to draw to them and the fragment shader to do the computation, rendering a full screen polygon. Then you'd draw an array of quads and read the texture to get the positions.
Instead of a texture storing attributes you could pack them directly into your VBO data. This gets complicated because you have multiple vertices per particle, but can still be done a number of ways. glVertexBindingDivisor (requires instancing), drawing points, or using the geometry shader come to mind. Transform feedback or image_load_store could be used to update VBOs with the GPU instead of textures.
If you move particles with the CPU, you also need to copy the data to the GPU every frame. This is quite slow, but nowhere near slow enough for 30 particles to be a problem. The slowdown is probably due to the number of draw calls. Each time you draw something, there's a tonne of stuff GL does to set up the operation. Setting uniform values (nearly) per primitive is very expensive for the same reason. Particles work well if you have arrays of data that get processed by a manager all at once. They parallelize very well in such cases. Generally their computation is cheap, and it all comes down to minimizing memory use and keeping good locality.
If you want to keep particle updating CPU side, I'd go with this:
Create a VBO full of -1 to 1 quads (two triangles, 6 verts) and element array buffer to draw them. This buffer will remain static in GPU memory and is what you use to draw the particles all at once with a single draw call.
Create a texture (could be 1D) or VBO (if you choose one of the above methods) that contains positions and particle attributes that update pretty much every frame (using glTexImage1D/glBufferData/glMapBuffer).
Create another texture with particle attributes that rarely update (e.g. only when you spawn them). You can send updates with glTexSubImage1D/glBufferSubData/glMapBufferRange.
When you draw the particles, read position and other attributes from the texture (or attributes if you used VBOs) and use the -1 to 1 quads in the main geometry VBO as offsets to your position.
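The CPU side of the steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the Particle and ParticleVertex structs and the buildParticleVertices function are hypothetical names, and for brevity the sketch interleaves the -1 to 1 corner with the per-particle attributes in one array instead of keeping them in separate buffers. The point is that every particle ends up in one contiguous array that can be uploaded once and drawn with a single call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-particle state kept on the CPU.
struct Particle {
    float x, y;    // centre position
    float radius;  // half-size of the quad
    float alpha;   // opacity
};

// Hypothetical vertex layout: the -1..1 corner plus the particle attributes.
struct ParticleVertex {
    float cornerX, cornerY;  // -1..1 offset, identical for every particle
    float centerX, centerY;  // particle position
    float radius;
    float alpha;
};

// Expand every particle into 4 vertices so the whole system lives in one
// array (one buffer upload, one draw call).
std::vector<ParticleVertex> buildParticleVertices(const std::vector<Particle>& particles)
{
    static const float corners[4][2] = {{-1.f, -1.f}, {1.f, -1.f}, {1.f, 1.f}, {-1.f, 1.f}};
    std::vector<ParticleVertex> out;
    out.reserve(particles.size() * 4);
    for (const Particle& p : particles) {
        assert(p.radius >= 0.f);
        for (const auto& c : corners) {
            out.push_back({c[0], c[1], p.x, p.y, p.radius, p.alpha});
        }
    }
    return out;
}

// What a vertex shader would compute per vertex: centre + corner * radius.
inline float worldX(const ParticleVertex& v) { return v.centerX + v.cornerX * v.radius; }
inline float worldY(const ParticleVertex& v) { return v.centerY + v.cornerY * v.radius; }
```

The resulting array would be uploaded with something like glBufferData each frame; the fragment shader can then use the interpolated corner value instead of gl_FragCoord and a position uniform.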
I understand how you would do this with a 2D buffer. Just draw two triangles that make a quad that fully encompass the 2D buffer space. That way when the fragment shader runs it runs for all the pixels in the buffer.
Question: How would this work for a 3D buffer?
You could just write a lot of triangles for each cross-section of the 3D buffer. However, if you had a texture that was 1x1x256 that would mean that you would need to draw 256*2 triangles for each slice to iterate over all of the pixels. I know this is an extreme case and there are ways of optimizing this solution. However, I feel like there is a more elegant solution that I am missing.
What I am trying to do: I am trying to make a 3D fluid solver that iterates through each of the pixels of the 3D texture and computes its velocity, density, etc. I am trying to do this via the fragment shader because I am using OpenGL 3.0, which does not support compute shaders.
#version 330 core
out vec4 FragColor;
uniform sampler3D volume;
void main()
{
    // compute the fluid density, velocity, and center of mass
    // output the values to the 3D buffer in different color channels:
    FragColor = vec4(density, velocity.xy, centerOfMass);
}
At some point in the fragment shader, you're going to write some statement of the form:
vec4 value = texture(my_texture, TexCoords);
Where TexCoords is the location in my_texture that maps to some particular value in the source texture. But... that mapping is entirely up to you. Nobody's making you use gl_FragCoord.xy / textureSize(my_texture, 0). You could just as easily use vec3(gl_FragCoord.x, Y_value, gl_FragCoord.y) / textureSize(my_texture, 0), which puts the Y component of the fragment location in the Z dimension of the texture. Y_value in this case is a value passed from the outside that tells which vertical slice of the 3D texture to use.
Of course, whatever mapping you use to fetch the data must also be used when you write the data. If you're writing via fragment shader outputs, that poses a problem. A 3D texture can only be attached to an FBO as either a single 2D slice or as a layered set of 2D slices, with these slices always being along the Z dimension of the image. So even if you try to read in slices along the Y dimension, it has to be written in Z slices. So you'd be moving around the location of the data, which makes this non-viable.
If you're using image load/store, then you have no problem. You can just write to the appropriate texel (indeed, you can read from it as an image using integer coordinates, so there's no need to divide by the texture's size).
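To make the slice mapping concrete, here is a small CPU-side sketch of the arithmetic for the case where you iterate along Z, the only axis an FBO attachment lets you write along. The names Vec3, sliceTexCoord and texelFromCoord are made up for this illustration; fragCoordX and fragCoordY are assumed to follow the gl_FragCoord convention of pixel centres at k + 0.5, and layer is assumed to be passed in per pass (for example as a uniform).

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Normalized 3D texture coordinate read by a fragment while Z slice `layer`
// is the render target. Using the same mapping for reads and writes keeps
// every texel in place.
Vec3 sliceTexCoord(float fragCoordX, float fragCoordY, int layer,
                   int width, int height, int depth)
{
    assert(width > 0 && height > 0 && depth > 0);
    assert(layer >= 0 && layer < depth);
    return { fragCoordX / width, fragCoordY / height, (layer + 0.5f) / depth };
}

// Integer texel index along one axis for a normalized coordinate.
int texelFromCoord(float coord, int size)
{
    assert(size > 0);
    return static_cast<int>(coord * size);
}
```

With this mapping a W x H x D volume needs D passes of one full-screen quad each, so the 1x1x256 case costs 256 small draws rather than a mesh of 512 triangles.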
I'm using OpenGL to draw a basic UI with knobs and buttons, which I'm trying to optimize for performance as much as possible. The UI only needs to update on user input, but at the moment I'm updating it at 60fps. Because of this, I want to keep the number of OpenGL calls per frame to an absolute minimum.
The knobs are mapped as textures onto squares in a 2D orthographic projection, via a fragment shader. The rotation of the knob is stored as a uniform float rotation in the shader.
Everything works just fine after I set the uniform value with glUniform1f(…); and call glDrawArrays(GL_TRIANGLES, …); but it seems wasteful to redraw the triangles every time since the position will be the same every time.
Ideally, I would just draw the geometry once, and only update the uniform values with glUniform() calls every time a value changed - but when I remove glDrawArrays from my render loop and update the uniform values before triggering a glFlush(), the fragment shaders stay the same.
Is this by design, and am I supposed to redraw the triangles to trigger a repaint of the fragment shader? Or is there a more performant workaround?
I'm writing/planning a GUI renderer for my OpenGL (core profile) game engine, and I'm not completely sure how I should be representing the vertex data for my quads. So far, I've thought of 2 possible solutions:
1) The straightforward way: every GuiElement keeps track of its own vertex array object, containing 2D screen coordinates and texture coordinates, and updates it (glBufferSubData()) any time the GuiElement is moved or resized.
2) I globally store a single vertex array object, whose co-ordinates are (0,0)(1,0)(0,1)(1,1), and upload a rect as a vec4 uniform (x, y, w, h) every frame, and transform the vertex positions in the vertex shader (vertex.xy *= guiRect.zw; vertex.xy += guiRect.xy;).
I know that method #2 works, but I want to know which one is better.
I do like the idea of option two, however, it would be quite inefficient because it requires a draw call for each element. As was mentioned by other replies, the biggest performance gains lie in batching geometry and reducing the number of draw calls. (In other words, reducing the time your application spends communicating with the GL driver).
So I think the fastest possible way of drawing 2D objects with OpenGL is by using a technique similar to your option one, but adding batching to it.
The smallest possible vertex format you need in order to draw a quadrilateral on the screen is a simple vec2, with 4 vec2s per quadrilateral. The texture coordinates can be generated in a very lightweight vertex shader, such as this:
// xy = vertex position in normalized device coordinates ([-1,+1] range).
attribute vec2 vertexPositionNDC;
varying vec2 vTexCoords;
const vec2 scale = vec2(0.5, 0.5);
void main()
{
    vTexCoords = vertexPositionNDC * scale + scale; // scale vertex attribute to [0,1] range
    gl_Position = vec4(vertexPositionNDC, 0.0, 1.0);
}
In the application side, you can set up a double buffer to optimize throughput, by using two vertex buffers, so you can write to one of them on a given frame then flip the buffers and send it to GL, while you start writing to the next buffer right away:
// Update:
GLuint vbo = vbos[currentVBO];
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, dataOffset, dataSize, data);
// Draw:
glDrawElements(...);
// Flip the buffers:
currentVBO = (currentVBO + 1) % NUM_BUFFERS;
Or another simpler option is to use a single buffer, but allocate new storage on every submission, to avoid blocking, like so:
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, dataSize, data, GL_STREAM_DRAW);
This is a well-known and widely used technique (often called buffer orphaning) for simple asynchronous data transfers.
It is also a good idea to use indexed geometry. Keep an index buffer of unsigned shorts with the vertex buffer. A 2-byte per element IB will reduce data traffic quite a bit and should have an index range big enough for any amount of 2D/UI elements that you might wish to draw.
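As a rough illustration of such an index buffer, the sketch below generates 16-bit indices for a run of quads. The function name buildQuadIndices is made up for this example, and it assumes each quad's 4 vertices are stored consecutively in perimeter order (so that triangles 0,1,2 and 2,3,0 cover the quad). With 4 vertices per quad, unsigned short indices can address up to 16384 quads per buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a static index buffer for `quadCount` quads, 6 indices per quad.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount)
{
    // 16-bit indices can address at most 65536 vertices, i.e. 16384 quads.
    assert(quadCount <= 65536 / 4);
    static const std::uint16_t pattern[6] = {0, 1, 2, 2, 3, 0};
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::size_t base = q * 4;  // first vertex of this quad
        for (std::uint16_t offset : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return indices;
}
```

Since the pattern never changes, this buffer can be filled once at startup for the maximum number of quads and drawn with glDrawElements using GL_UNSIGNED_SHORT, while only the vertex buffer is streamed each frame.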
For GUI elements you could use dynamic vertex buffer (ring buffer) and just upload the geometry every frame because this is quite small amount of geometry data. Then you can batch your GUI element rendering unlike in both of your proposed methods.
Batching is quite important if you render large number of GUI elements, such as text. You can quite easily build a generic GUI rendering system with this which caches the GUI element draw calls and flushes the draws to the GPU upon state changes.
I would recommend doing it like DXUT does it, where it takes the rects from each element, and renders them with a single universal method that takes an element as a parameter, which contains a rect. Each control can have many elements. It adds the four points of the rect to a buffer in a specific order in STREAM_DRAW mode and a constant index buffer. This does draw each rect individually, but performance is not completely vital, because your geometry is simple, and when you are in a dialog, you can usually put the rendering of the 3d scene on the back burner. EDIT: even using this to do HUD items, it has a negligible performance penalty.
This is a simple and organized way to do it. It works well with textures, and there are only two shaders: one for drawing textured components and one for non-textured ones. Then there is a special way to do text.
If you want to see how I did it, you can look at this:
https://github.com/kevinmackenzie/ObjGLUF
It is in GLUFGui.h/.cpp
I am working on a project that requires drawing a lot of data as it is acquired by an ADC...something like 50,000 lines per frame on a monitor 1600 pixels wide. It runs great on a system with a 2007-ish Quadro FX 570, but basically can't keep up with the data on machines with Intel HD 4000 class chips. The data load is 32 channels of 200 Hz data received in batches of 5 samples per channel 40 times per second. So, in other words, the card only needs to achieve 40 frames per second or better.
I am using a single VBO for all 32 channels with space for 10,000 vertices each. The VBO is essentially treated like a series of ring buffers for each channel. When the data comes in, I decimate it based on the time scale being used. So, basically, it tracks the min/max for each channel. When enough data has been received for a single pixel column, it sets the next two vertices in the VBO for each channel and renders a new frame.
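For readers unfamiliar with this kind of decimation, a minimal sketch of the per-channel min/max tracking might look like the following. This is not my actual code; ColumnDecimator and ColumnRange are names invented for the illustration, and it assumes a fixed number of samples per pixel column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Min and max of the samples that fall into one pixel column.
struct ColumnRange { float minValue; float maxValue; };

class ColumnDecimator {
public:
    explicit ColumnDecimator(std::size_t samplesPerColumn)
        : samplesPerColumn_(samplesPerColumn)
    {
        assert(samplesPerColumn > 0);
    }

    // Feed one sample. Returns true and fills `out` once a column is complete.
    bool push(float sample, ColumnRange& out)
    {
        if (count_ == 0) {
            min_ = max_ = sample;
        } else {
            min_ = std::min(min_, sample);
            max_ = std::max(max_, sample);
        }
        if (++count_ < samplesPerColumn_) {
            return false;
        }
        out = {min_, max_};
        count_ = 0;  // start accumulating the next column
        return true;
    }

private:
    std::size_t samplesPerColumn_;
    std::size_t count_ = 0;
    float min_ = 0.f;
    float max_ = 0.f;
};
```

Each completed ColumnRange corresponds to the two vertices (min and max) written into that channel's section of the VBO.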
I use glMapBuffer() to access the data once, update all of the channels, use glUnmapBuffer, and then render as necessary.
I manually calculate the transform matrix ahead of time (using an orthographic transform calculated in a non-generic way to reduce multiplications), and the vertex shader looks like:
#version 120
varying vec4 _outColor;
uniform vec4 _lBound=vec4(-1.0);
uniform vec4 _uBound=vec4(1.0);
uniform mat4 _xform=mat4(1.0);
attribute vec2 _inPos;
attribute vec4 _inColor;
void main()
{
    gl_Position=clamp(_xform*vec4(_inPos, 0.0, 1.0), _lBound, _uBound);
    _outColor=_inColor;
}
The _lBound, _uBound, and _xform uniforms are updated once per channel. So, 32 times per frame. The clamp is used to limit certain channels to a range of y-coordinates on the screen.
The fragment shader is simply:
#version 120
varying vec4 _outColor;
void main()
{
    gl_FragColor=_outColor;
}
There is other stuff being rendered to the screen (channel labels, for example, using quads and a texture atlas), but profiling in gDEBugger seems to indicate that the line rendering takes the overwhelming majority of time per frame.
Still, 50,000 lines does not seem like a horrendously large number to me.
So, after all of that, the question is: are there any tricks to speeding up line drawing? I tried rendering them to the stencil buffer and then clipping a single quad, but that was slower. I thought about drawing the lines to a texture, then drawing a quad with the texture. But that does not seem scalable or even faster, due to constantly uploading large textures. I saw a technique that stores the y values in a single-row texture, but that seems more like a memory optimization than a speed optimization.
Mapping a VBO might slow you down because the driver may need to sync the GPU with the CPU. A more performant way is to just throw your data onto the GPU, so the CPU and GPU can run more independently.
Recreate the VBO every time, and create it with STATIC_DRAW.
If you need to map your data, do NOT map it as readable; use GL_WRITE_ONLY.
Thanks, everyone. I finally settled on blitting between framebuffers backed by renderbuffers. Works well enough. Many suggested using textures, and I may go that route in the future if I eventually need to draw behind the data.
If you're just scrolling a line graph (GDI style), just draw the new column on the CPU and use glTexSubImage2D to update a single column in the texture. Draw it as a pair of quads and update the st coordinates to handle scrolling/wrapping.
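The coordinate bookkeeping for the two quads can be sketched like this. It is an illustration only: QuadSpan, ScrollQuads, scrollQuads and advanceColumn are hypothetical names, screen x is assumed to be normalized to 0..1, and writeColumn is assumed to be the texture column that will be overwritten next (so it holds the oldest data).

```cpp
#include <cassert>

// Horizontal extent of one quad on screen (0..1) and in texture space (s).
struct QuadSpan { float screenX0, screenX1; float s0, s1; };
struct ScrollQuads { QuadSpan older, newer; };

// The texture is used as a ring of columns. The left quad shows columns
// [writeColumn, width), the right quad shows columns [0, writeColumn).
ScrollQuads scrollQuads(int writeColumn, int textureWidth)
{
    assert(textureWidth > 0);
    assert(writeColumn >= 0 && writeColumn < textureWidth);
    const float split = static_cast<float>(writeColumn) / textureWidth;
    return { {0.f, 1.f - split, split, 1.f},
             {1.f - split, 1.f, 0.f, split} };
}

// After uploading one column with glTexSubImage2D, move on to the next one.
int advanceColumn(int writeColumn, int textureWidth)
{
    assert(textureWidth > 0);
    return (writeColumn + 1) % textureWidth;
}
```

When writeColumn is 0 the right quad has zero width, so it can simply be skipped for that frame.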
If you need to update all the lines all the time, use a VBO created with GL_DYNAMIC_DRAW and use glBufferSubData to update the buffer.
I'm making a 2D batch renderer in OpenGL inspired by the XNA/MonoGame interface, but I've run into a small design problem and I'm looking for some input. Currently, you can submit vertex data in four general ways:
void Render(const Sprite& sprite);
void Render(const Shape& shape);
void Render(const Vertex* vertices, unsigned int length);
void Render(const Vertex* vertices, unsigned int length, const Texture* texture);
A sprite contains four vertices, color and texture coordinates while the other three can contain an arbitrary number (the sprite and shape have unique transformations). Everything can be textured or untextured. I want to batch everything to reduce the number of state changes and OpenGL draw calls. I feel it reasonable to assume that most submissions will have shared vertices so that I can use glDrawElements instead of glDrawArrays, but I have trouble figuring out how to batch things properly given what I described above.
The XNA/MonoGame sprite batchers work because they work solely with textured quads/triangles and not arbitrary shapes. Alternatively, I could do like the SFML renderer and issue a draw call for each drawable object, but that defeats the purpose of batch rendering.
I feel like my renderer is trying to "do everything" which is something I want to avoid since it usually quickly becomes too complex in my experience.
What I'm asking is: How could I redesign my renderer? Could I keep separate batch lists for different submissions? Could I modularize my renderer somehow? Should I just allow only textured objects as done in XNA/MonoGame?
Alright, so we need to minimize the number of state changes and draw calls. I'm assuming you're using modern OpenGL, including Vertex Buffer Objects and shaders.
One approach is to ensure that all vertex data has the same format. For example, each vertex has a position, color and texture coordinate (xyz, rgba, uv). If we interleave our vertex data in a VBO, we only need a single call to glVertexAttribPointer and glEnableVertexAttribArray, before rendering.
This means some redundant data for untextured objects, but we get to cram everything into a single batch, which is nice.
To handle the untextured objects, you could either bind a blank white texture and treat it as a textured object. Or, you could have a uniform variable (a float between 0 and 1) in your fragment shader, and blend between texture color and vertex color using the mix function.
To batch sprites and shapes, we should first handle the transformations on the CPU, so that we always upload "world" coordinates to the GPU. This saves us from having to set a transformation uniform for each sprite, which would require an individual draw call per sprite.
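A minimal sketch of such a CPU-side transform is shown below. The Vec2 and Transform2D types and the toWorld function are assumptions made for the example (your sprite and shape classes will have their own representation); the order used here is scale, then rotate, then translate.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

// Hypothetical per-object transform: scale, then rotate, then translate.
struct Transform2D {
    Vec2 translation;
    float rotationRadians;
    Vec2 scale;
};

// Transform local-space vertices to world space before they are appended to
// the shared batch, so no per-object uniform is needed at draw time.
std::vector<Vec2> toWorld(const std::vector<Vec2>& local, const Transform2D& t)
{
    assert(std::isfinite(t.rotationRadians));
    const float c = std::cos(t.rotationRadians);
    const float s = std::sin(t.rotationRadians);
    std::vector<Vec2> world;
    world.reserve(local.size());
    for (const Vec2& v : local) {
        const float sx = v.x * t.scale.x;
        const float sy = v.y * t.scale.y;
        world.push_back({sx * c - sy * s + t.translation.x,
                         sx * s + sy * c + t.translation.y});
    }
    return world;
}
```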
Furthermore, we need to sort by texture whenever possible, as texture bindings are among the more expensive operations you can do.
Our approach basically boils down to the following:
Maintain a single Vertex- and Index Buffer Object to store the data
Keep all vertex data in a single format and interleave the data in the VBO
Sort by texture
Flush the data (draw elements/arrays) in the buffers whenever we change texture (or set the texture-blend uniform, if we go with that option)
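The sorting and flushing logic in the list above could be sketched as follows. Submission, Batch and buildBatches are names invented for this illustration, each returned Batch stands for one draw call, and the sketch assumes draw order between different textures does not matter (with alpha blending or layering you would sort by layer first and by texture within a layer).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One Render() submission, reduced to what the batcher needs to know.
struct Submission { unsigned textureId; std::size_t vertexCount; };

// One draw call: a run of vertices sharing a texture.
struct Batch { unsigned textureId; std::size_t vertexCount; };

// Sort by texture, then start a new batch whenever the texture changes or
// the buffer would overflow.
std::vector<Batch> buildBatches(std::vector<Submission> submissions,
                                std::size_t maxVerticesPerBatch)
{
    assert(maxVerticesPerBatch > 0);
    std::stable_sort(submissions.begin(), submissions.end(),
                     [](const Submission& a, const Submission& b) {
                         return a.textureId < b.textureId;
                     });
    std::vector<Batch> batches;
    for (const Submission& s : submissions) {
        assert(s.vertexCount <= maxVerticesPerBatch);
        const bool needNew =
            batches.empty() ||
            batches.back().textureId != s.textureId ||
            batches.back().vertexCount + s.vertexCount > maxVerticesPerBatch;
        if (needNew) {
            batches.push_back({s.textureId, 0});
        }
        batches.back().vertexCount += s.vertexCount;
    }
    return batches;
}
```

Four sprites alternating between two textures come out as two batches instead of four draw calls, as long as the buffer limit is not hit.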
Getting the data from the CPU to GPU memory can be done in different ways. For example, by first allocating a large enough, empty memory buffer on the GPU, and using glBufferSubData to upload a subset of vertex/index data, whenever you do one of your Render calls.
Remember to profile!
It is very important to profile when doing this kind of work, for example to compare the performance of batching against individual draw calls, or glDrawArrays against glDrawElements. I recommend using gDEBugger, which is a free and very good OpenGL profiler.
Also note that too big of a VBO can hurt your performance. So keep it to a reasonable size, and flush it with a draw call whenever it fills up.