I am working on a project that requires drawing a lot of data as it is acquired by an ADC...something like 50,000 lines per frame on a monitor 1600 pixels wide. It runs great on a system with a 2007-ish Quadro FX 570, but basically can't keep up with the data on machines with Intel HD 4000 class chips. The data load is 32 channels of 200 Hz data received in batches of 5 samples per channel 40 times per second. So, in other words, the card only needs to achieve 40 frames per second or better.
I am using a single VBO for all 32 channels, with space for 10,000 vertices per channel. The VBO is essentially treated as a series of per-channel ring buffers. When the data comes in, I decimate it based on the time scale in use, tracking the min/max for each channel. When enough data has been received for a single pixel column, the next two vertices are set in the VBO for each channel and a new frame is rendered.
I call glMapBuffer() once to access the data, update all of the channels, call glUnmapBuffer(), and then render as necessary.
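In outline, the update step looks something like this (a simplified sketch, not my actual code; the struct, constants, and ring layout are just for illustration):

#define CHANNELS     32
#define VERTS_PER_CH 10000

typedef struct { GLfloat x, y; } Vertex2;

/* Write one min/max vertex pair per channel into its ring buffer region. */
void updateChannels(GLuint vbo, int writePos,
                    const GLfloat *mins, const GLfloat *maxs, GLfloat t)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    Vertex2 *v = (Vertex2 *)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int ch = 0; ch < CHANNELS; ++ch) {
        Vertex2 *ring = v + ch * VERTS_PER_CH;   /* this channel's region */
        ring[writePos].x     = t;  ring[writePos].y     = mins[ch];
        ring[writePos + 1].x = t;  ring[writePos + 1].y = maxs[ch];
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);
}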
I manually calculate the transform matrix ahead of time (using an orthographic transform calculated in a non-generic way to reduce multiplications), and the vertex shader looks like:
#version 120
varying vec4 _outColor;
uniform vec4 _lBound=vec4(-1.0);
uniform vec4 _uBound=vec4(1.0);
uniform mat4 _xform=mat4(1.0);
attribute vec2 _inPos;
attribute vec4 _inColor;
void main()
{
    gl_Position=clamp(_xform*vec4(_inPos, 0.0, 1.0), _lBound, _uBound);
    _outColor=_inColor;
}
The _lBound, _uBound, and _xform uniforms are updated once per channel. So, 32 times per frame. The clamp is used to limit certain channels to a range of y-coordinates on the screen.
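For illustration, the per-channel loop amounts to something like this (uniform locations, handles, and the exact line primitive are simplified):

for (int ch = 0; ch < CHANNELS; ++ch) {
    glUniform4fv(locLBound, 1, lBound[ch]);
    glUniform4fv(locUBound, 1, uBound[ch]);
    glUniformMatrix4fv(locXform, 1, GL_FALSE, xform[ch]);
    glDrawArrays(GL_LINE_STRIP, ch * VERTS_PER_CH, vertsInUse);
}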
The fragment shader is simply:
#version 120
varying vec4 _outColor;
void main()
{
    gl_FragColor=_outColor;
}
There is other stuff being rendered to the screen (channel labels, for example, using quads and a texture atlas), but profiling in gDEBugger seems to indicate that the line rendering takes the overwhelming majority of the time per frame.
Still, 50,000 lines does not seem like a horrendously large number to me.
So, after all of that, the question is: are there any tricks to speeding up line drawing? I tried rendering the lines to the stencil buffer and then clipping a single quad, but that was slower. I thought about drawing the lines to a texture, then drawing a quad with that texture, but that does not seem scalable, or even faster, given the constant upload of large textures. I saw a technique that stores the y values in a single-row texture, but that seems more like a memory optimization than a speed optimization.
Mapping a VBO might slow you down, because the driver may need to sync the GPU with the CPU. A more performant way is to just throw your data at the GPU, so the CPU and GPU can run more independently:
Recreate the VBO every time; do create it with GL_STATIC_DRAW.
If you need to map your data, do NOT map it as readable; use GL_WRITE_ONLY.
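A minimal sketch combining both tips (names and sizes are illustrative; memcpy needs <string.h>):

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STATIC_DRAW);  /* orphan the old store   */
void *dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);    /* no CPU/GPU sync needed */
memcpy(dst, vertices, size);                                /* write the fresh data   */
glUnmapBuffer(GL_ARRAY_BUFFER);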
Thanks, everyone. I finally settled on blitting between framebuffers backed by renderbuffers. Works well enough. Many suggested using textures, and I may go that route in the future if I eventually need to draw behind the data.
If you're just scrolling a line graph (GDI style), just draw the new column on the CPU and use glTexSubImage2D to update a single column in the texture. Draw it as a pair of quads and update the st coordinates to handle scrolling/wrapping.
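For example, replacing one column might look like this (texture name and pixel format are illustrative):

glBindTexture(GL_TEXTURE_2D, graphTex);
glTexSubImage2D(GL_TEXTURE_2D, 0,
                x, 0,          /* xoffset: the column being replaced */
                1, height,     /* width 1: a single pixel column     */
                GL_RGBA, GL_UNSIGNED_BYTE, columnPixels);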
If you need to update all the lines all the time, use a VBO created with GL_DYNAMIC_DRAW and update it with glBufferSubData.
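Something like this (assuming 2 floats per vertex; names illustrative):

glBindBuffer(GL_ARRAY_BUFFER, vbo);                /* created with GL_DYNAMIC_DRAW */
glBufferSubData(GL_ARRAY_BUFFER,
                firstVertex * 2 * sizeof(GLfloat), /* byte offset of changed range */
                vertexCount * 2 * sizeof(GLfloat), /* byte size of changed range   */
                newVertices);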
I'm trying to take advantage of a GPU's parallelism to make an image processing application. I have a shader which takes two textures and, based on some uniform variables, computes an output texture. But instead of a transparency alpha value, each texture pixel needs an extra metadata byte that is mandatory in the computation.
So I am considering running the shader twice each frame: once to compute the Dynamic Metadata as a single-byte texture, and once to calculate the resulting Paint Texture, which I need to be 3 bytes (to limit memory usage, as there might be quite a few such textures loaded at once).
I find the above problem a bit complicated: I've used OpenGL to paint to the screen, but this time I need to paint to two different textures, which I do not know how to do. Besides, the gl_FragColor built-in variable's type is vec4, but I need different output values.
So, to sum it up a little:
Is it possible for the fragment shader to output anything other than a vec4?
Is it possible to save to two different textures with a single call?
Is it possible to make an editable texture to store changes, until the editing ends and the data has to be passed back to the CPU?
What OpenGL calls would be most useful for the above?
The Paint Texture should also be retrievable, to be shown on the screen.
The above could very easily be done by blitting textures on the CPU: I could keep all the relevant data on the CPU, do all the work 60 times/sec, and update the relevant texture by passing the data from the CPU to the GPU. For changing relatively small regions of a texture each frame (roughly 20% of textures around 512x512 in size), would you consider the above approach worth the trouble?
It depends on which version of OpenGL you use.
The latest OpenGL 4+ does not have a gl_FragColor variable; instead it lets you write any number (up to a supported maximum) of output colors from the fragment shader, each sent to the corresponding framebuffer color attachment:
layout(location = 0) out vec4 OUT0;
layout(location = 1) out float OUT1;
That will write OUT0 to GL_COLOR_ATTACHMENT0 and OUT1 to GL_COLOR_ATTACHMENT1 of the currently bound framebuffer.
However, considering that you use gl_FragColor, you are on some old version of OpenGL. I'm not proficient with the legacy OpenGL versions, but you can check whether your implementation supports the GL_ARB_draw_buffers extension and/or the gl_FragData[] output variable.
Also, as stated, it's unclear why you can't use a single RGBA texture and use its alpha channel for that metadata.
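For completeness, the FBO side of that setup might look like this (paintTex and metaTex are assumed to be already-allocated textures):

GLuint fbo;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, paintTex, 0);  /* 3-byte paint texture    */
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
                       GL_TEXTURE_2D, metaTex, 0);   /* 1-byte metadata texture */
const GLenum bufs[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
glDrawBuffers(2, bufs);  /* OUT0 -> attachment 0, OUT1 -> attachment 1 */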
I am trying to make a particle system where, instead of using a texture, each quad is rendered by a fragment shader like the one below.
uniform vec3 color;
uniform float radius;
uniform float edge;
uniform vec2 position;
uniform float alpha;
void main()
{
    float dist = distance(gl_FragCoord.xy, position);
    float intensity = smoothstep(dist-edge, dist+edge, radius);
    gl_FragColor = vec4(color, intensity*alpha);
}
Each particle is an object of a C++ class that wraps this shader and all the variables together and draws it. I use openFrameworks, so the exact OpenGL calls are hidden from me.
I've read that particle systems are usually done with textures; however, I prefer to do it like this because this way I can add more functionality to the particles. The problem is that after only 30 particles, the framerate drops dramatically. Is there a more efficient way of doing this? I was thinking of putting the variables for each particle into arrays and sending those to one fragment shader that renders all the particles in one go. But that would fix the number of particles, because the uniform arrays in the shader would have to be declared with a size beforehand.
Are non-texture-based particle systems simply too inefficient to be realistic, or is there a way of designing this that I'm overlooking?
The reason textures are used is because you can move the particles using the GPU, which is very fast. You'd double buffer a texture which stores particle attributes (like position) per texel, and ping-pong data between them, using a framebuffer object to draw to them and the fragment shader to do the computation, rendering a full screen polygon. Then you'd draw an array of quads and read the texture to get the positions.
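A rough sketch of that ping-pong loop (all names are illustrative, and drawFullScreenQuad stands in for whatever renders the update pass):

GLuint posTex[2];  /* double-buffered particle positions, one particle per texel */
GLuint fbo;
int src = 0, dst = 1;

void updateParticles(void)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, posTex[dst], 0);  /* write side */
    glBindTexture(GL_TEXTURE_2D, posTex[src]);              /* read side  */
    drawFullScreenQuad();  /* fragment shader computes the new positions */
    src ^= 1; dst ^= 1;    /* swap read/write for the next frame         */
}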
Instead of a texture storing attributes you could pack them directly into your VBO data. This gets complicated because you have multiple vertices per particle, but can still be done a number of ways. glVertexBindingDivisor (requires instancing), drawing points, or using the geometry shader come to mind. Transform feedback or image_load_store could be used to update VBOs with the GPU instead of textures.
If you move particles with the CPU, you also need to copy the data to the GPU every frame. This is quite slow, but nowhere near slow enough for 30 particles to be a problem. The issue is more likely the number of draw calls: each time you draw something, GL does a tonne of work to set up the operation. Setting uniform values (nearly) per primitive is very expensive for the same reason. Particles work well when a manager processes arrays of data all at once; they parallelize very well in such cases. Generally their computation is cheap, and it all comes down to minimizing memory traffic and keeping good locality.
If you want to keep particle updating CPU side, I'd go with this:
Create a VBO full of -1 to 1 quads (two triangles, 6 verts) and an element array buffer to draw them. This buffer remains static in GPU memory and is what you use to draw all the particles at once with a single draw call (see the sketch after this list).
Create a texture (could be 1D) or VBO (if you choose one of the above methods) that contains positions and particle attributes that update pretty much every frame (using glTexImage1D/glBufferData/glMapBuffer).
Create another texture with particle attributes that rarely update (e.g. only when you spawn them). You can send updates with glTexSubImage1D/glBufferSubData/glMapBufferRange.
When you draw the particles, read the position and other attributes from the texture (or from the vertex attributes if you used VBOs) and use the -1 to 1 quads in the main geometry VBO as offsets to your position.
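As a sketch, the instanced variant of the above would be along these lines (buffer names and attribute locations are assumptions; glVertexAttribDivisor needs GL 3.3 or ARB_instanced_arrays):

glBindBuffer(GL_ARRAY_BUFFER, quadVBO);  /* static -1..1 quad vertices */
glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, 0);
glEnableVertexAttribArray(0);

glBindBuffer(GL_ARRAY_BUFFER, posVBO);   /* per-particle positions, updated each frame */
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 0, 0);
glEnableVertexAttribArray(1);
glVertexAttribDivisor(1, 1);             /* advance once per instance, not per vertex  */

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, quadEBO);  /* 6 indices for the quad */
glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0, numParticles);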
I'm using some standard GLSL (version 120) vertex and fragment shaders to simulate LIDAR. In other words, instead of just returning a color at each x,y position (each pixel, via the fragment shader), it should return color and distance.
I suppose I don't actually need all of the color bits, since I really only want the intensity; so I could store the distance in gl_FragColor.b, for example, and use .rg for the intensity. But then I'm not entirely clear on how I get the value back out again.
Is there a simple way to return values from the fragment shader? I've tried varyings, but it seems the fragment shader can't write to anything other than gl_FragColor.
I understand that some people use the GLSL pipeline for general-purpose (non-graphics) GPU processing, and that might be an option, except that I still want to render my objects normally.
OpenGL already returns this "distance calculation" via the depth buffer, although it's not linear. You can simply create a frame buffer object (FBO), attach colour and depth buffers, render to it, and you have the result sitting in the depth buffer (although you'll have to undo the depth transformation). This is the easiest option to program provided you are familiar with the depth calculations.
Another method, as you suggest, is storing the value in a colour buffer. You don't have to use the main colour buffer, because then you'd lose your colour or have to render twice. Instead, attach a second render target (texture) to your FBO (GL_COLOR_ATTACHMENT1) and use gl_FragData[0] for the normal colour and gl_FragData[1] for your distance (for newer GL versions you should be declaring out variables in the fragment shader). It depends on the precision you need, but you'll probably want to make the distance texture a 32-bit float (GL_R32F) and write to gl_FragData[1].r.
- This is a decent place to start: http://www.opengl.org/wiki/Framebuffer_Object
Yes, GLSL can be used for compute purposes, especially with ARB_image_load_store and NVIDIA's bindless graphics. You even have access to shared memory via compute shaders (though I've never got one faster than 5 times slower). As @Jherico says, fragment shaders generally output to a single place in a framebuffer attachment/render target, but recent features such as image units (ARB_image_load_store) allow you to write to arbitrary locations from a shader. It's probably overkill, and slower, but you could also write your distances to a buffer via image units.
Finally, if you want the data back on the host (CPU accessible) side, use glGetTexImage with your distance texture (or glMapBuffer if you decided to use image units).
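For example, a GL_R32F distance texture can be read back like this (names assumed; malloc needs <stdlib.h>):

float *dist = (float *)malloc(width * height * sizeof(float));
glBindTexture(GL_TEXTURE_2D, distanceTex);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RED, GL_FLOAT, dist);  /* pull the float data back */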
Fragment shaders output to a rendering buffer. If you want to use the GPU for computing and fetching data back into host memory, you have a few options:
Create a framebuffer and attach a texture to it to hold your data. Once the image has been rendered, you can read the information from the texture back into host memory.
Use CUDA, OpenCL, or an OpenGL compute shader to write the data into an arbitrarily bound buffer, and read back the buffer contents.
I was just looking at my animated sprite code, and I got an idea.
The animation is done by altering texture coords. There is a buffer object that holds the current frame's texture coords; as a new frame is requested, new texture coords are fed into the buffer with glBufferData().
What if we pre-calculate the texture coords of all animation frames, put them in the BO, and create an index buffer object holding just the number of the frame we need to draw:
GLbyte cur_frames = 0; //1,2,3 etc
Now, when we need to update the animation, all we have to update is 1 byte of our IBO with glBufferData (instead of 4 vertices * 2 texture coords * sizeof(GLfloat) = 32 bytes for a quad drawn with TRIANGLE_STRIP); we don't need to hold any texture coords after initializing our BO.
Am I missing something? What are the cons?
Edit: of course, your vertex data may not be GLfloat; this is just an example.
As Tim correctly states, this depends on your application. Let's talk some numbers: you mention both IBOs and inserting the texture coordinates for all frames into one VBO, so let's take a look at the impact of each.
Suppose a typical vertex looks like this:
struct vertex
{
    float x, y, z;   // position
    float tx, ty;    // texture coordinates
};
I added a z-component, but the calculations are similar if you don't use it, or if you have more attributes. So it is clear such a vertex takes 20 bytes.
Let's assume a simple sprite: a quad consisting of 2 triangles. In a very naive mode you just send 2x3 = 6 vertices, i.e. 6*20 = 120 bytes, to the GPU.
In comes indexing: you actually have only four vertices, 1,2,3,4, and two triangles, 1,2,3 and 2,3,4. So we send two buffers to the GPU: one containing the 4 vertices (4*20 = 80 bytes) and one containing the list of indices for the triangles ([1,2,3,2,3,4]). Let's say we can do this in 2 bytes per index (65535 indices should be enough), which comes down to 6*2 = 12 bytes. In total 92 bytes: we saved 28 bytes, or about 23%. Also, when rendering, the GPU is likely to process each vertex only once in the vertex shader, which saves us some processing power as well.
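In code, the indexed version boils down to this (0-based indices; buffer names assumed):

struct vertex quad[4];                   /* the 4 shared vertices: 4*20 = 80 bytes */
GLushort idx[6] = { 0, 1, 2, 1, 2, 3 };  /* the two triangles:     6*2  = 12 bytes */

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(idx), idx, GL_STATIC_DRAW);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0);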
So, now you want to add all texture coordinates for all animations at once. The first thing to note is that a vertex in indexed rendering is defined by all its attributes; you can't split it into one index for positions and another for texture coordinates. So if you want to add extra texture coordinates, you will have to repeat the positions. Each 'frame' that you add will therefore add 80 bytes to the VBO and 12 bytes to the IBO. Suppose you have 64 frames: you end up with 64*(80+12) = 5888 bytes per sprite. Say you have 1000 sprites, then this becomes about 6MB. That does not seem too bad, but note that it scales quite rapidly: each frame adds to the size, and so does each attribute (because they have to be repeated).
So, what does it gain you?
You don't have to send data to the GPU dynamically. Note that updating the whole VBO would require sending 80 bytes (640 bits) per sprite. Suppose you need to do this for 1000 sprites per frame at 30 frames per second: you get 19,200,000 bps, or 19.2 Mbps (no overhead included). This is quite low (e.g. 16x PCI-e can handle 32 Gbps), but it could be worthwhile if you have other bandwidth issues (e.g. due to texturing). Also, if you construct your VBOs carefully (e.g. separate VBOs, or non-interleaved), you could reduce it to updating only the texture part, which is only 32 bytes per sprite in the above example; this could reduce bandwidth even more.
You don't have to waste time computing the next frame's texture position. However, this is usually just a few additions and a few ifs to handle the edges of your texture, so I doubt you will gain much CPU power here.
Finally, you also have the option of simply splitting the animation image over a lot of textures. I have absolutely no idea how this scales, but in this case you don't even have to work with more complex vertex attributes; you just activate another texture for each frame of the animation.
Edit: another method could be to pass the frame number in a uniform and do the calculations in your fragment shader, before sampling. Setting a single integer uniform shouldn't be much of an overhead.
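That approach could look roughly like this (a hypothetical GLSL 1.20 fragment shader, assuming all frames sit in one row of an atlas):

const char *fsSrc =
    "#version 120\n"
    "uniform sampler2D atlas;\n"
    "uniform float frame;     /* current frame index               */\n"
    "uniform float numFrames; /* frames laid out in one atlas row  */\n"
    "varying vec2 baseST;     /* 0..1 coords within a single frame */\n"
    "void main()\n"
    "{\n"
    "    vec2 st = vec2((baseST.s + frame) / numFrames, baseST.t);\n"
    "    gl_FragColor = texture2D(atlas, st);\n"
    "}\n";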
For a modern GPU, accessing/unpacking single bytes is not necessarily faster than accessing integer types or even vectors (register sizes, load instructions, etc.). You do save memory, and therefore memory bandwidth, but I wouldn't expect much of a difference relative to all the other vertex attribute array accesses.
I think the fastest way to supply a frame index for animated sprites is either a uniform or, if multiple sprites have to be rendered with one draw call, instanced vertex attribute arrays. With the latter, you can provide a single index for fixed-size subsequences of vertices.
For example, when drawing 'sprite-quads', you'd have one frame index fetch per 4 vertices.
A third approach would be a buffer-texture, when using instanced rendering.
I recommend a global (shared) uniform for the time/frame index calculation, so you can compute the animation index on the fly within your shader; this doesn't require you to update the index buffer (which then just represents the relative animation state among sprites).
I need to render an influence map in OpenGL. At present I have 100 x 100 quads rendering, each with a set color, to represent the influence at each point on the map. I've been advised to change my rendering method to a single quad with a texture, and let the rendering pipeline do the heavy lifting.
Basic testing has shown that glTexSubImage2D is too slow for setting 10,000 texels per frame. Do you have any suggestions? Would it be better to create an entirely new texture each frame? My influence map consists of normalized floats (0.0 to 1.0) that are converted to grayscale colors (1.0f = white).
Thanks :D
Are you currently updating each of the 10000 texels separately, with 10000 calls of glTexSubImage2D?
Just keep one 100x100 grayscale float texture (an array of 10,000 floats) in RAM, update the values directly there, and then send the whole thing to the GPU with one glTexImage2D call. You could also use buffer objects to let the transfer happen in the background, but it should be unnecessary, since you are not moving very large amounts of data.
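A sketch of that one-call upload (the texture name and GL_R32F format are assumptions; on older GL, a luminance float format would do):

float influence[100 * 100];  /* the map, updated directly on the CPU */
glBindTexture(GL_TEXTURE_2D, mapTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_R32F, 100, 100, 0,
             GL_RED, GL_FLOAT, influence);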