Speeding up the drawing of rotated and scaled images in OpenGL - c++

I am experimenting with several ways to draw a lot of sprites (e.g. for particle system) and I have some inconclusive results. So this is what I tried and what I have:
All of these results are from drawing 25k sprites:
Using regular glBegin/glEnd and using trig to calculate vertex points - 17-18fps.
Using regular glBegin/glEnd, but using glRotate, glTranslate and glScale to transform the sprite - 14-15fps.
Using vertex arrays instead of glBegin and glEnd, but still using trig to calculate vertex point position - 10-11fps.
Using vertex arrays instead of glBegin and glEnd, but using glRotate, glTranslate and glScale to transform the sprite - 10-11fps.
So my question is, why is using vertex arrays slower than using glBegin/glEnd while I have read (here even) that it should be faster?
And why is using my own trigonometry (which in my case is 5 cos, 5 sin, more than 5 divisions, 15 multiplications and about 10 additions/subtractions) faster than using 5 functions (glPushMatrix(), glTranslated(), glRotated(), glScaled(), glPopMatrix())? I thought those were done on the GPU, so it should be much, much faster.
I do get more promising results when drawing fewer sprites. When I draw 10k sprites, vertex arrays can be about 5fps faster, but still inconsistent. Also note that these fps could be higher overall because I have other calculations going on, so I am not really looking at the fps itself but at the difference between them. If vertex arrays plus GL transforms were 5-10fps faster than glBegin/glEnd with manual trig, I would be happy, but for now it just doesn't seem to be worth the hassle. They would help with porting to GLES (as it doesn't have glBegin/glEnd), but I guess I will make a separate implementation for that.
So is there any way to speed this up without using geometry shaders? I don't really understand them (maybe some great tutorial?), and they could break compatibility with older hardware, so I want to squeeze all the juice I can without using shaders.

So my question is, why is using vertex arrays slower than using glBegin/glEnd while I have read (here even) that it should be faster?
Who says that they are slower?
All you can say is that, for your particular hardware and your current driver, vertex arrays are slower. Have you verified this on other hardware?
More importantly, there is the question of how you are drawing these. Do you draw a single sprite from the vertex array, then draw another, then draw another? Or do you draw all of them with a single glDrawArrays or glDrawElements call?
If you're not drawing all of them in one go (or at least large groups of them at once), then you're not going as fast as you should be.
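For illustration, here is a minimal sketch of what "one go" means, assuming all sprite quads have already been transformed on the CPU into a single interleaved client-side array (the SpriteVertex layout and function name are made up for the example):

    #include <GL/gl.h>
    #include <vector>

    // Hypothetical per-vertex layout: 2 floats position, 2 floats texcoord.
    struct SpriteVertex { float x, y, u, v; };

    // Draw every sprite with a single call instead of one call per sprite.
    void drawAllSprites(const std::vector<SpriteVertex>& verts)
    {
        if (verts.empty()) return;

        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);

        glVertexPointer(2, GL_FLOAT, sizeof(SpriteVertex), &verts[0].x);
        glTexCoordPointer(2, GL_FLOAT, sizeof(SpriteVertex), &verts[0].u);

        // One draw call for all sprites (4 vertices per quad).
        glDrawArrays(GL_QUADS, 0, static_cast<GLsizei>(verts.size()));

        glDisableClientState(GL_TEXTURE_COORD_ARRAY);
        glDisableClientState(GL_VERTEX_ARRAY);
    }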
And why is using your own trigonometry (which in my case is 5 cos, 5 sin, more than 5 divisions, 15 multiplications and about 10 additions/subtractions) faster than using 5 functions (glPushMatrix(), glTranslated(), glRotated(), glScaled(), glPopMatrix())? I thought they are done on the GPU so it should be A LOT faster.
Well, let's think about this. glPushMatrix costs nothing. glTranslated creates a double-precision floating-point matrix and then does a matrix multiply. glRotated does at least one sin and one cos, does some additions and subtractions to compute a matrix (all in double-precision), and then does a matrix multiply. glScaled computes a matrix and does a matrix multiply.
Each "does a matrix multiply" consists of 16 floating-point multiplies and 12 floating-point adds. And since you asked for double-precision math, you can forget about SSE vector math or whatever; this is doing standard math. And you're doing 3 of these for every point.
What happens on the GPU is the multiplication of that matrix with the vertex positions. And since you're only passing 4 positions before changing the matrix, it's not particularly surprising that this is slower.

Have you considered using point sprites (GL_POINTS, with GL_POINT_SPRITE enabled) instead? This is kind of what they were designed for, depending on which version of OpenGL you are supporting.
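If you go that route, the setup is roughly as follows (a sketch using the legacy fixed-function point-sprite API; exact requirements depend on your GL version, and a texture is assumed to be bound):

    #include <GL/gl.h>

    // Render one screen-aligned textured quad per point, sized by glPointSize.
    void drawParticlesAsPoints(const float* xy, int count)
    {
        glEnable(GL_POINT_SPRITE);                      // promote points to textured quads
        glTexEnvi(GL_POINT_SPRITE, GL_COORD_REPLACE, GL_TRUE);
        glPointSize(16.0f);                             // particle size in pixels

        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(2, GL_FLOAT, 0, xy);
        glDrawArrays(GL_POINTS, 0, count);              // one vertex per particle
        glDisableClientState(GL_VERTEX_ARRAY);

        glDisable(GL_POINT_SPRITE);
    }

One limitation to note: without shaders, point sprites cannot be rotated per particle, so they fit round or symmetric particles best.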

Have you tried VBOs instead? They're the current standard, so most cards are optimized in their favor.
Also:
you should use your own math calculations
consider offloading as much calculation as possible to a shader
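A minimal sketch of the VBO suggestion above, assuming GL 1.5+ (or the ARB_vertex_buffer_object entry points) is available through a loader; the two-float vertex layout is just an example:

    #include <GL/glew.h>   // any loader providing the buffer-object entry points works

    GLuint vbo = 0;

    // Upload once, or re-upload each frame with GL_STREAM_DRAW for dynamic sprites.
    void createSpriteBuffer(const float* data, GLsizeiptr bytes)
    {
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_STREAM_DRAW);
    }

    // Draw from the buffer; the "pointer" argument is now a byte offset into the VBO.
    void drawSpriteBuffer(GLsizei vertexCount)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(2, GL_FLOAT, 0, static_cast<const void*>(0));
        glDrawArrays(GL_QUADS, 0, vertexCount);
        glDisableClientState(GL_VERTEX_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }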
The fps numbers you posted are contrary to what one might expect -- you are probably doing something wrong. Can you paste some of your rendering code?

Do you have a specific reason to use double precision matrix functions? They are usually a lot slower than single precision ones.

Related

OpenGL: Are degenerate triangles in a Triangle Strip acceptable outside of OpenGL-ES?

In this tutorial for OpenGL ES, techniques for optimizing models are explained and one of those is to use triangle strips to define your mesh, using "degenerate" triangles to end one strip and begin another without ending the primitive. http://www.learnopengles.com/tag/degenerate-triangles/
However, this guide is very specific to mobile platforms, and I wanted to know if this technique holds for modern desktop hardware. Specifically, would it hurt? Would it cause graphical artifacts or degrade performance (as opposed to splitting the strips into separate primitives)?
If it causes no artifacts and performs at least as well, I aim to use it solely because it makes organizing vertices in a certain mesh I want to draw easier.
Degenerate triangles work pretty well on all platforms. I'm aware of an old fixed-function console that struggled with degenerate triangles, but anything vaguely modern will be fine. Reducing the number of draw calls is always good and I would certainly use degenerates rather than multiple calls to glDrawArrays.
However, an alternative that usually performs better is indexed draws of triangle lists. With a triangle list you have a lot of flexibility to reorder the triangles to take maximum advantage of the post-transform cache. The post-transform cache is a hardware cache of the last few vertices that went through the vertex shader; the GPU can spot that you've re-issued the same vertex and skip the entire vertex shader for that vertex.
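For reference, an indexed triangle-list draw is just a glDrawElements call over a reordered index array (a sketch; the reordering itself would be done offline, e.g. with a vertex-cache optimizer such as Tom Forsyth's algorithm):

    #include <GL/gl.h>

    // 'indices' is assumed to be pre-sorted for post-transform cache locality.
    void drawIndexedMesh(const float* positions, const unsigned short* indices,
                         int indexCount)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, positions);

        // Re-used indices hit the post-transform cache instead of running
        // the vertex shader again.
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);

        glDisableClientState(GL_VERTEX_ARRAY);
    }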
In addition to the above answers (no, it shouldn't hurt at all unless you do something mad in terms of the ratio of real triangles to degenerates), also note that newer versions of the OpenGL and OpenGL ES APIs (3.x or higher) support a means to insert breaks into index lists without needing an actual degenerate triangle, called primitive restart.
https://www.khronos.org/opengles/sdk/docs/man3/html/glEnable.xhtml
When enabled, you can encode the maximum value of the index type (e.g. 0xFFFF for GL_UNSIGNED_SHORT) as a restart marker; when the GPU detects it, it starts building a new triangle strip from the next index value.
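On desktop OpenGL the equivalent looks roughly like this (a sketch; GL 3.1+ and client-side indices for brevity, whereas OpenGL ES 3.0 instead uses GL_PRIMITIVE_RESTART_FIXED_INDEX with the index type's maximum value):

    #include <GL/glew.h>
    #include <vector>

    const GLushort RESTART = 0xFFFF;   // marker separating the strips

    void drawStripsWithRestart(const std::vector<GLushort>& indices)
    {
        glEnable(GL_PRIMITIVE_RESTART);        // core since GL 3.1
        glPrimitiveRestartIndex(RESTART);      // desktop GL lets you pick the marker

        // Whenever RESTART is encountered, the GPU ends the current strip and
        // starts a new one at the next index -- no degenerate triangles needed.
        glDrawElements(GL_TRIANGLE_STRIP,
                       static_cast<GLsizei>(indices.size()),
                       GL_UNSIGNED_SHORT, indices.data());

        glDisable(GL_PRIMITIVE_RESTART);
    }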
It will not cause artifacts. As to "degrading performance"... relative to what? Relative to a random assortment of triangles with no indexing? Yes, it will be faster than that.
But there are plenty of other things one can do. For example, primitive restarting, which removes the need for degenerate triangles. Then there's using ordered lists of triangles for improved cache coherency. Will triangle strips be faster than that?
It rather depends on what you're rendering, how expensive your vertex shaders are, and various other things.
But at the end of the day, if you care about maximum performance on particular platforms, then you should profile for each platform and pick the vertex data based on what platform you're running on. If performance is really that important to you, then you're going to have to put forth some effort.

Depth vs Position

I've been reading about reconstructing a fragment's position in world space from a depth buffer, but I was thinking about storing position in a high-precision three channel position buffer. Would doing this be faster than unpacking the position from a depth buffer? What is the cost of reconstructing position from depth?
This question is essentially unanswerable for two reasons:
There are several ways of "reconstructing position from depth", with different performance characteristics.
It is very hardware-dependent.
The last point is important. You're essentially comparing the performance of a texture fetch of a GL_RGBA16F (at a minimum) to the performance of a GL_DEPTH24_STENCIL8 fetch followed by some ALU computations. Basically, you're asking if the cost of fetching an additional 32 bits per fragment (the difference between the 24x8 fetch and the RGBA16F fetch) is equivalent to the ALU computations.
That's going to change with various things. The performance of fetching memory, texture cache sizes, and so forth will all have an effect on texture fetch performance. And the speed of ALUs depends on how many are going to be in-flight at once (ie: number of shading units), as well as clock speeds and so forth.
In short, there are far too many variables here to know an answer a priori.
That being said, consider history.
In the earliest days of shaders, back in the GeForce 3 days, people would need to re-normalize a normal passed from the vertex shader. They did this by using a cubemap, not by doing math computations on the normal. Why? Because it was faster.
Today, there's pretty much no common programmable GPU hardware, in the desktop or mobile spaces, where a cubemap texture fetch is faster than a dot-product, reciprocal square-root, and a vector multiply. Computational performance in the long-run outstrips memory access performance.
So I'd suggest going with history and finding a quick means of computing it in your shader.
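For what it's worth, the ALU work in question is small; here is a CPU-side sketch of one common reconstruction (undoing the projection with the inverse projection matrix; in a real renderer this runs per fragment in the shader, and the types here are stand-ins):

    struct Vec4 { float x, y, z, w; };

    // Multiply a column-major 4x4 matrix (e.g. the inverse projection) by a vector.
    static Vec4 mul(const float m[16], Vec4 v)
    {
        return { m[0]*v.x + m[4]*v.y + m[8]*v.z  + m[12]*v.w,
                 m[1]*v.x + m[5]*v.y + m[9]*v.z  + m[13]*v.w,
                 m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14]*v.w,
                 m[3]*v.x + m[7]*v.y + m[11]*v.z + m[15]*v.w };
    }

    // u, v: fragment coordinate in [0,1]; depth: value read from the depth buffer in [0,1].
    // Returns the view-space position.
    Vec4 positionFromDepth(float u, float v, float depth, const float invProj[16])
    {
        // Back to normalized device coordinates in [-1,1].
        Vec4 ndc  = { u * 2.0f - 1.0f, v * 2.0f - 1.0f, depth * 2.0f - 1.0f, 1.0f };
        Vec4 view = mul(invProj, ndc);
        view.x /= view.w;  view.y /= view.w;  view.z /= view.w;  view.w = 1.0f;  // perspective divide
        return view;
    }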

Why Do I Need to Convert Quaternion to 4x4 Matrix When Uploading to the Shaders?

I have read several tutorials about skeletal animation in OpenGL, and they all seem to be single-minded about using quaternions for rotation and a 3D vector for translation, rather than matrices.
But when they come to the vertex skinning process, they combine all of the quaternions and 3D vectors into 4x4 matrices and upload those to do the rest of the calculations in shaders. 4x4 matrices have 16 elements, while a quaternion plus a 3D vector has only 7. So why are we converting these to 4x4 matrices before uploading?
Because with only two 4×4 matrices, one for each bone a vertex is assigned and weighted to, you have to do only two 4-vector × 4×4-matrix multiplications and a weighted sum.
In contrast, if you submitted a separate quaternion and translation, you'd have to do the equivalent of two 3-vector × 3×3-matrix multiplications plus four 3-vector additions and a weighted sum. Either you convert your quaternion into a rotation matrix first and then do a 3-vector × 3×3-matrix multiplication, or you do direct 3-vector quaternion multiplication; the computational effort is about the same. And after that you have to postmultiply with the modelview matrix.
It's perfectly possible to use a 4-element vector uniform as a quaternion, but then you have to chain a lot of computations in the vertex shader: first rotate the vertex by the two quaternions, then translate it, and then multiply it with the modelview matrix. By simply uploading two transformation matrices which are weighted in the shader, you save a lot of computations on the GPU. Doing the quaternion-to-matrix conversion on the CPU performs the calculation only one time per bone, whereas doing it in the shader performs it for each single vertex. GPUs are great if you have to do a lot of identical computations with varying input data. But they suck if you have to calculate only a handful of values which are reused over large amounts of data. CPUs, however, love this kind of task.
The nice thing about homogeneous transformations represented by 4×4 matrices is that a single matrix can contain a whole transformation chain. If you separate rotations and translations, you have to perform the whole chain of operations in order. With only one rotation and one translation it's fewer operations than a single 4×4 matrix transform; add one more transformation and you've reached the break-even point.
The transformation matrices, even in a skeletal pose applied to a mesh, are identical for all vertices. Say the mesh has 100 vertices around a pair of bones (this is a small number, BTW); then you'd have to do the computations outlined above for each and every vertex, wasting precious GPU computation cycles. And for what? To determine some 32 scalar values (or 8 4-vectors). Now compare this: 100 4-vectors (if you only consider vertex position) vs. only 8. This is the order of magnitude of calculation overhead imposed by processing quaternion poses in the shader. Compute it once on the CPU and give it to the GPU precalculated, to share among the primitives. If you code it right, the whole calculation of a single matrix column will fit nicely into the CPU's pipeline, making it vastly outperform every attempt at parallelizing it. Parallelization doesn't come for free!
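The CPU-side conversion being described is a one-off per bone: build the 4×4 matrix from the unit quaternion and the translation, then upload it (a sketch, column-major so it can go straight into glUniformMatrix4fv; the function name is illustrative):

    // q = (x, y, z, w) is a unit quaternion, t is the bone translation.
    // out is a column-major 4x4 matrix suitable for glUniformMatrix4fv.
    void boneToMatrix(const float q[4], const float t[3], float out[16])
    {
        const float x = q[0], y = q[1], z = q[2], w = q[3];

        // Rotation part: the standard quaternion-to-matrix formula.
        out[0] = 1 - 2*(y*y + z*z);  out[4] = 2*(x*y - z*w);      out[8]  = 2*(x*z + y*w);      out[12] = t[0];
        out[1] = 2*(x*y + z*w);      out[5] = 1 - 2*(x*x + z*z);  out[9]  = 2*(y*z - x*w);      out[13] = t[1];
        out[2] = 2*(x*z - y*w);      out[6] = 2*(y*z + x*w);      out[10] = 1 - 2*(x*x + y*y);  out[14] = t[2];
        out[3] = 0;                  out[7] = 0;                  out[11] = 0;                  out[15] = 1;
    }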
In modern GPUs there is no restriction to what data format you upload to constant buffers.
Of course you need to write your vertex shader differently in order to use quaternions for skinning instead of matrices. In fact, we are using dual quaternion skinning in our engine.
Note that older fixed function hardware skinning indeed only worked with matrices, but that was a long time ago.

Efficiency when rearranging arrays in OpenGL

Imagine I have a vertex array and an index array. Suppose it is possible to rearrange the elements in the index array so that a single call to glDrawElements with GL_TRIANGLE_STRIP draws the desired figure. Another possibility would be to call glDrawElements with GL_TRIANGLES, but this would make the index array longer.
Does it really matter in terms of efficiency (I mean real efficiency, not micro-optimizations) which way I choose, or is the underlying routine the same anyway?
A side note: the reason I am reluctant to rearrange my elements to use GL_TRIANGLE_STRIP is that I think the triangles in the strip will have alternating winding. Am I wrong?
There's no real performance difference between a call with GL_TRIANGLE_STRIP and one with GL_TRIANGLES.
Now if you can rearrange your indices for maximizing post-transform vertex cache, you can have huge performance gains. I did so years ago for rendering large terrain patches and observed 10 to 15 times FPS speedups (using some kind of Hilbert curve scheme).
I think GL_TRIANGLE_STRIP and GL_TRIANGLES are about equally efficient, if your indices are ordered in a way that maximizes the post-transform vertex cache. Of course the triangle list will take more memory to store.
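To make the size difference concrete, here are the two index layouts for a single quad (two triangles); the vertex numbering is illustrative:

    #include <GL/gl.h>

    // The same quad drawn two ways; the list simply spells out each triangle.
    static const GLushort stripIndices[] = { 0, 1, 2, 3 };          // 4 indices
    static const GLushort listIndices[]  = { 0, 1, 2,  2, 1, 3 };   // 6 indices

    void drawAsStrip() { glDrawElements(GL_TRIANGLE_STRIP, 4, GL_UNSIGNED_SHORT, stripIndices); }
    void drawAsList()  { glDrawElements(GL_TRIANGLES,      6, GL_UNSIGNED_SHORT, listIndices);  }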
There's probably not much of a performance difference. But it's still recommended that you not use either method in the main render loop. Instead use display lists. I've converted some of my OpenGL code to display lists, and it's way faster because you end up cutting out a ton of CPU->GPU communication.
There's a tutorial here: http://www.lighthouse3d.com/opengl/displaylists/
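A minimal display-list sketch along the lines of that tutorial (legacy API, removed from the core profile in GL 3.1+; the geometry here is just a placeholder):

    #include <GL/gl.h>

    GLuint listId = 0;

    // Record the geometry once; the driver can store it in a GPU-friendly form.
    void buildList()
    {
        listId = glGenLists(1);
        glNewList(listId, GL_COMPILE);
        glBegin(GL_TRIANGLES);
            glVertex3f(0.0f, 0.0f, 0.0f);
            glVertex3f(1.0f, 0.0f, 0.0f);
            glVertex3f(0.0f, 1.0f, 0.0f);
        glEnd();
        glEndList();
    }

    // Replay it each frame with a single call.
    void drawList() { glCallList(listId); }

Note that display lists are static once compiled, so they suit geometry that does not change every frame.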
It all depends on what your driver has to do with your vertex data. If it has to do some processing (turn QUADS into TRIANGLES, like it usually has to do) then it will not be optimal. The only way to see what is optimal for your driver is to measure. Find an OpenGL benchmark and see which vertex primitives are optimal for your driver.
Since you want to compare GL_TRIANGLE_STRIP and GL_TRIANGLES, most likely you will find the performance loss to be minimal in this case.

Power of two textures

Can you explain why, for a long time, hardware acceleration required textures to be a power of two? On PCs, since the GeForce 6 we have had NPOT textures with no mipmaps and simplified filtering. OpenGL ES 2.0 also supports NPOT textures without mipmaps, etc. What is the hardware restriction behind this? Just simplified arithmetic?
I imagine it has to do with being able to use bitwise shift-left operations, rather than multiplication, to convert an (x, y) coordinate to a memory offset from the start of the texture. So yes, simplified arithmetic, from a processor point of view.
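In concrete terms, the address computation for a power-of-two texture reduces to shifts and bitwise ANDs (a CPU-side illustration only; real hardware also tiles/swizzles the layout):

    #include <cstdint>

    // Power-of-two: y * width + x becomes a shift, and GL_REPEAT wrapping
    // becomes a mask instead of a modulo.
    uint32_t texelOffsetPow2(uint32_t x, uint32_t y, uint32_t log2W, uint32_t log2H)
    {
        uint32_t wMask = (1u << log2W) - 1u;   // e.g. width 256 -> mask 0xFF
        uint32_t hMask = (1u << log2H) - 1u;
        return ((y & hMask) << log2W) | (x & wMask);
    }

    // General (NPOT) form: needs a multiply and real modulos for wrapping.
    uint32_t texelOffsetNpot(uint32_t x, uint32_t y, uint32_t width, uint32_t height)
    {
        return (y % height) * width + (x % width);
    }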
I'm guessing that it was to make mipmap generation easier, because it allows you to just average 2x2 pixels into one pixel all the way from NxN down to 1x1.
Now that doesn't matter if you're not using mipmapping, but it's easier to have just one rule, and I think that mipmapping was the more common use case.
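To illustrate the mipmapping point: for a power-of-two texture every level is exactly half the previous one until 1x1, so the whole chain falls out of repeatedly averaging 2x2 blocks (a sketch for a square, single-channel texture):

    #include <cstdint>
    #include <vector>

    // Build the next mip level by averaging each 2x2 block into one texel.
    std::vector<uint8_t> downsample(const std::vector<uint8_t>& src, int size)
    {
        const int half = size / 2;
        std::vector<uint8_t> dst(half * half);
        for (int y = 0; y < half; ++y)
            for (int x = 0; x < half; ++x)
            {
                int sum = src[(2*y)   * size + 2*x] + src[(2*y)   * size + 2*x + 1]
                        + src[(2*y+1) * size + 2*x] + src[(2*y+1) * size + 2*x + 1];
                dst[y * half + x] = static_cast<uint8_t>(sum / 4);
            }
        return dst;   // call repeatedly: size, size/2, ... down to 1x1
    }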