texture(...) vs. textureOffset(...) performance in GLSL - OpenGL

Does using textureOffset(...) increase performance compared to calculating offsets manually and using the regular texture(...) function?
Since there is a GL_MAX_PROGRAM_TEXEL_OFFSET property, I would guess that it can fetch the offset texels in a single fetch, or at least in as few fetches as possible, which would make it superb for, for example, blurring effects. But I can't seem to find out anywhere how it works internally.
Update:
Reformulating the question: is it common for GL drivers to optimize texture fetches when the textureOffset(...) function is used?

You're asking the wrong question. The question should not be whether the more specific function will always have better performance. The question is whether the more specific function will ever be slower.
And there's no reason to expect it to be slower. If the hardware has no specialized functionality for offset texture accesses, then the compiler will just offset the texture coordinate manually, exactly like you could. If there is hardware to help, then it will use it.
So if you need textureOffset and can live within its limitations, there's no reason not to use it.
I would guess that it can fetch the offset texels in a single fetch, or at least in as few fetches as possible, which would make it superb for, for example, blurring effects
No, that's textureGather. textureOffset does exactly what its name says: it accesses a texture based on a texture coordinate, with a texel offset from that coordinate's location.
textureGather samples from multiple neighboring texels at once. If you need to read a section of a texture to do blurring, textureGather (and textureGatherOffset) are going to be more useful than textureOffset.
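To make the distinction concrete, here is a minimal fragment-shader sketch contrasting the three kinds of fetch (assuming a GLSL 4.00 context, where textureGather is core; the sampler and variable names are just placeholders):

    #version 400 core

    uniform sampler2D tex;
    in vec2 uv;
    out vec4 fragColor;

    void main() {
        // Manual offset: shift the coordinate yourself by one texel.
        vec2 texelSize = 1.0 / vec2(textureSize(tex, 0));
        vec4 manual = texture(tex, uv + vec2(1.0, 0.0) * texelSize);

        // textureOffset: the same single fetch, but the offset is a
        // compile-time constant in texels, bounded by
        // GL_MIN/MAX_PROGRAM_TEXEL_OFFSET.
        vec4 viaOffset = textureOffset(tex, uv, ivec2(1, 0));

        // textureGather: the four texels that bilinear filtering would
        // read, returned at once, one channel per call (0 = red).
        vec4 reds = textureGather(tex, uv, 0);

        // Combine arbitrarily, just so all three results are used.
        fragColor = vec4(0.5 * (manual.rgb + viaOffset.rgb),
                         dot(reds, vec4(0.25)));
    }

Each of the first two performs one fetch; only textureGather reads a 2x2 footprint in a single call, which is what makes it attractive for blurs.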

Related

What is the proper way to use the buffer's content on the CPU side in OpenGL

Let's imagine that I have two squares. First I generate the VAO and VBO, then bind them, and so on... My goal is to check for collision between the two objects every frame. In this case, I have to know the exact vertices on both the CPU and the GPU side, so I store every single vertex twice. If I work with a large amount of data, this mirroring does not seem efficient, not to mention the logistics of keeping the data consistent. Is there a better way to do this? Or is it totally OK to keep the vertices in an array after the glBufferData call?
We need more information from you. Is this something you plan on instancing? Are you sure there's a bottleneck on your bandwidth?
If this is something that you can simulate on the GPU, then just do it all on the GPU, so you keep the memory on that side and avoid the transfer penalty from CPU to GPU.
If you need it to be on the CPU side for your collision detection, then you have a few options:
Update the ones that change. If all of them are changing, ignore this option; otherwise you could map the buffer, update it, and flush only the ranges you've actually changed (see the sketch after this list).
Send a displacement. If you end up having a ton of data, you may be able to get away with sending just a rotation and a central position instead of updating "every vertex", and you might be able to exploit a Geometry Shader. However, I've read that these can be problematic for performance, so consider it but be ready to profile.
You can possibly stream data if you have to update all of them; see this wonderful resource.
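For the first option, a minimal sketch of the map-and-flush pattern, assuming a GL 3.x context and a loader header already set up (function and parameter names are illustrative):

    #include <cstring>  // std::memcpy

    // Overwrite one changed sub-range of a VBO, flushing only that range.
    void updateChangedRange(GLuint vbo, GLintptr offset, GLsizeiptr size,
                            const void* newData)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, size,
                                     GL_MAP_WRITE_BIT |
                                     GL_MAP_FLUSH_EXPLICIT_BIT);
        if (ptr) {
            std::memcpy(ptr, newData, static_cast<size_t>(size));
            // The flush offset is relative to the start of the mapped range.
            glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, size);
            glUnmapBuffer(GL_ARRAY_BUFFER);
        }
    }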
You need to define your problem domain a bit more because I'm not sure exactly what the bounds on the problem are. The above are some ways of tackling these problems, but the best solution can only be given to you if you are more specific with what you want when you talk about the large case.
You also have to understand that massive data manipulation and fast transfer are goals that tend to fight each other, and you'll have to be smarter about what you plan on doing depending on exactly how much data you are talking about here.
I'd like to answer with something more concrete but I'm just shooting in the dark because I don't know exactly what the limit of your data is and what hardware you're working with.

Multiple buffers vs single buffer?

I'm drawing simple 3D shapes, and I was wondering: in the long run, is it better to use just one buffer to store all of your vertex data?
Right now I have arrays of vertex data (positions and colors, per vertex) and I am pushing them to their own separate buffers.
But if I used stride and offset, I could join them into one array, though that would become messier and harder to manage.
What is the "traditional" way of doing this?
It feels much cleaner and organized to have separate buffers for each piece of data, but I would imagine it's less efficient.
Is the efficiency increase worth putting it all into a single buffer?
The answer to this is highly usage-dependent.
If all your vertex attributes are highly volatile or highly static, you would probably benefit from interleaving and keeping them all together, as mentioned in the comments.
However, separating the data can yield better performance if one attribute is far more volatile than others. For example, if you have a mesh where you're often changing the vertex positions, but never the texture coordinates, you might benefit from keeping them separate: you would only need to re-upload the positions to the video card, instead of the whole set of attributes. An example of this might be a CPU-driven cloth simulation.
It is also hardware and implementation dependent. Interleaving isn't helpful everywhere, but I've never heard of it having a negative impact. If you can use it, you probably should.
However, since you can't properly interleave if you split the attributes, you're essentially comparing the performance impacts of two unknowns: will interleaving help on your target hardware/drivers, and will your data benefit from being split? There's no real answer to the first; the second is between you and your data.
Personally, I would suggest just using interleaved single blocks of vertex attributes unless you have a highly specialized need. It cuts the complexity, as opposed to needing to have potentially different systems mixed together in the same back end.
On the other hand, setting up interleaving is a rather involved task as far as memory addressing in C++ goes. If you're not designing an entire graphics engine from scratch, I really doubt it's worth the effort for you. But again, that's up to you and your application.
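For what it's worth, the memory addressing usually boils down to a struct, sizeof, and offsetof. A minimal sketch, assuming a GL loader header is included (attribute locations and names are illustrative):

    #include <cstddef>  // offsetof

    struct Vertex {
        float position[3];
        float color[4];
    };

    void setupInterleaved(GLuint vbo, const Vertex* vertices, GLsizei count)
    {
        // One VBO holding interleaved data: the stride is sizeof(Vertex),
        // and each attribute's byte offset comes from offsetof.
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, count * sizeof(Vertex), vertices,
                     GL_STATIC_DRAW);

        glEnableVertexAttribArray(0);  // position at location 0
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                              (const void*)offsetof(Vertex, position));

        glEnableVertexAttribArray(1);  // color at location 1
        glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                              (const void*)offsetof(Vertex, color));
    }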
In theory, though, merely grouping together the data you were going to upload to the video card regardless should have little impact. It might be slightly more efficient to group all the attributes together due to reducing the number of calls, but that's again going to be highly driver-dependent.
Unfortunately, I think the simple answer to your question boils down to: "it depends" and "no one really knows".

glUniform vs. single draw call performance

Suppose I have many meshes I'd like to render. I have two choices:
1. Bake transforms and colors for each mesh into a VBO and render with a single draw call.
2. Use glUniform for transforms and colors and use many draw calls (but still a single VBO).
Assuming the scene changes very little between frames, which method tends to be better?
There are more than those two choices. At least one more comes to mind:
3. Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is generally faster than setting uniforms. The efficiency of setting uniforms is highly platform-dependent, but they're generally not intended to be modified very frequently. They are called uniform for a reason. :)
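A hedged sketch of choice 3 (the Mesh struct and the choice of attribute location 2 are made up for illustration; a GL loader header is assumed): when no array is enabled for an attribute, the "current" value set by glVertexAttrib4f applies to every vertex of the draw, so it acts as a cheap per-draw constant.

    #include <vector>

    struct Mesh {
        float r, g, b, a;   // per-mesh color
        GLint first;        // first vertex in the shared VBO
        GLsizei count;      // number of vertices
    };

    void drawMeshes(const std::vector<Mesh>& meshes)
    {
        // Attribute 2 carries the per-mesh color; no array is bound for it.
        glDisableVertexAttribArray(2);
        for (const Mesh& m : meshes) {
            glVertexAttrib4f(2, m.r, m.g, m.b, m.a);
            glDrawArrays(GL_TRIANGLES, m.first, m.count);
        }
    }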
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
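A sketch of that split, assuming a GL loader header is included (names and parameters are illustrative):

    // Static attributes uploaded once; volatile ones refreshed per frame.
    void createSplitBuffers(GLuint& staticVbo, GLuint& dynamicVbo,
                            GLsizeiptr staticBytes, const void* staticData,
                            GLsizeiptr dynamicBytes)
    {
        glGenBuffers(1, &staticVbo);
        glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
        glBufferData(GL_ARRAY_BUFFER, staticBytes, staticData,
                     GL_STATIC_DRAW);   // uploaded once, never touched again

        glGenBuffers(1, &dynamicVbo);
        glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
        glBufferData(GL_ARRAY_BUFFER, dynamicBytes, nullptr,
                     GL_DYNAMIC_DRAW);  // allocated now, filled every frame
    }

    // Each frame, overwrite just the volatile attributes in place.
    void updateDynamic(GLuint dynamicVbo, GLsizeiptr bytes, const void* data)
    {
        glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);
    }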
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.
Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (assuming relevant for opengl-es tag). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If moving information that would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay for reading redundant data (it doesn't really vary per vertex, but may cost memory bandwidth) might negate the savings of fewer draw calls.
As always the best (but tiresome) advice is to test (I know this is particularly hard developing for mobile where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare different options.

What is the most efficient and correct way to draw many different dynamic 3D models (they are animated and change every frame)?

I need to know how I can render many different 3D models which change their geometry every frame (animated models), without repeating models or textures.
I load all the models, and for each one I create an "object" model class.
What is the most efficient way to render them?
Use one VBO for each 3D model
Use a single VBO for all models (since they are all different, I don't see how this option is possible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL;DR - there's no silver bullet when it comes to rendering performance.
Why is that? Because of the complicated process that takes your data, converts it, pushes it to the GPU, and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that usually improve performance:
Keep all the necessary data on the GPU (because the closer to the screen, the shorter the way the electrons have to go :))
Send as little data to the GPU between frames as possible.
Don't sync needlessly between the CPU and GPU (that's like trying to run two high-speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while).
Now, it's obvious that if you want to have a model that will change, you can't have your cake and eat it too. You have to make tradeoffs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Hint the GPU about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW) - that encourages the driver to choose a suitable memory arrangement (see the streaming sketch after this list).
Don't use interleaved buffers to mix static vertex attributes with dynamic ones - if you separate the memory, you can batch-update the geometry while leaving the texture coordinates intact, for example.
Try to do as much as you can purely on the GPU - with compute shaders and transform feedback, it may well be possible to store the whole animation data in a buffer and calculate it on the GPU, avoiding expensive syncs.
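To illustrate the usage-hint point from the list above, here is the classic streaming pattern for per-frame vertex data, as a sketch with illustrative names: it first "orphans" the old storage so the driver doesn't have to stall on the previous frame.

    // Re-upload a fully animated mesh each frame into a GL_STREAM_DRAW VBO.
    void streamVertices(GLuint vbo, GLsizeiptr bytes, const void* frameData)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // Orphan: request fresh storage so the GPU can keep reading the old
        // copy while we fill the new one (no CPU/GPU sync point).
        glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, frameData);
    }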
And last but not least, always carefully measure the impact of your changes on performance. Going in blindly won't help. Measure accurately and thoroughly (even things like shader compilation time might matter sometimes!). Then, even if you go by trial and error, there's hope you'll get somewhere.
And to address one of your points in particular: whether it's one large VBO or a few smaller ones doesn't really matter, but a huge one might have problems fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside of it.

GPGPU - effective ping-pong technique?

I'm trying to implement an efficient fluid solver on the GPU using WebGL and GLSL shader programming.
I've found an interesting article:
http://http.developer.nvidia.com/GPUGems/gpugems_ch38.html
See: 38.3.2 Slab Operations
I'm wondering if this technique of enforcing boundary conditions is possible with ping-pong rendering?
If I render only lines, what about the interior of the texture?
I've always assumed that the whole input texture must be copied to a temporary texture (of course, the boundary is updated during that process), as the two are swapped after that operation.
This is especially interesting considering that Example 38-5, "The Boundary Condition Fragment Program" (visualization: http://i.stack.imgur.com/M4Hih.jpg), shows a scheme that, IMHO, requires the ping-pong technique.
What do you think? Do I misunderstand something?
Generally, I've found that texture writes are extremely costly, which is why I would like to limit them somehow. Unfortunately, the ping-pong technique requires a lot of texture writes.
I've actually implemented the technique described in that chapter, using framebuffer objects as the render-to-texture method (but in desktop OpenGL, since WebGL didn't exist at the time), so it's definitely possible. Unfortunately, I don't believe I have the code any more, but if you tag any future questions you have with [webgl], I'll see if I can provide some help.
You will need to ping-pong several times per frame (the article mentions five steps, but I seem to recall the exact number depends on the quality of the simulation you want and on your exact boundary conditions). Using FBOs is quite a bit more efficient than it was when this was written (the author mentions using a GeForce FX 5950, which was a while ago), so I wouldn't worry about the overhead he mentions in the article. As long as you aren't bringing data back to the CPU, you shouldn't find too high a cost for switching between texture and the framebuffer.
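The ping-pong loop itself is tiny. A hedged sketch in desktop C++/OpenGL (the FBO/texture pairs and the drawFullscreenQuad() helper are hypothetical):

    #include <utility>  // std::swap

    // Assumed helper: draws a viewport-covering quad with the solver
    // shader bound.
    void drawFullscreenQuad();

    // Two textures, each attached as the color buffer of its own FBO.
    // Every pass reads the previous result and writes the next one.
    void runSimulationPasses(GLuint fbo[2], GLuint tex[2], int numPasses)
    {
        int readIdx = 0, writeIdx = 1;
        for (int pass = 0; pass < numPasses; ++pass) {
            glBindFramebuffer(GL_FRAMEBUFFER, fbo[writeIdx]);
            glBindTexture(GL_TEXTURE_2D, tex[readIdx]);  // previous result
            drawFullscreenQuad();   // runs one solver step
            std::swap(readIdx, writeIdx);                // ping-pong
        }
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }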
You will have some leakage if your boundaries are only a pixel thick, but that may or may not be acceptable depending on how you render your results and the velocity of your fluid. Making the boundaries thicker may help, and there are papers that have been written since this one that explore different ways of confining the fluid within boundaries (I also recall a few on more efficient diffusion/pressure solvers that you might check out after you have this version working...you'll find some interesting follow ups if you search for papers that cite the GPU gems article on google scholar).
Addendum: I'm not sure I entirely understand your question about boundaries. The key is that you must run a shader at each pixel of what you want to be a boundary, but it doesn't really matter how that pixel gets there, whether it's drawn with lines, points, or triangles (as long as its inputs are correct).
In the very general case (which might not apply if you only have a limited number of boundary primitives), you will likely have to draw a framebuffer-covering quad, since the interactions with the velocity and pressure fields are more complicated (any surrounding pixel could be another boundary pixel, instead of having simply defined edges). See section 38.5.4 (Arbitrary Boundaries) for some explanation of how to do it. If something isn't a boundary, you won't touch the vector field, and if it is, instead of hardcoding which directions you want to look in to sum vector values, you'll probably end up testing the surrounding pixels and only summing the ones that aren't boundaries so that you can enforce the boundary conditions.
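To illustrate that last idea, here is a sketch of such a boundary-testing fragment shader. This is a paraphrase of the section 38.5.4 approach, not the chapter's code, and the obstacle-texture encoding is made up; it's written in desktop GLSL 3.30 for brevity (WebGL would need ES syntax):

    #version 330 core

    uniform sampler2D velocity;
    uniform sampler2D obstacles;  // r > 0.5 marks a boundary cell
    in vec2 uv;
    out vec4 fragColor;

    void main() {
        vec2 texel = 1.0 / vec2(textureSize(velocity, 0));
        vec2 dirs[4] = vec2[4](vec2( 1, 0), vec2(-1, 0),
                               vec2( 0, 1), vec2( 0,-1));
        vec2 sum = vec2(0.0);
        float n = 0.0;
        for (int i = 0; i < 4; ++i) {
            vec2 p = uv + dirs[i] * texel;
            if (texture(obstacles, p).r < 0.5) {  // neighbour is fluid
                sum += texture(velocity, p).xy;
                n += 1.0;
            }
        }
        // Fluid cells average their non-boundary neighbours; cells inside
        // an obstacle get zero velocity.
        bool fluid = texture(obstacles, uv).r < 0.5;
        fragColor = (fluid && n > 0.0) ? vec4(sum / n, 0.0, 1.0)
                                       : vec4(0.0);
    }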